Each week we find a new topic for our readers to learn about in our AI Education column.
The AI we’ve been using thus far is like Tinker Toys or Lincoln Logs.
Who’s ready for an erector set?
Artificial intelligence is already changing the world around us so quickly, it’s hard to believe that this is only the beginning—but when we peek at the next page of generative AI, in particular, we see a technology that may profoundly alter the way we live, work and relate to each other as human beings.
Welcome to AI Education, where the erector set we’ll be discussing this week, the next page of generative AI, is real-time video generation: roughly, the ability to create video on the spot from prompts and cues, whether code, natural language text, speech or visual input.
Of course, informed readers will know that this isn’t exactly a next page, as high-quality text-to-video generation is already here—in fact, thanks to models like Sora, Veo, LTX, Kling and Dream Machine, many of us already have access to some pretty sophisticated AI video generation tools. AI actress Tilly Norwood, created by UK AI studio Particle6, is sending shockwaves through the U.S. entertainment industry. Norwood is photorealistic and photogenic—and she’s less likely than a human actress to torpedo a production with a bad press junket. An Indian company has gone a step further and already produced a feature-length movie with AI video generation. We’re talking more science than science fiction.
Like all things generative AI, these tools are likely to improve significantly in the near future, especially if the past 12 months are any indication: a year in which commercial AI providers and labs have delivered significant breakthroughs pushing us toward real-time video generation.
What Is Real-Time Video Generation?
To this point, generative video AI has been riddled with issues. The most commonly reported are ethical in nature—video generation is the technology behind deepfakes, after all—but a less-known problem is that, even now, using AI to generate video from non-visual inputs is heavily resource intensive.
Generating video takes a lot of computational power, much more than generating blocks of text or code, or still images. That translates not just to higher energy consumption, but to more time. While we’re now accustomed to getting text and image responses to our prompts nearly instantly, those of us trying to generate video have had to wait much, much longer. Using today’s consumer-grade desktop hardware, one second of realistic, high-quality video can take twenty-four seconds or more to generate, meaning five seconds of video takes around two minutes to create.
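If you like, the math behind that figure fits in a few lines of Python. The 24-seconds-per-second cost below is an illustrative assumption drawn from the paragraph above, not a benchmark of any particular model:

```python
# Back-of-the-envelope math for the wall-clock cost of video generation.
SECONDS_PER_GENERATED_SECOND = 24  # assumed cost on consumer hardware, not a benchmark

def wall_clock_seconds(video_seconds: float) -> float:
    """How long generating a clip of the given length would take."""
    return video_seconds * SECONDS_PER_GENERATED_SECOND

clip = 5  # seconds of video we want
print(f"A {clip}s clip takes ~{wall_clock_seconds(clip) / 60:.0f} minutes to generate.")
# "Real time" means a 1:1 ratio: one second of work per second of video.
print(f"That's {SECONDS_PER_GENERATED_SECOND}x slower than real time.")
```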
Real-time video generation means it takes five seconds of work to create five seconds of video. We do have the ability to generate video in real time as I write this today—some of the models mentioned above can create short animations based on a prompt and a still image—but the quality of their output is still well short of realistic. Our AI models are improving over time, though, in some cases becoming more efficient, and our computing hardware is steadily improving as well. We’re accelerating toward a day when realistic video can be generated in real time.
How Real-Time Video Generation Works
Real-time video generation combines several artificial intelligence technologies. The work of constructing the video itself is usually carried out by a diffusion model, or a combination of them, which we’ve discussed before in AI Education. Natural language processing not only “reads” the instructions in a user’s prompt, it also creates any text or dialogue in the video.
The AI model builds the video frame by frame, taking into account the user’s prompt and information from previous frames. New frames are delivered so fast they appear instantaneous to users, letting people generate their own video on the fly without a camera or recording equipment.
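To make that loop concrete, here is a minimal sketch in Python. The generate_frame stub is hypothetical, and a real system would run a diffusion model there, but the structure (condition each new frame on the prompt and the previous frame, and finish each frame within its 1/24th-of-a-second budget) captures the idea:

```python
import time
import numpy as np

FPS = 24                 # frames per second of the output video
HEIGHT, WIDTH = 64, 64   # toy resolution; real models work far larger

def generate_frame(prompt: str, prev: np.ndarray | None) -> np.ndarray:
    """Stand-in for one diffusion-model step, conditioned on the user's
    prompt and the previous frame so motion stays coherent."""
    if prev is None:
        return np.random.rand(HEIGHT, WIDTH, 3)  # first frame from scratch
    # Blend the prior frame with new detail so consecutive frames stay consistent.
    return 0.9 * prev + 0.1 * np.random.rand(HEIGHT, WIDTH, 3)

def generate_clip(prompt: str, seconds: float) -> list[np.ndarray]:
    """Build a clip frame by frame, timing each frame against the
    1/FPS budget that real-time generation requires."""
    frames: list[np.ndarray] = []
    prev = None
    budget = 1.0 / FPS
    for _ in range(int(seconds * FPS)):
        start = time.perf_counter()
        prev = generate_frame(prompt, prev)
        frames.append(prev)
        if time.perf_counter() - start > budget:
            print("Fell behind real time on this frame")
    return frames

clip = generate_clip("a dog chasing a ball on a beach", seconds=2)
print(f"Generated {len(clip)} frames at {FPS} fps")
```

The toy stub always keeps up with its budget; the engineering challenge of real-time generation is making an actual diffusion model do the same.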
But Wait, There’s More
So yes, we can make videos in real time, and we’ll be able to make realistic videos in real time at some point in the very near future. Great. That means we’ll be able to use AI video broadly in entertainment—for example, to build a bespoke, personalized entertainment library to access in lieu of streaming or broadcast content. We could use AI to help prototype fast-moving projects in industry. We’ll be able to edit video in real time with our voice, without needing additional tools and skills. And yes, we’ll be able to make some convincing deepfakes, if we wish to take advantage of people.
But AI still needs to catch up in other areas before it can take advantage of the full power of real-time video generation. AI is going to be fooling our eyes long before it masters our brains. For example, AI falls short at creating rounded characters and engaging plots for long-form video in television, movies and games. AI can make individual shots and scenes, yes, but it struggles to weave dozens of scenes or hundreds of shots into a coherent whole. For now, a human editor is needed to keep more “AI slop” from being generated.
AI video generation, because it is so time-consuming, still struggles with latency: lagging performance and slow processing mean that high-quality video generation can’t be used for many augmented reality applications. Nor can we edit our videos on the fly. Yet. True real-time video generation means we can manipulate our videos, and the moving images inside them, how we want, any time we want. This is coming.
AI isn’t yet good at picking up context in longer conversations, or at reading subtler visual cues like body language. As a result, using real-time video generation to power an interactive AI agent, one we could have a back-and-forth discussion with on a Zoom call, for example, isn’t ready for prime time. That said, scammers are already using AI-generated video chat to take advantage of people. There will come a time when the pleasant figure you assume is talking into a webcam on a distant computer is, more likely than not, an AI agent that looks and sounds like an in-the-flesh person.