Text-to-video is the latest AI-generated synthetic media format to make headlines. Meta’s Make-A-Video and Google’s Imagen offer some impressive short video outputs based on natural language prompts. However, the operative word is short.
There is another advanced text-to-video research project, Phenaki, that is focused on creating longer videos. An example of a short text-to-video output is the 6-second video below, created by Google Imagen.
Video credit: Google Imagen
Prompt used:
Drone flythrough of a tropical jungle in snow.
The video clip above is about the extent of what you can get from Imagen today, based on the examples provided to date. Phenaki, by contrast, accepts longer prompts that create a series of disparate scenes. The first video below is 22 seconds long, followed by 2-minute and 2½-minute examples.
Prompts used:
A teddy bear diving in the ocean
A teddy bear emerges from the water
A teddy bear walks on the beach
Camera zooms out to the teddy bear in the campfire by the beach
Video credit: Phenaki
Prompts used:
Lots of traffic in futuristic city. An alien spaceship arrives to the futuristic city. The camera gets inside the alien spaceship. The camera moves forward until showing an astronaut in the blue room. The astronaut is typing in the keyboard. The camera moves away from the astronaut. The astronaut leaves the keyboard and walks to the left. The astronaut leaves the keyboard and walks away. The camera moves beyond the astronaut and looks at the screen. The screen behind the astronaut displays fish swimming in the sea. Crash zoom into the blue fish. We follow the blue fish as it swims in the dark ocean. The camera points up to the sky through the water. The ocean and the coastline of a futuristic city. Crash zoom towards a futuristic skyscraper. The camera zooms into one of the many windows. We are in an office room with empty desks. A lion runs on top of the office desks. The camera zooms into the lion's face, inside the office. Zoom out to the lion wearing a dark suit in an office room. The lion wearing looks at the camera and smiles. The camera zooms out slowly to the skyscraper exterior. Timelapse of sunset in the modern city
Video credit: Phenaki
Prompts used:
First person view of riding a motorcycle through a busy street. First person view of riding a motorcycle through a busy road in the woods. First person view of very slowly riding a motorcycle in the woods. First person view braking in a motorcycle in the woods. Running through the woods. First person view of running through the woods towards a beautiful house. First person view of running towards a large house. Running through houses between the cats. The backyard becomes empty. An elephant walks into the backyard. The backyard becomes empty. A robot walks into the backyard. A robot dances tango. First person view of running between houses with robots. First person view of running between houses; in the horizon, a lighthouse. First person view of flying on the sea over the ships. Zoom towards the ship. Zoom out quickly to show the coastal city. Zoom out quickly from the coastal city.
Video credit: Phenaki
About the Videos
You will notice that the prompts for the longer videos are very extensive. They are also composed of a series of non-sequiturs. The prompt variety in the 2-minute videos shows off the versatility of Phenaki’s model but may not be a great demonstration of how the technology would be used in practice. Presumably, you could also construct a more coherent storyline. The architecture design suggests this will work, and the teddy bear swimming in the ocean and then emerging onto the beach fits that model. It is just a much shorter example.
You will also note that the video quality is much lower for Phenaki than for Imagen. Upscaling video quality requires more computational resources, which has cost and processing-time implications. Another limiting factor is the availability of high-quality video data. Each of these constraints is likely to lessen over time, but they are constraints today.
Phenaki’s research paper mentions that “In essence, videos are just a sequence of images, but this does not mean that generating a long coherent video is easy. In practice, it is a significantly harder task because there is much less high quality data available and the computational requirements are much more severe. For image generation, there are datasets with billions of image-text pairs (such as LAION-5B and JFT4B) while the text-video datasets are substantially smaller e.g. WebVid with ∼10M videos, which is not enough given the higher complexity of open domain videos.
“To make the matters worse, one can argue that a single short text prompt is not sufficient to provide a complete description of a video (except for short clips), and instead, a generated video must be conditioned on a sequence of prompts, or a story, which narrates what happens over time. Ideally, a video generation model must be able to generate videos of arbitrary length, all the while having the capability of conditioning the generated frames at time t on prompts at time t that can vary over time. Such capability can clearly distinguish the video from a ‘moving image’ and open up the way to real-world creative applications in art, design and content creation.”
About the Phenaki Research
Phenaki’s research paper discusses the key limitations and how their model addresses them. It also suggests what makes the model output novel.
Generating videos from text is particularly challenging due to the computational cost, limited quantities of high quality text-video data and variable length of videos. To address these issues, we introduce a new model for learning video representation which compresses the video to a small representation of discrete tokens. This tokenizer uses causal attention in time, which allows it to work with variable-length videos. To generate video tokens from text we are using a bidirectional masked transformer conditioned on pre-computed text tokens. The generated video tokens are subsequently de-tokenized to create the actual video. To address data issues, we demonstrate how joint training on a large corpus of image-text pairs as well as a smaller number of video-text examples can result in generalization beyond what is available in the video datasets. Compared to the previous video generation methods, Phenaki can generate arbitrary long videos conditioned on a sequence of prompts (i.e. time variable text or a story) in open domain. To the best of our knowledge, this is the first time a paper studies generating videos from time variable prompts.
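At a high level, the paper describes three stages: a video tokenizer (C-ViViT) that compresses frames into discrete tokens using causal attention in time, a bidirectional masked transformer that generates video tokens conditioned on text tokens, and a de-tokenizer that turns those tokens back into frames. The sketch below is purely illustrative of that flow and of how a sequence of prompts could drive one long video; it is not Phenaki’s code, and every class, function, and parameter name here (VideoTokenizer, MaskedVideoTransformer, generate_story, overlap, and so on) is a hypothetical stand-in. The segment-by-segment chaining with a small token overlap is a simplification of how prompts could vary over time, not a claim about the paper’s exact procedure.

```python
# Illustrative sketch of a Phenaki-style text-to-video flow (not the authors' code).
# Stage names follow the paper's high-level description; all classes here are toy
# stand-ins so the control flow runs end to end.

import torch
import torch.nn as nn


class VideoTokenizer(nn.Module):
    """Toy stand-in for the causal-in-time video tokenizer (C-ViViT in the paper).

    Maps a (time, height, width, channels) clip to a grid of discrete token ids
    and back. The real tokenizer is learned with causal temporal attention so it
    can handle variable-length videos; here we just quantize patch means.
    """

    def __init__(self, codebook_size=256, patch=16):
        super().__init__()
        self.codebook_size = codebook_size
        self.patch = patch

    def encode(self, video):                      # video: (T, H, W, 3) in [0, 1]
        t, h, w, _ = video.shape
        patches = video.reshape(t, h // self.patch, self.patch,
                                w // self.patch, self.patch, 3)
        means = patches.mean(dim=(2, 4, 5))       # per-patch mean -> (T, H/p, W/p)
        return (means * (self.codebook_size - 1)).long()

    def decode(self, tokens):                     # tokens: (T, H/p, W/p)
        frames = tokens.float() / (self.codebook_size - 1)
        frames = frames.repeat_interleave(self.patch, 1).repeat_interleave(self.patch, 2)
        return frames.unsqueeze(-1).expand(-1, -1, -1, 3)


class MaskedVideoTransformer(nn.Module):
    """Toy stand-in for the bidirectional masked transformer over video tokens."""

    def __init__(self, codebook_size=256):
        super().__init__()
        self.codebook_size = codebook_size

    def generate(self, text_tokens, context_tokens, shape):
        # A trained model would iteratively unmask video tokens conditioned on the
        # text tokens (and, in this sketch, on tokens carried over from the previous
        # segment). Here we just sample random ids with the right shape.
        return torch.randint(0, self.codebook_size, shape)


def generate_story(prompts, frames_per_prompt=8, grid=(4, 4), overlap=2):
    """Condition each segment on its prompt plus the tail of the previous segment,
    then concatenate the segments -- the 'sequence of prompts' idea."""
    tokenizer = VideoTokenizer()
    transformer = MaskedVideoTransformer()
    segments, context = [], None
    for prompt in prompts:
        text_tokens = torch.tensor([hash(w) % 30000 for w in prompt.split()])
        tokens = transformer.generate(text_tokens, context,
                                      (frames_per_prompt, *grid))
        context = tokens[-overlap:]               # carry the last tokens forward
        segments.append(tokenizer.decode(tokens))
    return torch.cat(segments, dim=0)             # (total_frames, H, W, 3)


if __name__ == "__main__":
    video = generate_story([
        "A teddy bear diving in the ocean",
        "A teddy bear emerges from the water",
        "A teddy bear walks on the beach",
    ])
    print(video.shape)   # e.g. torch.Size([24, 64, 64, 3])
```

In a real system, the masked transformer would iteratively unmask tokens with a trained model rather than sampling at random; the point of the sketch is the loop, where each prompt conditions one chunk of video tokens and the tail of the previous chunk is carried forward for continuity.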
It’s not just about longer videos. The time-variable prompts mean you can create a story with sequential scenes that maintain coherence. Other text-to-video solutions demonstrated publicly do not support time sequencing beyond a simple activity. The cat examples below are even clearer when it comes to implied time sequencing, while the long video examples above are explicit.
By the way, the cat images were entirely new to the model. It had never seen them before and created a video, depicted by the sequence of images, based on the sample image and text prompt.
About Phenaki
Phenaki is a bit of a mystery. The outfit has a new research paper under review for the ICLR 2023 conference. The review is double-blind, and the authors are listed as anonymous. However, Google CEO Sundar Pichai tweeted earlier this month that there were “two important breakthroughs from @GoogleAI this week - Imagen Video, a new text-conditioned video diffusion model that generates 1280x768 24fps HD video. And Phenaki, a model which generates long coherent videos for a sequence of text prompts.”
Pichai is clearly claiming that Google AI is behind Phenaki. However, that is the only record so far of Google’s involvement. It could be a separate company that Google has invested in, but Pichai said there are two Google AI breakthroughs while tagging the Google AI Twitter account. That suggests Phenaki is an entirely Google operation, similar to Imagen.
It is also not clear why Imagen and Phenaki are separate initiatives. What is clear is that both solutions have access to a lot of resources and a lot of available compute power, which is required for text-to-video solutions.
What’s Next
Meta’s Make-A-Video and Google’s Imagen are producing higher-fidelity AI-generated video, but we are only seeing a few seconds of output, and they lack the time sequencing required for longer prompts and stories. This demonstration by Phenaki of multi-minute videos based on long prompts is impressive and different. It also demonstrates technology for significantly different use cases.
Make-A-Video and Imagen today are more aligned with animated GIF generation. They will greatly enhance the productivity of meme creators and offer new types of art. These solutions are also a step beyond what we see from text-to-image generators. However, today, they are not generating videos that depict a developing story. With that said, these solutions could add capabilities for more sophisticated time-based sequencing in the future.
Another important question is whether big tech companies such as Facebook and Google will come to dominate this synthetic media segment. Text and image generation AI models in broad use today were created by independent startups and labs. Several of those are said to be working on video and audio generation models as well. It is not clear whether video will be different, but it is notable that the big tech players are the first to make a big splash with text-to-video.
Then again, these solutions are not yet available to the public. Startups could still be first to market even if Facebook and Google are first to public demonstration. Given the constraints these companies place on themselves in terms of risk mitigation, it is likely a startup will offer the first option for a publicly accessible text-to-video service. Whoever does this will generate a lot of attention and likely court controversy at the same time. What is clear is that a new era of synthetic media has arrived.
N.B. All of the videos were created by Phenaki and can be found in their original posting locations here and here. We added them to the Voicebot YouTube channel to make them easier to share and to provide persistence. If the GitHub page or website is updated, this post will still have links to the videos, providing appropriate context for the reader.