Google Demonstrates Text-to-Video Solution
Not available yet for general use due to safety concerns
Credit: Google Imagen / Prompt: Teddy bear skating in Times Square
Not to be outdone by Meta, Google published a demonstration of Imagen Video, its text-to-video solution, yesterday. Just a few months ago, Google Research debuted its Imagen text-to-image solution. In less than half a year, the team has moved on to a text-to-video diffusion model.
The work is surprisingly good.
I was also surprised by how good the text-to-text and text-to-image generators were when first released and how rapidly they improved. Somehow it seemed like video would be different. It is different; the shortcomings seem more pronounced, but diffusion models are still exceeding expectations.
Credit: Google Imagen / Prompt: Flying through an intense battle between pirate ships in a stormy ocean.
Credit: Google Imagen / Prompt: Coffee pouring into a cup.
Automatic Meme Generator
As I mentioned in the story about Meta’s (i.e., Facebook’s) Make-A-Video solution, these solutions are destined to increase the number of new memes by orders of magnitude. Granted, the proliferation of memes will make each one less impactful, but at least there will be more variety and, in many cases, higher quality.
Creators’ Dream
This is a dream solution for social media creators. The core content for many creators is verbal-first, while the channels they are distributed on are visual-first. And all of the platforms are biasing toward video because that is what drives views, view time, and engagement.
Given that these creators are often looking to spice up the visuals in their streams, text-to-video generators offer a way to do that without incurring royalty expenses, without a significant time investment, and with visuals that depict a custom message aligned with each segment.
A Cascade of Diffusion
The abstract of Google Research’s academic paper, which outlines the core hypothesis and methodology, characterizes the solution as follows:
We present Imagen Video, a text-conditional video generation system based on a cascade of video diffusion models. Given a text prompt, Imagen Video generates high definition videos using a base video generation model and a sequence of interleaved spatial and temporal video super-resolution models. We describe how we scale up the system as a high definition text-to-video model including design decisions such as the choice of fully-convolutional temporal and spatial super-resolution models at certain resolutions, and the choice of the v-parameterization of diffusion models. In addition, we confirm and transfer findings from previous work on diffusion-based image generation to the video generation setting. Finally, we apply progressive distillation to our video models with classifier-free guidance for fast, high quality sampling. We find Imagen Video not only capable of generating videos of high fidelity, but also having a high degree of controllability and world knowledge, including the ability to generate diverse videos and text animations in various artistic styles and with 3D object understanding...
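For readers unfamiliar with the jargon, the classifier-free guidance mentioned in the abstract is a standard diffusion sampling technique that mixes a model’s text-conditional and unconditional predictions to strengthen prompt adherence. Here is a minimal sketch of that mixing step; the arrays are generic stand-ins for model outputs and are not tied to Imagen Video’s actual code.

```python
import numpy as np

def classifier_free_guidance(eps_uncond: np.ndarray,
                             eps_cond: np.ndarray,
                             guidance_weight: float) -> np.ndarray:
    """Combine unconditional and text-conditional noise predictions.

    Implements the standard formula eps_uncond + w * (eps_cond - eps_uncond);
    a weight above 1.0 pushes samples toward the text prompt.
    """
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)

# Toy usage with placeholder arrays shaped (frames, height, width, channels).
eps_u = np.zeros((16, 24, 40, 3))
eps_c = np.ones((16, 24, 40, 3))
guided = classifier_free_guidance(eps_u, eps_c, guidance_weight=3.0)
print(guided.shape, guided.max())  # (16, 24, 40, 3) 3.0
```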
Imagen Video generates high resolution videos with Cascaded Diffusion Models. The first step is to take an input text prompt and encode it into textual embeddings with a T5 text encoder. A base Video Diffusion Model then generates a 16-frame video at 40×24 resolution and 3 frames per second; this is then followed by multiple Temporal Super-Resolution (TSR) and Spatial Super-Resolution (SSR) models that upsample and generate a final 128-frame video at 1280×768 resolution and 24 frames per second, resulting in 5.3 seconds of high definition video.
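To make the cascade concrete, here is a small sketch that tracks how the video’s shape grows through the pipeline. Only the base output (16 frames at 40×24, 3 fps) and the final output (128 frames at 1280×768, 24 fps) come from Google’s description; the number, ordering, and per-stage factors of the TSR/SSR stages below are hypothetical placeholders chosen so that they compose to the published totals.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class VideoSpec:
    frames: int
    width: int
    height: int

    @property
    def fps(self) -> float:
        # Clip length is fixed at roughly 5.3 seconds, so the frame rate
        # rises as TSR stages add frames (base: 16 frames at 3 fps).
        return self.frames / (16 / 3)

def temporal_sr(spec: VideoSpec, factor: int) -> VideoSpec:
    """Temporal Super-Resolution: fill in new frames between existing ones."""
    return replace(spec, frames=spec.frames * factor)

def spatial_sr(spec: VideoSpec, factor: int) -> VideoSpec:
    """Spatial Super-Resolution: upsample every frame to a higher resolution."""
    return replace(spec, width=spec.width * factor, height=spec.height * factor)

# Base Video Diffusion Model output, conditioned on T5 text embeddings.
spec = VideoSpec(frames=16, width=40, height=24)
print(f"base: {spec} @ {spec.fps:.0f} fps")

# Hypothetical interleaving of TSR and SSR stages; the factors multiply out
# to 8x more frames and 32x higher resolution, matching the final output.
cascade = [(temporal_sr, 2), (spatial_sr, 2),
           (temporal_sr, 2), (spatial_sr, 4),
           (temporal_sr, 2), (spatial_sr, 4)]

for stage, factor in cascade:
    spec = stage(spec, factor)
    print(f"{stage.__name__} x{factor}: {spec} @ {spec.fps:.0f} fps")

# Final output: 128 frames at 1280x768 and 24 fps (~5.3 s of video).
assert (spec.frames, spec.width, spec.height) == (128, 1280, 768)
```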
Not Available
Most people won’t care about how this is done. They will just want access so they can use it. However, similar to Make-A-Video, Imagen Video is not available for general use. The reason is what everyone now calls trust & safety.
Generative modeling has made tremendous progress, especially in recent text-to-image models. Imagen Video is another step forward in generative modelling capabilities, advancing text-to-video AI systems. Video generative models can be used to positively impact society, for example by amplifying and augmenting human creativity. However, these generative models may also be misused, for example to generate fake, hateful, explicit or harmful content. We have taken multiple steps to minimize these concerns, for example in internal trials, we apply input text prompt filtering, and output video content filtering. However, there are several important safety and ethical challenges remaining. Imagen Video and its frozen T5-XXL text encoder were trained on problematic data. While our internal testing suggest much of explicit and violent content can be filtered out, there still exists social biases and stereotypes which are challenging to detect and filter. We have decided not to release the Imagen Video model or its source code until these concerns are mitigated.
Google Research knows what the training image set contains and has seen the outputs. The team is reluctant to face the PR and social media backlash that would follow if a generated video violated its trust and safety guidelines. The model may also not be quite ready for prime time. So, all you have for now is the demonstration. With that said, consumer use seems inevitable in the near term.
Credit: Google Imagen / Prompt: Astronaut riding a horse.
Credit: Google Imagen / Prompt: Drone flythrough of a tropical jungle in snow.
Text-to-X Era Continues
Synthetic media generators are very compelling, and they are well liked by large groups of users. Collectively, Text-to-X solutions such as text-to-video are proliferating. Many companies were founded and raised significant investor funding in the text-to-text solution space; several of them are simply repackaging GPT-3. Images don’t appear to be far behind the text momentum.
Text-to-video and audio are bound to represent the next big adoption waves in synthetic media after images. Today, these tools are largely used for experimentation. However, day-to-day use cases are already emerging.
Skilled Creative CEO Brandon Kaplan said at the Synthedia conference that his team had already used text-to-image solutions to create mood boards and generate ideas for new visual elements for projects. Dabble Labs has been using a text-to-code generator for more than a year, and there are many startups offering practical software features for text generation.