Stability AI announced that Stable Diffusion 2.0 is available on GitHub. It will also be available through Stability AI “in the next few days.”
Stable Diffusion is the most widely adopted open source model for AI text-to-image generation. Stability AI, a company that actively supports Stable Diffusion and was co-founded by one of the model’s creators, revealed during a recent fundraising effort that more than 10 million people had already used the service as of two months ago.
The company also points to a recent analysis by venture capital firm Andreessen Horowitz suggesting that Stable Diffusion may have reached 10,000 GitHub stars faster than any project on record. That reflects a lot of tinkering, but also many dozens of deployments by both companies and hobbyists. The model is optimized to run on a single GPU, making it accessible to just about anyone who would like to run and modify it.
New Stable Diffusion 2.0 Features
Stability AI’s GitHub post lists the following updates for Stable Diffusion 2.0.
New stable diffusion model (Stable Diffusion 2.0-v) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses OpenCLIP-ViT/H as the text encoder and is trained from scratch. SD 2.0-v is a so-called v-prediction model.
The above model is finetuned from SD 2.0-base, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
New depth-guided stable diffusion model, finetuned from SD 2.0-base. The model is conditioned on monocular depth estimates inferred via MiDaS and can be used for structure-preserving img2img and shape-conditional synthesis.
Let’s break that down a little for practical interpretation. The 768x768 resolution in Stable Diffusion 2.0 is an upgrade from 512x512. The 512x512 base model is still the standard output, but the second model, 2.0-v, defaults to 768x768 images. This comes in addition to a new upscaler that can enhance images, whether generated by the model or uploaded, to 2048x2048.
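If you want to try the new resolution yourself, here is a minimal sketch of generating a 768x768 image. It assumes the checkpoint is loaded through the Hugging Face diffusers library under a "stabilityai/stable-diffusion-2" model ID; the model names and tooling support are assumptions for illustration, not details confirmed in the announcement.

```python
# Minimal sketch (assumed diffusers usage and model ID, not confirmed by the announcement).
import torch
from diffusers import StableDiffusionPipeline

# Load the 768x768 (2.0-v) checkpoint; the scheduler config shipped with the
# repository is expected to handle the v-prediction objective from the release notes.
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2",  # assumed Hugging Face model ID
    torch_dtype=torch.float16,
).to("cuda")

# Request the native 768x768 output size of the 2.0-v model.
image = pipe(
    "a photo of a lighthouse at sunset, detailed, 35mm",
    height=768,
    width=768,
).images[0]
image.save("lighthouse_768.png")
```

The separate upscaler works the same way in spirit: feed it a low-resolution image, generated or uploaded, and it returns a much larger one, which is how a 512x512 output gets to 2048x2048.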
The depth-guided model is also intriguing. For uploaded images, it estimates the depth and position of the subject and takes that structure into account more effectively when generating a new or alternative image. Stability AI says, “This new model can be used for structure-preserving image-to-image and shape-conditional image synthesis.”
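Here is a similar sketch for the depth-guided model, assuming a dedicated depth-to-image pipeline and a "stabilityai/stable-diffusion-2-depth" checkpoint; again, the class and model names are assumptions for illustration.

```python
# Rough sketch of structure-preserving img2img with the depth-guided model
# (assumed pipeline class and model ID, not confirmed by the announcement).
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",  # assumed Hugging Face model ID
    torch_dtype=torch.float16,
).to("cuda")

init_image = Image.open("living_room.jpg")  # hypothetical input photo

# Depth is estimated from the input (via MiDaS, per the release notes); "strength"
# controls how far the output may drift from the original layout.
result = pipe(
    prompt="a cozy cabin interior, warm lighting",
    image=init_image,
    strength=0.7,
).images[0]
result.save("cabin_from_living_room.png")
```

Because the depth map, and not just the pixels, conditions the generation, the output keeps the original scene’s layout while the prompt changes its content and style.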
Rapid Improvement Cycles
The 2.0 release of Stable Diffusion comes only three months after the general availability release in August. Midjourney’s v3 and v4 betas were also less than three months apart. OpenAI’s DALL-E 1 and DALL-E 2 announcements were only about four months apart. Given this pace, we might even be a bit behind on seeing DALL-E 3, but version 2 was such an extraordinary step up that the wait has not been an issue so far. The pattern, however, is unmistakable: improvement in these models is progressing quickly.
We will have more on Stable Diffusion 2.0 in upcoming posts. Until then, let us know what you think about the updates.