v0.9.0

:art: Stable Diffusion 2 is here!

Installation

pip install diffusers[torch]==0.9 transformers

Stable Diffusion 2.0 is available in several flavors:

Stable Diffusion 2.0-V at `768x768`

New stable diffusion model (Stable Diffusion 2.0-v) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but uses OpenCLIP-ViT/H as the text encoder and is trained from scratch. SD 2.0-v is a so-called v-prediction model.

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

repo_id = "stabilityai/stable-diffusion-2"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
image.save("astronaut.png")

Stable Diffusion 2.0-base at `512x512`

The above model is finetuned from SD 2.0-base, which was trained as a standard noise-prediction model on 512x512 images and is also made available.

import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

repo_id = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("astronaut.png")

Stable Diffusion 2.0 for Inpanting

This model for text-guided inpanting is finetuned from SD 2.0-base. Follows the mask-generation strategy presented in LAMA which, in combination with the latent VAE representations of the masked image, are used as an additional conditioning.

import PIL
import requests
import torch
from io import BytesIO
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

def download_image(url):
    response = requests.get(url)
    return PIL.Image.open(BytesIO(response.content)).convert("RGB")

img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))

repo_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
image.save("yellow_cat.png")

Stable Diffusion X4 Upscaler

The model was trained on crops of size 512x512 and is a text-guided latent upscaling diffusion model. In addition to the textual input, it receives a noise_level as an input parameter, which can be used to add noise to the low-resolution input according to a predefined diffusion schedule.

import requests
from PIL import Image
from io import BytesIO
from diffusers import StableDiffusionUpscalePipeline
import torch

model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")

url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
low_res_img = low_res_img.resize((128, 128))

prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
upscaled_image.save("upsampled_cat.png")

Saving & Loading is fixed for Versatile Diffusion

Previously there was a :bug: when saving & loading versatile diffusion - this is fixed now so that memory efficient saving & loading works as expected.

[Versatile Diffusion] Fix remaining tests by @patrickvonplaten in #1418

:memo: Changelog

add v prediction by @patil-suraj in #1386
Adapt UNet2D for supre-resolution by @patil-suraj in #1385
Version 0.9.0.dev0 by @anton-l in #1394
Make height and width optional by @patrickvonplaten in #1401
[Config] Add optional arguments by @patrickvonplaten in #1395
Upscaling fixed by @patrickvonplaten in #1402
Add the new SD2 attention params to the VD text unet by @anton-l in #1400
Deprecate sample size by @patrickvonplaten in #1406
Support SD2 attention slicing by @anton-l in #1397
Add SD2 inpainting integration tests by @anton-l in #1412
Fix sample size conversion script by @patrickvonplaten in #1408
fix clip guided by @patrickvonplaten in #1414
Fix all stable diffusion by @patrickvonplaten in #1415
[MPS] call contiguous after permute by @kashif in #1411
Deprecate predict_epsilon by @pcuenca in #1393
Fix ONNX conversion and inference by @anton-l in #1416
Allow to set config params directly in init by @patrickvonplaten in #1419
Add tests for Stable Diffusion 2 V-prediction 768x768 by @anton-l in #1420
StableDiffusionUpscalePipeline by @patil-suraj in #1396
added initial v-pred support to DPM-solver by @kashif in #1421
SD2 docs by @patrickvonplaten in #1424