Stable Video Diffusion, SDXL Turbo, IP Adapters, Kandinsky 3.0

Stable Video Diffusion

Stable Video Diffusion is a powerful image-to-video generation model that can generate high-resolution (576x1024) videos of 2-4 seconds, conditioned on an input image.

Image to Video Generation

There are two variants of SVD: SVD and SVD-XT. The SVD checkpoint is trained to generate 14 frames, and the SVD-XT checkpoint is further finetuned to generate 25 frames.
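As a quick sanity check on the 2-4 second figure, the frame counts of the two variants can be converted to clip lengths. This is a minimal sketch that assumes playback at 7 fps, the value passed to export_to_video in this section's example; the helper name clip_seconds is hypothetical and not part of diffusers.

```python
# Rough clip length for each SVD variant.
# Assumption: playback at 7 fps, matching the export_to_video call in this section.
FRAMES_PER_VARIANT = {"SVD": 14, "SVD-XT": 25}


def clip_seconds(num_frames: int, fps: int = 7) -> float:
    """Return the clip duration in seconds for a frame count and frame rate."""
    return num_frames / fps


durations = {name: clip_seconds(n) for name, n in FRAMES_PER_VARIANT.items()}
for name, seconds in durations.items():
    print(f"{name}: {seconds:.2f} s")
```

At 7 fps this gives 2.00 s for SVD and about 3.57 s for SVD-XT; a different fps value changes the durations proportionally.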

You need to condition the generation on an initial image, as follows:

import torch

from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt", torch_dtype=torch.float16, variant="fp16"
)
pipe.enable_model_cpu_offload()

# Load the conditioning image
image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/svd/rocket.png?download=true")
image = image.resize((1024, 576))

generator = torch.manual_seed(42)
frames = pipe(image, decode_chunk_size=8, generator=generator).frames[0]

export_to_video(frames, "generated.mp4", fps=7)

Generating videos is memory intensive, so you can use the decode_chunk_size argument to control how many frames are decoded at once, which reduces memory usage. Tweak this value based on your GPU memory. Setting decode_chunk_size=1 decodes one frame at a time and uses the least memory, but the video might show some flickering.

The example also enables model CPU offloading (enable_model_cpu_offload) to reduce memory usage further.
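To get a feel for what decode_chunk_size trades off, the sketch below computes how many frames each decode pass would handle for a given chunk size. It only illustrates the batching idea; it is not the pipeline's internal code, and the helper name chunk_sizes is hypothetical.

```python
def chunk_sizes(num_frames: int, decode_chunk_size: int) -> list[int]:
    """Return how many frames each decode pass would handle."""
    if decode_chunk_size < 1:
        raise ValueError("decode_chunk_size must be at least 1")
    return [
        min(decode_chunk_size, num_frames - start)
        for start in range(0, num_frames, decode_chunk_size)
    ]


# 25 frames (SVD-XT) decoded 8 at a time, as in the example above
print(chunk_sizes(25, 8))  # [8, 8, 8, 1]

# One frame per pass: smallest batches, but the most passes
print(len(chunk_sizes(25, 1)))  # 25
```

Smaller chunks mean fewer frames held in memory per pass at the cost of more passes.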


SDXL Turbo

SDXL Turbo is an adversarial time-distilled Stable Diffusion XL (SDXL) model capable of running inference in as little as 1 step. Also, it does not use classifier-free guidance, further increasing its speed. On a good consumer GPU, you can now generate an image in just 100ms.

Text-to-Image

For text-to-image, pass a text prompt. By default, SDXL Turbo generates a 512x512 image, and that resolution gives the best results. You can try setting the height and width parameters to 768x768 or 1024x1024, but you should expect quality degradations when doing so.

Make sure to set guidance_scale to 0.0 to disable classifier-free guidance, as the model was trained without it. A single inference step is enough to generate high-quality images; increasing the number of steps to 2, 3, or 4 should improve image quality.

from diffusers import AutoPipelineForText2Image
import torch

pipeline_text2image = AutoPipelineForText2Image.from_pretrained("stabilityai/sdxl-turbo", torch_dtype=torch.float16, variant="fp16")
pipeline_text2image = pipeline_text2image.to("cuda")

prompt = "A cinematic shot of a baby racoon wearing an intricate italian priest robe."

image = pipeline_text2image(prompt=prompt, guidance_scale=0.0, num_inference_steps=1).images[0]
image

Image-to-image

For image-to-image generation, make sure that num_inference_steps * strength is greater than or equal to 1. The image-to-image pipeline runs for int(num_inference_steps * strength) steps, e.g. 0.5 * 2.0 = 1 step in the example below.

from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image, make_image_grid

# use from_pipe to avoid consuming additional memory when loading a checkpoint
pipeline = AutoPipelineForImage2Image.from_pipe(pipeline_text2image).to("cuda")

init_image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png")
init_image = init_image.resize((512, 512))

prompt = "cat wizard, gandalf, lord of the rings, detailed, fantasy, cute, adorable, Pixar, Disney, 8k"

image = pipeline(prompt, image=init_image, strength=0.5, guidance_scale=0.0, num_inference_steps=2).images[0]
make_image_grid([init_image, image], rows=1, cols=2)
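The step-count rule for image-to-image stated above can be checked before calling the pipeline. This is a minimal sketch of that arithmetic only; effective_steps and is_valid_setting are hypothetical helper names, not diffusers functions.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Steps actually run, per the rule int(num_inference_steps * strength)."""
    return int(num_inference_steps * strength)


def is_valid_setting(num_inference_steps: int, strength: float) -> bool:
    """True when at least one denoising step would run."""
    return effective_steps(num_inference_steps, strength) >= 1


# The setting used in the example above: 2 steps at strength 0.5
print(effective_steps(2, 0.5))   # 1
# A single step at strength 0.5 rounds down to zero steps
print(is_valid_setting(1, 0.5))  # False
```

If the check fails, raise either num_inference_steps or strength until the product reaches 1.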

IP Adapters

IP Adapters have been shown to be remarkably powerful at generating images conditioned on other images.

Thanks to @okotaku, we have added IP Adapters to the most important pipelines, allowing you to combine them in a variety of workflows. For example, they work with Img2Img, ControlNet, and LCM-LoRA out of the box.

LCM-LoRA

from diffusers import DiffusionPipeline, LCMScheduler
import torch
from diffusers.utils import load_image

model_id = "sd-dreambooth-library/herge-style"
lcm_lora_id = "latent-consistency/lcm-lora-sdv1-5"

pipe = DiffusionPipeline.from_pretrained(model_id, torch_dtype=torch.float16)

pipe.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")
pipe.load_lora_weights(lcm_lora_id)
pipe.scheduler = LCMScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()

prompt = "best quality, high quality"
image = load_image("https://user-images.githubusercontent.com/24734142/266492875-2d50d223-8475-44f0-a7c6-08b51cb53572.png")
images = pipe(
    prompt=prompt,
    ip_adapter_image=image,
    num_inference_steps=4,
    guidance_scale=1,
).images[0]


ControlNet

from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
from diffusers.utils import load_image

controlnet_model_path = "lllyasviel/control_v11f1p_sd15_depth"
controlnet = ControlNetModel.from_pretrained(controlnet_model_path, torch_dtype=torch.float16)

pipeline = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16)
pipeline.to("cuda")

image = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/statue.png")
depth_map = load_image("https://huggingface.co/datasets/YiYiXu/testing-images/resolve/main/depth.png")

pipeline.load_ip_adapter("h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin")

generator = torch.Generator(device="cpu").manual_seed(33)
images = pipeline(
    prompt="best quality, high quality",
    image=depth_map,
    ip_adapter_image=image,
    negative_prompt="monochrome, lowres, bad anatomy, worst quality, low quality",
    num_inference_steps=50,
    generator=generator,
).images
images[0].save("yiyi_test_2_out.png")

[Image table: ip_image | condition | output]

For more information:

https://huggingface.co/docs/diffusers/main/en/using-diffusers/loading_adapters#ip-adapter

Kandinsky 3.0

Kandinsky 3.0, the third version of the model, offers much improved text-to-image alignment thanks to using Flan-T5 as its text encoder.

Text-to-Image

from diffusers import AutoPipelineForText2Image
import torch

pipe = AutoPipelineForText2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
        
prompt = "A photograph of the inside of a subway train. There are raccoons sitting on the seats. One of them is reading a newspaper. The window shows the city in the background."

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, num_inference_steps=25, generator=generator).images[0]

Image-to-Image

from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image
import torch

pipe = AutoPipelineForImage2Image.from_pretrained("kandinsky-community/kandinsky-3", variant="fp16", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
        
prompt = "A painting of the inside of a subway train with tiny raccoons."
image = load_image("https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/kandinsky3/t2i.png")

generator = torch.Generator(device="cpu").manual_seed(0)
image = pipe(prompt, image=image, strength=0.75, num_inference_steps=25, generator=generator).images[0]

Check it out:

https://huggingface.co/docs/diffusers/main/en/api/pipelines/kandinsky3#kandinsky-3

All commits

LCM-LoRA docs by @patil-suraj in #5782
[Docs] Update and make improvements by @standardAI in #5819
[docs] Fix title by @stevhliu in #5831
Improve setup.py and add dependency check by @patrickvonplaten in #5826
[Docs] add: japanese sdxl as a reference by @sayakpaul in #5844
Set usedforsecurity=False in hashlib methods (FIPS compliance) by @Wauplin in #5790
fix memory consistency decoder test by @williamberman in #5828
[PEFT] Unpin peft by @patrickvonplaten in #5850
Speed up the peft lora unload by @pacman100 in #5741
[Tests/LoRA/PEFT] Test also on PEFT / transformers / accelerate latest by @younesbelkada in #5820
UnboundLocalError in SDXLInpaint.prepare_latents() by @a-r-r-o-w in #5648
[ControlNet] fix import in single file loading by @sayakpaul in #5834
[Styling] stylify using ruff by @kashif in #5841
[Community] [WIP] LCM Interpolation Pipeline by @a-r-r-o-w in #5767
[JAX] Replace uses of jax.devices("cpu") with jax.local_devices(backend="cpu") by @hvaara in #5864
[test / peft] Fix silent behaviour on PR tests by @younesbelkada in #5852
fix an issue that ipex occupy too much memory, it will not impact per… by @linlifan in #5625
Update LCMScheduler Inference Timesteps to be More Evenly Spaced by @dg845 in #5836
Revert "[Docs] Update and make improvements" by @standardAI in #5858
[docs] Loader APIs by @stevhliu in #5813
Update README.md by @co63oc in #5855
Add tests fetcher by @DN6 in #5848
Addition of new callbacks to controlnets by @a-r-r-o-w in #5812
[docs] MusicLDM by @stevhliu in #5854
Add features to the Dreambooth LoRA SDXL training script by @linoytsaban in #5508
[feat] IP Adapters (author @okotaku ) by @yiyixuxu in #5713
[Lora] Seperate logic by @patrickvonplaten in #5809
ControlNet+Adapter pipeline, and ControlNet+Adapter+Inpaint pipeline by @affromero in #5869
Adds an advanced version of the SD-XL DreamBooth LoRA training script supporting pivotal tuning by @linoytsaban in #5883
[bug fix] fix small bug in readme template of sdxl lora training script by @linoytsaban in #5906
[bug fix] fix small bug in readme template of sdxl lora training script by @linoytsaban in #5914
[Docs] add: 8bit inference with pixart alpha by @sayakpaul in #5814
[@cene555][Kandinsky 3.0] Add Kandinsky 3.0 by @patrickvonplaten in #5913
[Examples] Allow downloading variant model files by @patrickvonplaten in #5531
[Fix: pixart-alpha] random 512px resolution bug by @lawrence-cj in #5842
[Core] add support for gradient checkpointing in transformer_2d by @sayakpaul in #5943
Deprecate KarrasVeScheduler and ScoreSdeVpScheduler by @a-r-r-o-w in #5269
Add Custom Timesteps Support to LCMScheduler and Supported Pipelines by @dg845 in #5874
set the model to train state before accelerator prepare by @sywangyi in #5099
Avoid computing min() that is expensive when do_normalize is False in the image processor by @ivanprado in #5896
Fix LCM Stable Diffusion distillation bug related to parsing unet_time_cond_proj_dim by @dg845 in #5893
add LoRA weights load and fuse support for IPEX pipeline by @linlifan in #5920
Replace multiple variables with one variable. by @hi-sushanta in #5715
fix: error on device for lpw_stable_diffusion_xl pipeline if pipe.enable_sequential_cpu_offload() enabled by @VicGrygorchyk in #5885
[Vae] Make sure all vae's work with latent diffusion models by @patrickvonplaten in #5880
[Tests] Make sure that we don't run tests multiple times by @patrickvonplaten in #5949
[Community Pipeline] Diffusion Posterior Sampling for General Noisy Inverse Problems by @tongdaxu in #5939
[From_pretrained] Fix warning by @patrickvonplaten in #5948
[load_textual_inversion]: allow multiple tokens by @yiyixuxu in #5837
[docs] Fix space by @stevhliu in #5898
fix: minor typo in docstring by @soumik12345 in #5961
[ldm3d] Ldm3d upscaler to community pipeline by @estelleafl in #5870
[docs] Update pipeline list by @stevhliu in #5952
[Tests] Refactor test_examples.py for better readability by @sayakpaul in #5946
added doc for Kandinsky3.0 by @charchit7 in #5937
[bug fix] Inpainting for MultiAdapter by @affromero in #5922
Rename output_dir argument by @linhqyy in #5916
[LoRA refactor] move several state dict conversion utils out of lora.py by @sayakpaul in #5955
Support of ip-adapter to the StableDiffusionControlNetInpaintPipeline by @juancopi81 in #5887
[docs] LCM training by @stevhliu in #5796
Controlnet ssd 1b support by @MarkoKostiv in #5779
[Pipeline] Add TextToVideoZeroSDXLPipeline by @vahramtadevosyan in #4695
[Wuerstchen] Adapt lora training example scripts to use PEFT by @kashif in #5959
Fixed custom module importing on Windows by @PENGUINLIONG in #5891
Add SVD by @patil-suraj in #5895
[SDXL Turbo] Add some docs by @patrickvonplaten in #5982
Fix SVD doc by @patil-suraj in #5983

Significant community contributions

The following contributors have made significant changes to the library over the last release:

@a-r-r-o-w
- UnboundLocalError in SDXLInpaint.prepare_latents() (#5648)
- [Community] [WIP] LCM Interpolation Pipeline (#5767)
- Addition of new callbacks to controlnets (#5812)
- Deprecate KarrasVeScheduler and ScoreSdeVpScheduler (#5269)
@dg845
- Update LCMScheduler Inference Timesteps to be More Evenly Spaced (#5836)
- Add Custom Timesteps Support to LCMScheduler and Supported Pipelines (#5874)
- Fix LCM Stable Diffusion distillation bug related to parsing unet_time_cond_proj_dim (#5893)
@affromero
- ControlNet+Adapter pipeline, and ControlNet+Adapter+Inpaint pipeline (#5869)
- [bug fix] Inpainting for MultiAdapter (#5922)
@tongdaxu
- [Community Pipeline] Diffusion Posterior Sampling for General Noisy Inverse Problems (#5939)
@estelleafl
- [ldm3d] Ldm3d upscaler to community pipeline (#5870)
@vahramtadevosyan
- [Pipeline] Add TextToVideoZeroSDXLPipeline (#4695)

v0.24.0
