Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images, allowing for cheaper and faster inference.
Here is how to use Würstchen as a pipeline:
import torch
from diffusers import AutoPipelineForText2Image
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS
pipeline = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda")
caption = "Anthropomorphic cat dressed as a firefighter"
images = pipeline(
caption,
height=1024,
width=1536,
prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
prior_guidance_scale=4.0,
num_images_per_prompt=4,
).images
To learn more about the pipeline, check out the official documentation.
This pipeline was contributed by one of the authors of Würstchen, @dome272, with help from @kashif and @patrickvonplaten.
👉 Try out the model here: https://huggingface.co/spaces/warp-ai/Wuerstchen
T2I-Adapter is an efficient plug-and-play model that provides extra guidance to pre-trained text-to-image models while keeping the original large models frozen.
In collaboration with the Tencent ARC researchers, we trained T2I Adapters on various conditions: sketch, canny, lineart, depth, and openpose.
Below is an example of how to use the StableDiffusionXLAdapterPipeline.
First, ensure that controlnet_aux is installed:
pip install -U controlnet_aux==0.0.7
Then we can initialize the pipeline:
import torch
from controlnet_aux.lineart import LineartDetector
from diffusers import (AutoencoderKL, EulerAncestralDiscreteScheduler,
StableDiffusionXLAdapterPipeline, T2IAdapter)
from diffusers.utils import load_image, make_image_grid
# load adapter
adapter = T2IAdapter.from_pretrained(
"TencentARC/t2i-adapter-lineart-sdxl-1.0", torch_dtype=torch.float16, varient="fp16"
).to("cuda")
# load pipeline
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
euler_a = EulerAncestralDiscreteScheduler.from_pretrained(
model_id, subfolder="scheduler"
)
vae = AutoencoderKL.from_pretrained(
"madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
model_id,
vae=vae,
adapter=adapter,
scheduler=euler_a,
torch_dtype=torch.float16,
variant="fp16",
).to("cuda")
# load lineart detector
line_detector = LineartDetector.from_pretrained("lllyasviel/Annotators").to("cuda")
We then load an image to compute the lineart conditioning:
url = "https://huggingface.co/Adapter/t2iadapter/resolve/main/figs_SDXLV1.0/org_lin.jpg"
image = load_image(url)
image = line_detector(image, detect_resolution=384, image_resolution=1024)
Then we generate:
prompt = "Ice dragon roar, 4k photo"
negative_prompt = "anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured"
gen_images = pipe(
prompt=prompt,
negative_prompt=negative_prompt,
image=image,
num_inference_steps=30,
adapter_conditioning_scale=0.8,
guidance_scale=7.5,
).images[0]
Refer to the official documentation to learn more about StableDiffusionXLAdapterPipeline.
This blog post summarizes our experiences and provides all the resources (including the pre-trained T2I Adapter checkpoints) to get started using T2I Adapters for SDXL.
We’re also releasing a training script for training your custom T2I Adapters on SDXL. Check out the documentation to learn more.
Thanks to @MC-E (one of the authors of T2I Adapters) for contributing the StableDiffusionXLAdapterPipeline in #4696.
We introduced “lazy imports” (#4829) to significantly improve the time it takes to import our modules (such as pipelines, models, and so on). Below is a comparison of the timings with and without lazy imports on import diffusers.
With lazy imports:
real 0m0.417s
user 0m0.714s
sys 0m0.499s
Without lazy imports:
real 0m5.391s
user 0m5.299s
sys 0m1.273s
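These timings can be reproduced with a quick measurement like the following (a minimal sketch; absolute numbers will vary by machine):
import time
start = time.time()
import diffusers  # the import being measured
print(f"importing diffusers took {time.time() - start:.2f} seconds")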
Previously, loading LoRA parameters with load_lora_weights() was time-consuming, as reported in #4975. To address this, we introduced a low_cpu_mem_usage argument to the load_lora_weights() method in #4994, which speeds up loading significantly. Just pass low_cpu_mem_usage=True to reap the benefits.
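For example (a minimal sketch, reusing the offset-noise LoRA checkpoint shown later in these notes):
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")
# low_cpu_mem_usage=True activates the faster, low-memory loading path
pipe.load_lora_weights(
    "stabilityai/stable-diffusion-xl-base-1.0",
    weight_name="sd_xl_offset_example-lora_1.0.safetensors",
    low_cpu_mem_usage=True,
)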
LoRA weights can now be fused into the model weights, allowing models that have loaded LoRA weights to run as fast as models without them. It also makes it possible to fuse multiple LoRAs into the same model.
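Continuing the sketch above, the fusion workflow looks roughly like this:
# fuse the already-loaded LoRA into the base weights; inference now runs at base-model speed
pipe.fuse_lora()
image = pipe("Astronaut in a jungle, cold color palette").images[0]
# undo the fusion to restore the original base weights
pipe.unfuse_lora()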
For more information, have a look at the documentation and the original PR: https://github.com/huggingface/diffusers/pull/4473.
Almost all LoRA formats out there for SDXL are now supported. For more details, please check the documentation.
- AutoencoderTiny by @Isotr0py in #4627
- DIFFUSERS_TEST_DEVICE backend list with trying device by @vvvm23 in #4673
- train_text_to_image_lora_sdxl.py by @sayakpaul in #4632
- from_pretrained when load optional components by @yiyixuxu in #4745
- isinstance() by @kashif in #4992
The following contributors have made significant changes to the library over the last release:
Stable Diffusion XL's strength default was accidentally set to 1.0 when creating the pipeline. The default should be set to 0.9999 instead. This patch release fixes that.
https://github.com/huggingface/diffusers/commit/3eb498e7b4868bca7460d41cda52d33c3ede5502#r125606630 introduced a 🐛 that broke torch.compile() support for ControlNets. This patch release fixes that.
The 🧨 diffusers team has trained two ControlNets on Stable Diffusion XL (SDXL):
You can find all the SDXL ControlNet checkpoints here, including some smaller ones (5 to 7x smaller).
To know more about how to use these ControlNets to perform inference, check out the respective model cards and the documentation. To train custom SDXL ControlNets, you can try out our training script.
This release also introduces support for combining multiple ControlNets trained on SDXL and performing inference with them. Refer to the documentation to learn more.
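A minimal sketch of multi-ControlNet inference on SDXL, assuming the canny and depth SDXL checkpoints mentioned above and pre-computed conditioning images:
import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
controlnets = [
    ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16),
]
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")
canny_image = ...  # conditioning image for the canny ControlNet
depth_image = ...  # conditioning image for the depth ControlNet
# pass one conditioning image per ControlNet, in the same order
image = pipe("aerial view of a futuristic city", image=[canny_image, depth_image]).images[0]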
The GLIGEN model was developed by researchers and engineers from the University of Wisconsin-Madison, Columbia University, and Microsoft. The StableDiffusionGLIGENPipeline can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes, if input images are given, this pipeline can insert objects described by text into the region defined by bounding boxes. Otherwise, it'll generate an image described by the caption/prompt and insert objects described by text into the region defined by bounding boxes. It's trained on the COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.
(GIF from the official website)
Grounded inpainting
import torch
from diffusers import StableDiffusionGLIGENPipeline
from diffusers.utils import load_image
# Insert objects described by text at the region defined by bounding boxes
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
"masterful/gligen-1-4-inpainting-text-box", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
input_image = load_image(
"https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
)
prompt = "a birthday cake"
boxes = [[0.2676, 0.6088, 0.4773, 0.7183]]
phrases = ["a birthday cake"]
images = pipe(
prompt=prompt,
gligen_phrases=phrases,
gligen_inpaint_image=input_image,
gligen_boxes=boxes,
gligen_scheduled_sampling_beta=1,
output_type="pil",
num_inference_steps=50,
).images
images[0].save("./gligen-1-4-inpainting-text-box.jpg")
Grounded generation
import torch
from diffusers import StableDiffusionGLIGENPipeline
from diffusers.utils import load_image
# Generate an image described by the prompt and
# insert objects described by text at the region defined by bounding boxes
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
"masterful/gligen-1-4-generation-text-box", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")
prompt = "a waterfall and a modern high speed train running through the tunnel in a beautiful forest with fall foliage"
boxes = [[0.1387, 0.2051, 0.4277, 0.7090], [0.4980, 0.4355, 0.8516, 0.7266]]
phrases = ["a waterfall", "a modern high speed train running through the tunnel"]
images = pipe(
prompt=prompt,
gligen_phrases=phrases,
gligen_boxes=boxes,
gligen_scheduled_sampling_beta=1,
output_type="pil",
num_inference_steps=50,
).images
images[0].save("./gligen-1-4-generation-text-box.jpg")
Refer to the documentation to learn more.
Thanks to @nikhil-masterful for contributing GLIGEN in #4441.
@madebyollin trained two Autoencoders (on Stable Diffusion and Stable Diffusion XL, respectively) to dramatically cut down the image decoding time. The effects are especially pronounced when working with larger-resolution images. You can use AutoencoderTiny to take advantage of it.
Here’s the example usage for Stable Diffusion:
import torch
from diffusers import DiffusionPipeline, AutoencoderTiny
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("cheesecake.png")
Refer to the documentation to learn more. Refer to this material to understand the implications of using this Autoencoder in terms of inference latency and memory footprint.
Stable Diffusion XL’s (SDXL) high memory requirements often seem restrictive when it comes to using it for downstream applications. Even if one uses parameter-efficient fine-tuning techniques like LoRA, fine-tuning just the UNet component of SDXL can be quite memory-intensive. So, running it on a free-tier Colab Notebook (that usually has a 16 GB T4 GPU attached) seems impossible.
Now, with better support for gradient checkpointing and other recipes like 8 Bit Adam (via bitsandbytes), it is possible to fine-tune the UNet of SDXL with DreamBooth and LoRA on a free-tier Colab Notebook.
Check out the Colab Notebook to learn more.
Thanks to @ethansmith2000 for improving the gradient checkpointing support in #4474.
push_to_hub for models, schedulers, and pipelines
Our models, schedulers, and pipelines now support a push_to_hub option in save_pretrained() and also come with a dedicated push_to_hub() method. Below are some examples of usage.
Models
from diffusers import ControlNetModel
controlnet = ControlNetModel(
block_out_channels=(32, 64),
layers_per_block=2,
in_channels=4,
down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
cross_attention_dim=32,
conditioning_embedding_out_channels=(16, 32),
)
controlnet.push_to_hub("my-controlnet-model")
# or controlnet.save_pretrained("my-controlnet-model", push_to_hub=True)
Schedulers
from diffusers import DDIMScheduler
scheduler = DDIMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False,
set_alpha_to_one=False,
)
scheduler.push_to_hub("my-controlnet-scheduler")
Pipelines
from diffusers import (
UNet2DConditionModel,
AutoencoderKL,
DDIMScheduler,
StableDiffusionPipeline,
)
from transformers import CLIPTextModel, CLIPTextConfig, CLIPTokenizer
unet = UNet2DConditionModel(
block_out_channels=(32, 64),
layers_per_block=2,
sample_size=32,
in_channels=4,
out_channels=4,
down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
cross_attention_dim=32,
)
scheduler = DDIMScheduler(
beta_start=0.00085,
beta_end=0.012,
beta_schedule="scaled_linear",
clip_sample=False,
set_alpha_to_one=False,
)
vae = AutoencoderKL(
block_out_channels=[32, 64],
in_channels=3,
out_channels=3,
down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
latent_channels=4,
)
text_encoder_config = CLIPTextConfig(
bos_token_id=0,
eos_token_id=2,
hidden_size=32,
intermediate_size=37,
layer_norm_eps=1e-05,
num_attention_heads=4,
num_hidden_layers=5,
pad_token_id=1,
vocab_size=1000,
)
text_encoder = CLIPTextModel(text_encoder_config)
tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")
components = {
"unet": unet,
"scheduler": scheduler,
"vae": vae,
"text_encoder": text_encoder,
"tokenizer": tokenizer,
"safety_checker": None,
"feature_extractor": None,
}
pipeline = StableDiffusionPipeline(**components)
pipeline.push_to_hub("my-pipeline")
Refer to the documentation to know more.
Thanks to @Wauplin for his generous and constructive feedback (see #4218) on this feature.
Providing seamless support for loading Kohya-trained LoRA checkpoints in diffusers is important for us. This is why we continue to improve our load_lora_weights() method. Check out the documentation to know more about what's currently supported and the current limitations.
Thanks to @isidentical for extending their help in improving this support.
Prompt weighting provides a way to emphasize or de-emphasize certain parts of a prompt, allowing for more control over the generated image. compel provides an easy way to do prompt weighting compatible with diffusers. To this end, we have worked on an improved guide. Check it out here.
.safetensors
Starting with this release, we will default to using .safetensors as our preferred serialization method. This change is reflected in all the training examples that we officially support.
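A minimal sketch of what this looks like in practice (the repo id and output path are illustrative; safe_serialization is the relevant save_pretrained() flag):
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# .safetensors is now the default; the flag is spelled out here only for clarity
pipe.save_pretrained("my-local-sd-pipeline", safe_serialization=True)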
- prompt is None by @yiyixuxu in #4278
- make test-examples work correctly by @statelesshz in #4329
- is_safetensors_available() by @chiral-carbon in #4521
- from_single_file by @DN6 in #4571
- UnboundLocalError during LoRA loading by @slessans in #4523
The following contributors have made significant changes to the library over the last release:
0.19.3 is a patch release to make sure import diffusers works without transformers being installed.
It includes a fix of this issue.
[SDXL] Fix dummy imports incorrect naming by @patrickvonplaten in https://github.com/huggingface/diffusers/pull/4370
We still had some bugs 🐛 in 0.19.1, notably:
The official SD-XL 1.0 LoRA (Kohya-styled) is now supported thanks to https://github.com/huggingface/diffusers/pull/4287. You can try it as follows:
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16)
pipe.load_lora_weights("stabilityai/stable-diffusion-xl-base-1.0", weight_name="sd_xl_offset_example-lora_1.0.safetensors")
pipe.to("cuda")
prompt = "beautiful scenery nature glass bottle landscape, purple galaxy bottle"
negative_prompt = "text, watermark"
image = pipe(prompt, negative_prompt=negative_prompt, num_inference_steps=25).images[0]
In addition, a couple more SDXL LoRAs are now supported:
(SDXL 0.9:)
To know more details and the known limitations, please check out the documentation.
Thanks to @isidentical for their sincere help in the PR.
@bghira found that for SDXL Img2Img batched inference led to weird artifacts. That is fixed in: https://github.com/huggingface/diffusers/pull/4327.
Under some circumstances, SD-XL 1.0 could download ONNX weights, which is corrected in https://github.com/huggingface/diffusers/pull/4338.
https://github.com/huggingface/diffusers/pull/4346 allows the user to disable the watermarker under certain circumstances to improve the usability of SDXL.
In 0.19.0, some bugs :bug: found their way into the release. We're very sorry about this :pray:
This patch release fixes all of them.
- prompt is None by @yiyixuxu in #4278
Stable Diffusion XL (SDXL) 1.0, with the permissive CreativeML Open RAIL++-M License, was released today. We provide full compatibility with SDXL in diffusers.
from diffusers import DiffusionPipeline
import torch
pipe = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt).images[0]
image
Many additional cool features are released:
Refer to the documentation to know more.
When there’s a new pipeline, there ought to be new training scripts. We added support for the following training scripts that build on top of SDXL:
Shoutout to @harutatsuakiyama for contributing the training script for InstructPix2Pix in #4079.
The ControlNet and InstructPix2Pix training scripts also needed their respective pipelines. So, we added support for the following pipelines as well:
- StableDiffusionXLControlNetPipeline
- StableDiffusionXLInstructPix2PixPipeline
The ControlNet and InstructPix2Pix pipelines don’t have interesting checkpoints yet. We hope that the community will be able to leverage the training scripts from this release to help produce some.
Shoutout to @harutatsuakiyama for contributing the StableDiffusionXLInstructPix2PixPipeline in #4079.
We now support Auto APIs for the following tasks: text-to-image, image-to-image, and inpainting:
Here is how to use one:
from diffusers import AutoPipelineForText2Image
import torch
pipe_t2i = AutoPipelineForText2Image.from_pretrained(
"runwayml/stable-diffusion-v1-5", requires_safety_checker=False, torch_dtype=torch.float16
).to("cuda")
prompt = "photo a majestic sunrise in the mountains, best quality, 4k"
image = pipe_t2i(prompt).images[0]
image.save("image.png")
Without any extra memory, you can then switch to image-to-image:
from diffusers import AutoPipelineForImage2Image
pipe_i2i = AutoPipelineForImage2Image.from_pipe(pipe_t2i)
image = pipe_i2i("sunrise in snowy mountains", image=image, strength=0.75).images[0]
image.save("image.png")
Supported Pipelines: SDv1, SDv2, SDXL, Kandinsky, ControlNet, IF ... with more to come.
Refer to the documentation to know more.
We introduced a new “combined pipeline” for the Kandinsky series to make it easier to use the Kandinsky prior and decoder together. This eliminates the need to initialize and use multiple pipelines for Kandinsky to generate images. Here is an example:
from diffusers import AutoPipelineForText2Image
import torch
pipe = AutoPipelineForText2Image.from_pretrained(
"kandinsky-community/kandinsky-2-2-decoder", torch_dtype=torch.float16
)
pipe.enable_model_cpu_offload()
prompt = "A lion in galaxies, spirals, nebulae, stars, smoke, iridescent, intricate detail, octane render, 8k"
image = pipe(prompt=prompt, num_inference_steps=25).images[0]
image.save("image.png")
The following pipelines, which can be accessed via the "Auto" pipelines, were added:
To know more, check out the following pages:
NOW: mask_image repaints white pixels and preserves black pixels.
Kandinsky was using an incorrect mask format. Instead of using white pixels as a mask (like SD & IF do), Kandinsky models were using black pixels. This needed to be corrected so that the diffusers API is aligned; we cannot have different mask formats for different pipelines.
Important => This means that everyone who already used Kandinsky Inpaint in production/pipelines now needs to change the mask to:
# For PIL input
import PIL.ImageOps
mask = PIL.ImageOps.invert(mask)
# For PyTorch and Numpy input
mask = 1 - mask
Designing a Better Asymmetric VQGAN for StableDiffusion introduced a VQGAN that is particularly well-suited for inpainting tasks. This release brings support for this new VQGAN. Here is how it can be used:
from io import BytesIO
from PIL import Image
import requests
from diffusers import AsymmetricAutoencoderKL, StableDiffusionInpaintPipeline
def download_image(url: str) -> Image.Image:
response = requests.get(url)
return Image.open(BytesIO(response.content)).convert("RGB")
prompt = "a photo of a person"
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/celeba_hq_256.png"
mask_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/repaint/mask_256.png"
image = download_image(img_url).resize((256, 256))
mask_image = download_image(mask_url).resize((256, 256))
pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
pipe.vae = AsymmetricAutoencoderKL.from_pretrained("cross-attention/asymmetric-autoencoder-kl-x-1-5")
pipe.to("cuda")
image = pipe(prompt=prompt, image=image, mask_image=mask_image).images[0]
image.save("image.jpeg")
Refer to the documentation to know more.
Thanks to @cross-attention for contributing this model in #3956.
We are committed to providing seamless interoperability support of Kohya-trained checkpoints from diffusers. To that end, we improved the existing support for loading Kohya-trained checkpoints in diffusers. Users can expect further improvements in the upcoming releases.
Thanks to @takuma104 and @isidentical for contributing the improvements in #4147.
pip install matplotlib
from PIL import Image
import torch
import numpy as np
import matplotlib
from diffusers import T2IAdapter, StableDiffusionAdapterPipeline
def colorize(value, vmin=None, vmax=None, cmap='gray_r', invalid_val=-99, invalid_mask=None, background_color=(128, 128, 128, 255), gamma_corrected=False, value_transform=None):
"""Converts a depth map to a color image.
Args:
value (torch.Tensor, numpy.ndarray): Input depth map. Shape: (H, W) or (1, H, W) or (1, 1, H, W). All singular dimensions are squeezed
vmin (float, optional): vmin-valued entries are mapped to start color of cmap. If None, value.min() is used. Defaults to None.
vmax (float, optional): vmax-valued entries are mapped to end color of cmap. If None, value.max() is used. Defaults to None.
cmap (str, optional): matplotlib colormap to use. Defaults to 'gray_r'.
invalid_val (int, optional): Specifies value of invalid pixels that should be colored as 'background_color'. Defaults to -99.
invalid_mask (numpy.ndarray, optional): Boolean mask for invalid regions. Defaults to None.
background_color (tuple[int], optional): 4-tuple RGB color to give to invalid pixels. Defaults to (128, 128, 128, 255).
gamma_corrected (bool, optional): Apply gamma correction to colored image. Defaults to False.
value_transform (Callable, optional): Apply transform function to valid pixels before coloring. Defaults to None.
Returns:
numpy.ndarray, dtype - uint8: Colored depth map. Shape: (H, W, 4)
"""
if isinstance(value, torch.Tensor):
value = value.detach().cpu().numpy()
value = value.squeeze()
if invalid_mask is None:
invalid_mask = value == invalid_val
mask = np.logical_not(invalid_mask)
# normalize
vmin = np.percentile(value[mask],2) if vmin is None else vmin
vmax = np.percentile(value[mask],85) if vmax is None else vmax
if vmin != vmax:
value = (value - vmin) / (vmax - vmin) # vmin..vmax
else:
# Avoid 0-division
value = value * 0.
# squeeze last dim if it exists
# grey out the invalid values
value[invalid_mask] = np.nan
cmapper = matplotlib.cm.get_cmap(cmap)
if value_transform:
value = value_transform(value)
# value = value / value.max()
value = cmapper(value, bytes=True) # (nxmx4)
img = value[...]
img[invalid_mask] = background_color
if gamma_corrected:
img = img / 255
img = np.power(img, 2.2)
img = img * 255
img = img.astype(np.uint8)
return img
model = torch.hub.load("isl-org/ZoeDepth", "ZoeD_N", pretrained=True)
img = Image.open('./images/zoedepth_in.png')
out = model.infer_pil(img)
zoedepth_image = Image.fromarray(colorize(out)).convert('RGB')
zoedepth_image.save('images/zoedepth.png')
adapter = T2IAdapter.from_pretrained("TencentARC/t2iadapter_zoedepth_sd15v1", torch_dtype=torch.float16)
pipe = StableDiffusionAdapterPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", adapter=adapter, safety_checker=None, torch_dtype=torch.float16, variant="fp16"
)
pipe.to('cuda')
zoedepth_image_out = pipe(prompt="motorcycle", image=zoedepth_image).images[0]
zoedepth_image_out.save('images/zoedepth_out.png')
- num_processes. by @eliphatfs in #3983
- noise_sampler_seed to StableDiffusionKDiffusionPipeline.__call__ by @sunhs in #3911
- act_fn param to OutValueFunctionBlock by @SauravMaheshkar in #3994
- text_encoder on stable_diffusion_xl pipelines by @apolinario in #4156
- network_alpha when loading unet lora from old format by @Jackmin801 in #4221
- prompt embeds in sdxl by @xiaohu2015 in #4099
The following contributors have made significant changes to the library over the last release:
Patch release to fix:
- torch.compile for SD-XL for certain GPUs
- from_single_file for all SD models
Note:
Loading any stable diffusion safetensors or ckpt with StableDiffusionPipeline.from_single_file or StableDiffusionImg2ImgPipeline.from_single_file or StableDiffusionInpaintPipeline.from_single_file or StableDiffusionXLPipeline.from_single_file, ...
is now almost as fast as from_pretrained(...) and it's much more tested now.
All commits:
- float16 when using PyTorch 2 or xFormers by @pcuenca in #4019
- not in upscale pipeline by @pcuenca in #4020
- force_download in download utility by @Wauplin in #4036
Stable Diffusion XL 0.9 is now fully supported under the SDXL 0.9 Research License here.
Having received access to stabilityai/stable-diffusion-xl-base-0.9, you can easily use it with diffusers:
from diffusers import StableDiffusionXLPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt).images[0]
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
)
refiner.to("cuda")
prompt = "Astronaut in a jungle, cold color palette, muted colors, detailed, 8k"
image = pipe(prompt=prompt, output_type="latent" if use_refiner else "pil").images[0]
image = refiner(prompt=prompt, image=image[None, :]).images[0]
from diffusers import StableDiffusionXLPipeline, StableDiffusionXLImg2ImgPipeline
import torch
pipe = StableDiffusionXLPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-base-0.9", torch_dtype=torch.float16, variant="fp16", use_safetensors=True
)
pipe.to("cuda")
refiner = StableDiffusionXLImg2ImgPipeline.from_pretrained(
"stabilityai/stable-diffusion-xl-refiner-0.9", torch_dtype=torch.float16, use_safetensors=True, variant="fp16"
)
refiner.to("cuda")
- pipe.to("cuda")
+ pipe.enable_model_cpu_offload()
and
- refiner.to("cuda")
+ refiner.enable_model_cpu_offload()
+ pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
+ refiner.unet = torch.compile(refiner.unet, mode="reduce-overhead", fullgraph=True)
Note: If you're running the model with torch < 2.0, please make sure to run:
+ pipe.enable_xformers_memory_efficient_attention()
+ refiner.enable_xformers_memory_efficient_attention()
For more details have a look at the official docs.
- Dropout to Flax UNet by @SauravMaheshkar in #3894
Shap-E is a 3D image generation model from OpenAI introduced in Shap-E: Generating Conditional 3D Implicit Functions.
We provide support for text-to-3D and 2D-to-3D image generation in diffusers.
import torch
from diffusers import ShapEPipeline
from diffusers.utils import export_to_gif
ckpt_id = "openai/shap-e"
pipe = ShapEPipeline.from_pretrained(ckpt_id).to("cuda")
guidance_scale = 15.0
prompt = "A birthday cupcake"
images = pipe(
prompt,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
).images
gif_path = export_to_gif(images[0], "cake_3d.gif")
import torch
from diffusers import ShapEImg2ImgPipeline
from diffusers.utils import export_to_gif, load_image
ckpt_id = "openai/shap-e-img2img"
pipe = ShapEImg2ImgPipeline.from_pretrained(ckpt_id).to("cuda")
img_url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/shap_e/burger_in.png"
image = load_image(img_url)
generator = torch.Generator(device="cuda").manual_seed(0)
batch_size = 4
guidance_scale = 3.0
images = pipe(
image,
num_images_per_prompt=batch_size,
generator=generator,
guidance_scale=guidance_scale,
num_inference_steps=64,
frame_size=256,
output_type="pil"
).images
gif_path = export_to_gif(images[0], "burger_sampled_3d.gif")
Original image
Generated
For more details, check out the official documentation.
The model was contributed by @yiyixuxu in https://github.com/huggingface/diffusers/pull/3742.
Consistency models are diffusion models supporting fast one-step or few-step image generation. They were proposed by OpenAI in Consistency Models.
import torch
from diffusers import ConsistencyModelPipeline
device = "cuda"
# Load the cd_imagenet64_l2 checkpoint.
model_id_or_path = "openai/diffusers-cd_imagenet64_l2"
pipe = ConsistencyModelPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to(device)
# Onestep Sampling
image = pipe(num_inference_steps=1).images[0]
image.save("consistency_model_onestep_sample.png")
# Onestep sampling, class-conditional image generation
# ImageNet-64 class label 145 corresponds to king penguins
image = pipe(num_inference_steps=1, class_labels=145).images[0]
image.save("consistency_model_onestep_sample_penguin.png")
# Multistep sampling, class-conditional image generation
# Timesteps can be explicitly specified; the particular timesteps below are from the original Github repo.
# https://github.com/openai/consistency_models/blob/main/scripts/launch.sh#L77
image = pipe(timesteps=[22, 0], class_labels=145).images[0]
image.save("consistency_model_multistep_sample_penguin.png")
For more details, see the official docs.
The model was contributed by our community members @dg845 and @ayushtues in https://github.com/huggingface/diffusers/pull/3492.
Previous video generation pipelines tended to produce watermarks because those watermarks were present in their pretraining dataset. With the latest additions of the following checkpoints, we can now generate watermark-free videos:
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video
pipe = DiffusionPipeline.from_pretrained("cerspense/zeroscope_v2_576w", torch_dtype=torch.float16)
pipe.enable_model_cpu_offload()
# memory optimization
pipe.unet.enable_forward_chunking(chunk_size=1, dim=1)
pipe.enable_vae_slicing()
prompt = "Darth Vader surfing a wave"
video_frames = pipe(prompt, num_frames=24).frames
video_path = export_to_video(video_frames)
For more details, check out the official docs.
It was contributed by @patrickvonplaten in https://github.com/huggingface/diffusers/pull/3900.
- StableDiffusionKDiffusionPipeline by @tripathiarpan20 in #3751
- train_text_to_image.py script by @sayakpaul in #3810
- resnet.py by @SauravMaheshkar in #3868
- timestep_spacing and steps_offset to schedulers by @pcuenca in #3947
- UNet2DConditionOutput pickle-able by @prathikr in #3857
- torch.compile() compatibility by @sayakpaul in #3949
The following contributors have made significant changes to the library over the last release:
Patch release to fix timestep for inpainting
Kandinsky 2.1 inherits best practices from DALL-E 2 and Latent Diffusion while introducing some new ideas.
pip install diffusers transformers accelerate
from diffusers import DiffusionPipeline
import torch
pipe_prior = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1-prior", torch_dtype=torch.float16)
pipe_prior.to("cuda")
t2i_pipe = DiffusionPipeline.from_pretrained("kandinsky-community/kandinsky-2-1", torch_dtype=torch.float16)
t2i_pipe.to("cuda")
prompt = "A alien cheeseburger creature eating itself, claymation, cinematic, moody lighting"
negative_prompt = "low quality, bad quality"
generator = torch.Generator(device="cuda").manual_seed(12)
image_embeds, negative_image_embeds = pipe_prior(prompt, negative_prompt, guidance_scale=1.0, generator=generator).to_tuple()
image = t2i_pipe(prompt, negative_prompt=negative_prompt, image_embeds=image_embeds, negative_image_embeds=negative_image_embeds).images[0]
image.save("cheeseburger_monster.png")
To learn more about the Kandinsky pipelines, and more details about speed and memory optimizations, please have a look at the docs.
Thanks @ayushtues, for helping with the integration of Kandinsky 2.1!
UniDiffuser introduces a multimodal diffusion process that is capable of handling different generation tasks using a single unified approach:
Below is an example of how to use UniDiffuser for text-to-image generation:
import torch
from diffusers import UniDiffuserPipeline
model_id_or_path = "thu-ml/unidiffuser-v1"
pipe = UniDiffuserPipeline.from_pretrained(model_id_or_path, torch_dtype=torch.float16)
pipe.to("cuda")
# This mode can be inferred from the input provided to the `pipe`.
pipe.set_text_to_image_mode()
prompt = "an elephant under the sea"
sample = pipe(prompt=prompt, num_inference_steps=20, guidance_scale=8.0).images[0]
sample.save("elephant.png")
Check out the UniDiffuser docs to know more.
UniDiffuser was added by @dg845 in this PR.
We're happy to support A1111-formatted CivitAI LoRA checkpoints in a limited capacity.
First, download a checkpoint. We’ll use this one for demonstration purposes.
wget https://civitai.com/api/download/models/15603 -O light_and_shadow.safetensors
Next, we initialize a DiffusionPipeline:
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler
pipeline = StableDiffusionPipeline.from_pretrained(
"gsdf/Counterfeit-V2.5", torch_dtype=torch.float16, safety_checker=None
).to("cuda")
pipeline.scheduler = DPMSolverMultistepScheduler.from_config(
pipeline.scheduler.config, use_karras_sigmas=True
)
We then load the checkpoint downloaded from CivitAI:
pipeline.load_lora_weights(".", weight_name="light_and_shadow.safetensors")
(If you’re loading a checkpoint in the safetensors format, please ensure you have safetensors installed.)
And then it’s time for running inference:
prompt = "masterpiece, best quality, 1girl, at dusk"
negative_prompt = ("(low quality, worst quality:1.4), (bad anatomy), (inaccurate limb:1.2), "
"bad composition, inaccurate eyes, extra digit, fewer digits, (extra arms:1.2), large breasts")
images = pipeline(prompt=prompt,
negative_prompt=negative_prompt,
width=512,
height=768,
num_inference_steps=15,
num_images_per_prompt=4,
generator=torch.manual_seed(0)
).images
Below is a comparison between the LoRA and the non-LoRA results:
Check out the docs to learn more.
Thanks to @takuma104 for contributing this feature via this PR.
We introduced Torch 2.0 support for computing attention efficiently in 0.13.0. Since then, we have made a number of improvements to reduce the number of "graph breaks" in our models so that they can be compiled with torch.compile(). As a result, we are happy to report massive improvements in the inference speed of our most popular pipelines. Check out this doc to know more.
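A minimal sketch of what enabling compilation looks like (assuming PyTorch 2.0; the first call is slow because of compilation, subsequent calls are fast):
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# compile the UNet, the most compute-heavy component
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
image = pipe("a photo of an astronaut riding a horse").images[0]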
Thanks to @Chillee for helping us with this. Thanks to @patrickvonplaten for fixing the problems stemming from "graph breaks" in this PR.
We added a VaeImageProcessor class that provides a unified API for pipelines to prepare their image inputs, as well as to post-process their outputs. It supports resizing, normalization, and conversion between PIL Images, PyTorch tensors, and NumPy arrays.
With that, all Stable Diffusion pipelines now accept image inputs as PyTorch tensors and NumPy arrays, in addition to PIL Images, and can produce outputs in these three formats. They also accept and return latents. This means you can now take generated latents from one pipeline and pass them to another as inputs, without leaving the latent space. If you work with multiple pipelines, you can pass PyTorch tensors between them without converting to PIL Images.
To learn more about the API, check out our doc here.
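For example, here is a minimal sketch of staying in latent space between two pipelines (the prompts are illustrative):
import torch
from diffusers import StableDiffusionImg2ImgPipeline, StableDiffusionPipeline
text2img = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# build an img2img pipeline that shares the same components (no extra memory)
img2img = StableDiffusionImg2ImgPipeline(**text2img.components)
# output_type="latent" skips decoding to PIL
latents = text2img("a fantasy landscape", output_type="latent").images
# pass the latents straight in as the img2img input
image = img2img("a fantasy landscape, oil painting", image=latents, strength=0.6).images[0]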
ControlNet is one of the most used diffusion models, and upon strong demand from the community we added ControlNet img2img and ControlNet inpaint pipelines. This makes it possible to use any ControlNet checkpoint for both the image-to-image setting and for inpainting.
:point_right: Inpaint: See the ControlNet inpaint model here
:point_right: Image-to-Image: Any ControlNet checkpoint can be used for image-to-image, e.g.:
from diffusers import StableDiffusionControlNetImg2ImgPipeline, ControlNetModel, UniPCMultistepScheduler
from diffusers.utils import load_image
import numpy as np
import torch
import cv2
from PIL import Image
# download an image
image = load_image(
"https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/input_image_vermeer.png"
)
np_image = np.array(image)
# get canny image
np_image = cv2.Canny(np_image, 100, 200)
np_image = np_image[:, :, None]
np_image = np.concatenate([np_image, np_image, np_image], axis=2)
canny_image = Image.fromarray(np_image)
# load control net and stable diffusion v1-5
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetImg2ImgPipeline.from_pretrained(
"runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
# speed up diffusion process with faster scheduler and memory optimization
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
# generate image
generator = torch.manual_seed(0)
image = pipe(
"futuristic-looking woman",
num_inference_steps=20,
generator=generator,
image=image,
control_image=canny_image,
).images[0]
This pipeline (introduced in DiffEdit: Diffusion-based semantic image editing with mask guidance) allows for image editing with natural language. Below is an end-to-end example.
First, let’s load our pipeline:
import torch
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
sd_model_ckpt = "stabilityai/stable-diffusion-2-1"
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
sd_model_ckpt,
torch_dtype=torch.float16,
safety_checker=None,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
generator = torch.manual_seed(0)
Then, we load an input image to edit using our method:
from diffusers.utils import load_image
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).convert("RGB").resize((768, 768))
Then, we employ the source and target prompts to generate the editing mask:
source_prompt = "a bowl of fruits"
target_prompt = "a basket of fruits"
mask_image = pipeline.generate_mask(
image=raw_image,
source_prompt=source_prompt,
target_prompt=target_prompt,
generator=generator,
)
Then, we employ the caption and the input image to get the inverted latents:
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image, generator=generator).latents
Now, generate the image with the inverted latents and semantically generated mask:
image = pipeline(
prompt=target_prompt,
mask_image=mask_image,
image_latents=inv_latents,
generator=generator,
negative_prompt=source_prompt,
).images[0]
image.save("edited_image.png")
Check out the docs to learn more about this pipeline.
Thanks to @clarencechen for contributing this pipeline in this PR.
Apart from these, we have made multiple improvements to the overall quality-of-life of our docs.
Thanks to @stevhliu for leading the charge here.
- use_Karras_sigmas to LMSDiscreteScheduler by @Isotr0py in #3351
- sigmoid beta_scheduler to docstrings of relevant Schedulers by @Laurent2916 in #3399
- AttnProcessor2_0 by @sayakpaul in #3457
- use_Karras_sigmas to DPMSolverSinglestepScheduler by @Isotr0py in #3476
- torch.compile tests in separate subprocesses by @pcuenca in #3503
- encoder_hid_dim_type=="text_proj" and allow xformers by @patrickvonplaten in #3615
The following contributors have made significant changes to the library over the last release:
IF is a pixel-based text-to-image generation model and was released in late April 2023 by DeepFloyd.
The model architecture is strongly inspired by Google's closed-source Imagen; IF is a novel state-of-the-art open-source text-to-image model with a high degree of photorealism and language understanding:
pip install torch --upgrade # diffusers' IF is optimized for torch 2.0
pip install diffusers --upgrade
Before you can use IF, you need to accept its usage conditions. To do so:
from huggingface_hub import login
login()
and enter your Hugging Face Hub access token.
from diffusers import DiffusionPipeline
from diffusers.utils import pt_to_pil
import torch
# stage 1
stage_1 = DiffusionPipeline.from_pretrained("DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_1.enable_model_cpu_offload()
# stage 2
stage_2 = DiffusionPipeline.from_pretrained(
"DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()
# stage 3
safety_modules = {
"feature_extractor": stage_1.feature_extractor,
"safety_checker": stage_1.safety_checker,
"watermarker": stage_1.watermarker,
}
stage_3 = DiffusionPipeline.from_pretrained(
"stabilityai/stable-diffusion-x4-upscaler", **safety_modules, torch_dtype=torch.float16
)
stage_3.enable_model_cpu_offload()
prompt = 'a photo of a kangaroo wearing an orange hoodie and blue sunglasses standing in front of the eiffel tower holding a sign that says "very deep learning"'
generator = torch.manual_seed(1)
# text embeds
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)
# stage 1
image = stage_1(
prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, generator=generator, output_type="pt"
).images
pt_to_pil(image)[0].save("./if_stage_I.png")
# stage 2
image = stage_2(
image=image,
prompt_embeds=prompt_embeds,
negative_prompt_embeds=negative_embeds,
generator=generator,
output_type="pt",
).images
pt_to_pil(image)[0].save("./if_stage_II.png")
# stage 3
image = stage_3(prompt=prompt, image=image, noise_level=100, generator=generator).images
image[0].save("./if_stage_III.png")
For more details about speed and memory optimizations, please have a look at the blog or docs below.
:point_right: The official codebase :point_right: Blog post :point_right: Space Demo :point_right: In-detail docs
Lvmin Zhang has released improved ControlNet checkpoints as well as a couple of new ones.
You can find all :firecracker: Diffusers checkpoints here. Please have a look directly at the model cards to learn how to use the checkpoints:
| Model Name | Control Image Overview | Control Image Example | Generated Image Example |
|---|---|---|---|
| lllyasviel/control_v11p_sd15_canny<br/> Trained with canny edge detection | A monochrome image with white edges on a black background. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_canny/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11p_sd15_mlsd<br/> Trained with multi-level line segment detection | An image with annotated line segments. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_mlsd/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11f1p_sd15_depth<br/> Trained with depth estimation | An image with depth information, usually represented as a grayscale image. | <a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11f1p_sd15_depth/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11p_sd15_normalbae<br/> Trained with surface normal estimation | An image with surface normal information, usually represented as a color-coded image. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_normalbae/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11p_sd15_seg<br/> Trained with image segmentation | An image with segmented regions, usually represented as a color-coded image. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_seg/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11p_sd15_lineart<br/> Trained with line art generation | An image with line art, usually black lines on a white background. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_lineart/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11p_sd15_openpose<br/> Trained with human pose estimation | An image with human poses, usually represented as a set of keypoints or skeletons. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_openpose/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11p_sd15_scribble<br/> Trained with scribble-based image generation | An image with scribbles, usually random or user-drawn strokes. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_scribble/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11p_sd15_softedge<br/> Trained with soft edge image generation | An image with soft edges, usually to create a more painterly or artistic effect. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_softedge/resolve/main/images/image_out.png"/></a> |
| Model Name | Control Image Overview | Control Image Example | Generated Image Example |
|---|---|---|---|
| lllyasviel/control_v11e_sd15_ip2p<br/> Trained with pixel to pixel instruction | No condition. | <a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_ip2p/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11p_sd15_inpaint<br/> Trained with image inpainting | No condition. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15_inpaint/resolve/main/images/output.png"/></a> |
| lllyasviel/control_v11e_sd15_shuffle<br/> Trained with image shuffling | An image with shuffled patches or regions. | <a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11e_sd15_shuffle/resolve/main/images/image_out.png"/></a> |
| lllyasviel/control_v11p_sd15s2_lineart_anime<br/> Trained with anime line art generation | An image with anime-style line art. | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"><img width="64" style="margin:0;padding:0;" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/control.png"/></a> | <a href="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"><img width="64" src="https://huggingface.co/lllyasviel/control_v11p_sd15s2_lineart_anime/resolve/main/images/image_out.png"/></a> |
- pipeline_stable_diffusion_controlnet.py by @remorses in #3118
- Transformer2DModel.forward docstring by @off99555 in #3074
- from_flax work for controlnet by @yiyixuxu in #3161
- Karras sigmas to HeunDiscreteScheduler by @youssefadr in #3160
The following contributors have made significant changes to the library over the last release:
Fixes bugs related to missing global pooling in ControlNet, an img2img processor issue with the safety checker, uneven timesteps, and better config deprecation handling.
We are very excited about this release! It brings new pipelines for video and audio to diffusers, showing that diffusion is a great choice for all sorts of generative tasks. The modular, pluggable approach of diffusers was crucial to integrate the new models intuitively and cohesively with the rest of the library. We hope you appreciate the consistency of the APIs and implementations, as our ultimate goal is to provide the best toolbox to help you solve the tasks you're interested in. Don't hesitate to get in touch if you use diffusers for other projects!
In addition to that, diffusers 0.15 includes a lot of new features and improvements. From performance and deployment improvements (faster pipeline loading) to increased flexibility for creative tasks (Karras sigmas, weight prompting, support for Automatic1111 textual inversion embeddings) to additional customization options (Multi-ControlNet) to training utilities (ControlNet, Min-SNR weighting). Read on for the details!
Text-guided video generation is not a fantasy anymore - it's as simple as spinning up a Colab and running either of the two powerful open-sourced video generation models.
Alibaba's DAMO Vision Intelligence Lab has open-sourced a first research-only video generation model that can generate some powerful video clips of up to a minute. To see Darth Vader riding a wave, simply copy-paste the following lines into your favorite Python interpreter:
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
from diffusers.utils import export_to_video
pipe = DiffusionPipeline.from_pretrained("damo-vilab/text-to-video-ms-1.7b", torch_dtype=torch.float16, variant="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe.enable_model_cpu_offload()
prompt = "Spiderman is surfing"
video_frames = pipe(prompt, num_inference_steps=25).frames
video_path = export_to_video(video_frames)
For more information you can have a look at "damo-vilab/text-to-video-ms-1.7b"
Text2Video-Zero is a zero-shot text-to-video synthesis diffusion model that enables low-cost yet consistent video generation using only pre-trained text-to-image diffusion models, such as Stable Diffusion v1-5. It also naturally supports extensions of pre-trained text-to-image models such as Instruct Pix2Pix, ControlNet, and DreamBooth, on top of which we present Video Instruct Pix2Pix, Pose Conditional, Edge Conditional, and Edge Conditional + DreamBooth specialized applications.
For more information please have a look at PAIR/Text2Video-Zero
Text-guided audio generation has made great progress over the last months with many advances being based on diffusion models. The 0.15.0 release includes two powerful audio diffusion models.
Inspired by Stable Diffusion, AudioLDM is a text-to-audio latent diffusion model (LDM) that learns continuous audio representations from CLAP latents. AudioLDM takes a text prompt as input and predicts the corresponding audio. It can generate text-conditional sound effects, human speech and music.
from diffusers import AudioLDMPipeline
import torch
repo_id = "cvssp/audioldm"
pipe = AudioLDMPipeline.from_pretrained(repo_id, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "Techno music with a strong, upbeat tempo and high melodic riffs"
audio = pipe(prompt, num_inference_steps=10, audio_length_in_s=5.0).audios[0]
The resulting audio output can be saved as a .wav file:
import scipy
scipy.io.wavfile.write("techno.wav", rate=16000, data=audio)
For more information see cvssp/audioldm
This model from the Magenta team is a MIDI-to-audio generator. The pipeline takes a MIDI file as input and autoregressively generates 5-second spectrograms, which are concatenated together at the end and decoded to audio via a spectrogram decoder.
from diffusers import SpectrogramDiffusionPipeline, MidiProcessor
pipe = SpectrogramDiffusionPipeline.from_pretrained("google/music-spectrogram-diffusion")
pipe = pipe.to("cuda")
processor = MidiProcessor()
# Download MIDI from: wget http://www.piano-midi.de/midis/beethoven/beethoven_hammerklavier_2.mid
output = pipe(processor("beethoven_hammerklavier_2.mid"))
audio = output.audios[0]
Documentation is crucially important for diffusers, as it's one of the first resources where people try to understand how everything works and fix any issues they are observing. We have spent a lot of time in this release reviewing all documents, adding new ones, reorganizing sections and bringing code examples up to date with the latest APIs. This effort has been led by @stevhliu (thanks a lot! 🙌) and @yiyixuxu, but many others have chimed in and contributed.
Check it out: https://huggingface.co/docs/diffusers/index
Don't hesitate to open PRs for fixes to the documentation, they are greatly appreciated as discussed in our (revised, of course) contribution guide.
Stable UnCLIP is the best open-sourced image variation model out there. Pass an initial image and optionally a prompt to generate variations of the image:
from diffusers import DiffusionPipeline
from diffusers.utils import load_image
import torch
pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-unclip-small", torch_dtype=torch.float16)
pipe.to("cuda")
# get image
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/stable_unclip/tarsila_do_amaral.png"
image = load_image(url)
# run image variation
image = pipe(image).images[0]
For more information you can have a look at "stabilityai/stable-diffusion-2-1-unclip"
ControlNet was released in diffusers in version 0.14.0, but we have some exciting developments: Multi-ControlNet, a training script, an upcoming event, and a community image-to-image pipeline contributed by @mikegarts!
Thanks to community member @takuma104, it's now possible to use several ControlNet conditioning models at once! It works with the same API as before, only supplying a list of ControlNets instead of just one:
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
controlnet_canny = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny",
torch_dtype=torch.float16).to("cuda")
controlnet_pose = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-openpose",
torch_dtype=torch.float16).to("cuda")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
"example/a-sd15-variant-model", torch_dtype=torch.float16,
controlnet=[controlnet_pose, controlnet_canny]
).to("cuda")
pose_image = ...
canny_image = ...
prompt = ...
image = pipe(prompt=prompt, image=[pose_image, canny_image]).images[0]
And this is an example of how this affects generation:
| Control Image1 | Control Image2 | Generated |
|---|---|---|
| <img width="200" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/multi_controlnet/pac_pose_512x512.png"> | <img width="200" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/multi_controlnet/pac_canny_512x512.png"> | <img width="200" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/multi_controlnet/mc_pose_and_canny_result_19.png"> |
| <img width="200" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/multi_controlnet/pac_pose_512x512.png"> | (none) | <img width="200" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/multi_controlnet/mc_pose_only_result_19.png"> |
| <img width="200" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/multi_controlnet/pac_canny_512x512.png"> | (none) | <img width="200" src="https://huggingface.co/takuma104/controlnet_dev/resolve/main/multi_controlnet/mc_canny_only_result_19.png"> |
We have created a training script for ControlNet, and can't wait to see what new ideas the community may come up with! In fact, we are so pumped about it that we are organizing a JAX Diffusers sprint with a special focus on ControlNet, where participant teams will be assigned TPUs v4-8 to work on their projects :exploding_head:. Those are some mean machines, so make sure you join our discord to follow the event: https://discord.com/channels/879548962464493619/897387888663232554/1092751149217615902.
Several great contributors have been working on textual inversion to get the most out of it. @isamu-isozaki made it possible to perform multi-token training, and @piEsposito & @GuiyeC created an easy way to load textual inversion embeddings. These contributors are always a pleasure to work with 🙌, we feel honored and proud of this community 🙏
Loading textual inversion embeddings is compatible with the Automatic1111 format, so you can download embeddings from other services (such as civitai), and easily apply them in diffusers. Please check the updated documentation for details.
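A minimal sketch (the embedding file and its trigger token are hypothetical placeholders):
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
# load an A1111-format embedding downloaded from e.g. civitai (hypothetical file)
pipe.load_textual_inversion("./my_embedding.pt", token="<my-concept>")
image = pipe("a painting of <my-concept>").images[0]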
We conducted a thorough investigation of the pipeline loading process to make it as fast as possible. This is the before and after:
Previous: 2.27 sec
Now: 1.1 sec
Instead of performing 3 HTTP operations, we now get all we need with just one. That single call is necessary to check whether any of the components in the pipeline were updated – if that's the case, then we need to download the new files. This improvement also applies when you load individual models instead of pre-trained pipelines.
This may not sound like much, but many people use diffusers for user-facing services where models and pipelines have to be reused on demand. By minimizing latency, they can provide a better service to their users and minimize operating costs.
This can be further reduced by forcing diffusers to just use the items on disk and never check for updates. This is not recommended for most users, but can be interesting in production environments.
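A minimal sketch of that opt-in behavior (local_files_only skips the remote check and loads exclusively from the local cache):
from diffusers import DiffusionPipeline
# never hit the network; fail if the files are not already cached
pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", local_files_only=True
)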
compel
Weight prompting is a popular method to increase the importance of some of the elements that appear in a text prompt, as a way to force image generation to obey those concepts. Because diffusers is used in a multitude of services and projects, we wanted to provide a very flexible way to adopt prompt weighting, so users can ultimately build the system they prefer. Our approach was to:
- compel, by @damian0815, as a higher-level library to create the weighted embeddings.
You don't have to use compel to create the embeddings, but if you do, this is an example of how it looks in practice:
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler
from compel import Compel
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
compel_proc = Compel(tokenizer=pipe.tokenizer, text_encoder=pipe.text_encoder)
prompt = "a red cat playing with a ball++"
prompt_embeds = compel_proc(prompt)
image = pipe(prompt_embeds=prompt_embeds, num_inference_steps=20).images[0]
As you can see, we assign more weight to the ball word using a compel-specific syntax (ball++). You can use other libraries (or your own) to create appropriate embeddings to pass to the pipeline.
You can read more details in the documentation.
Some diffusers schedulers now support Karras sigmas! Thanks @nipunjindal!
See Add Karras pattern to discrete euler in #2956 for more information.
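A minimal sketch of enabling them (use_karras_sigmas is the same scheduler flag used in the CivitAI LoRA example above):
from diffusers import EulerDiscreteScheduler, StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
# enable the Karras sigma schedule on a supported scheduler
pipe.scheduler = EulerDiscreteScheduler.from_config(
    pipe.scheduler.config, use_karras_sigmas=True
)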
- safetensors and LoRa. by @Narsil in #2448
- xformers support to train_unconditional.py by @vvvm23 in #2520
- transformers is not released yet by @patrickvonplaten in #2623
- EMAModel by @sayakpaul in #2530
- use_safetensors argument to give more control to users by @Narsil in #2123
- optimum by @sayakpaul in #2702
- mps: remove warmup passes by @pcuenca in #2771
- mps in text-to-video tests by @pcuenca in #2792
- examples README.md to include the latest examples by @sayakpaul in #2839
- last_epoch argument to optimization.get_scheduler by @felixblanke in #2850
- image_embeds None case is handled properly in StableUnCLIPImg2ImgPipeline by @sayakpaul in #2861
- StableUnCLIPPipeline in the pipeline docs by @sayakpaul in #2897
- Karras sigmas for StableDiffusionKDiffusionPipeline by @takuma104 in #2874
- upload_folder in training scripts by @Wauplin in #2934
- AttentionProcessor.group_norm num_channels should be query_dim by @williamberman in #3046
The following contributors have made significant changes to the library over the last release: