Thanks to an amazing collaboration with community member @takuma104 🙌, diffusers fully supports ControlNet! All 8 control models from the paper are available for you to use: depth, scribbles, edges, and more. Best of all, you can take advantage of all the other goodies and optimizations that Diffusers provides out of the box, making this an ultra-fast implementation of ControlNet. Take it for a spin to see for yourself.
ControlNet works by training a copy of some of the layers of the original Stable Diffusion model on additional signals, such as depth maps or scribbles. After training, you can provide a depth map as a strong hint of the composition you want to achieve, and have Stable Diffusion fill in the details for you. For example:
<table> <tr style="text-align: center;"> <th>Before</th> <th>After</th> </tr> <tr> <td><img class="mx-auto" src="https://huggingface.co/datasets/YiYiXu/controlnet-testing/resolve/main/house_depth.png" width=300/></td> <td><img class="mx-auto" src="https://huggingface.co/datasets/YiYiXu/controlnet-testing/resolve/main/house_after.jpeg" width=300/></td> </tr> </table>

Currently, there are 8 published control models, all of which were trained on runwayml/stable-diffusion-v1-5 (i.e., Stable Diffusion version 1.5). This is an example that uses the scribble controlnet model:
Or you can turn a cartoon into a realistic photo with incredible coherence:
<img src="https://huggingface.co/datasets/YiYiXu/controlnet-testing/resolve/main/lofi.jpg" height="400" alt="ControlNet showing a photo generated from a cartoon frame">

How do you use ControlNet in diffusers? Just like this (example for the canny edges control model):
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
)
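To generate an image you first prepare a conditioning image. A minimal, hedged sketch for canny (the input file, prompt, and thresholds below are just examples, following the same preprocessing pattern as the official ControlNet examples):

import cv2
import numpy as np
from PIL import Image

# hypothetical input image; replace with your own
input_image = np.array(Image.open("input.png").convert("RGB"))
edges = cv2.Canny(input_image, 100, 200)  # low/high thresholds are illustrative
canny_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

pipe = pipe.to("cuda")
image = pipe("a modern house, best quality", image=canny_image, num_inference_steps=20).images[0]
image.save("controlnet_canny.png")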
As usual, you can use all the features in the diffusers toolbox: super-fast schedulers, memory-efficient attention, model offloading, etc. We think 🧨 Diffusers is the best way to iterate on your ControlNet experiments!
Please refer to our blog post and documentation for details.
(And, coming soon, ControlNet training – stay tuned!)
Another community member, @kig, conceived, proposed and fully implemented an amazing PR that allows generation of ultra-high-resolution images without blowing up memory 🤯. It follows a tiling approach during the image decoding phase of the process, generating a piece of the image at a time and then stitching the pieces together. Tiles are blended carefully to avoid visible seams between them, and the final result is amazing. This is the additional code you need to enjoy high-resolution generations:
pipe.vae.enable_tiling()
That's it!
For a complete example, refer to the PR or the code snippet we reproduce here for your convenience:
import torch
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", revision="fp16", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
pipe.enable_xformers_memory_efficient_attention()
pipe.vae.enable_tiling()
prompt = "a beautiful landscape photo"
image = pipe(prompt, width=4096, height=2048, num_inference_steps=10).images[0]
image.save("4k_landscape.jpg")
- mps test fixes by @pcuenca in #2470
- train_unconditional by @pcuenca in #2481

The following contributors have made significant changes to the library over the last release:
Full Changelog: https://github.com/huggingface/diffusers/compare/v0.13.0...v0.14.0
There has been much recent work on fine-grained control of diffusion networks!
Diffusers now supports several of these techniques, described in the sections below.
See our doc on controlling image generation and the individual pipeline docs for more details on the individual methods.
Latent Upscaler is a diffusion model that is designed explicitly for Stable Diffusion. You can take the generated latent from Stable Diffusion and pass it into the upscaler before decoding with your standard VAE. Or you can take any image, encode it into the latent space, use the upscaler, and decode it. It is incredibly flexible and can work with any Stable Diffusion checkpoint.
(Side-by-side comparison: original output image vs. 2x upscaled output image.)
The model was developed by Katherine Crowson in collaboration with Stability AI.
from diffusers import StableDiffusionLatentUpscalePipeline, StableDiffusionPipeline
import torch
pipeline = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipeline.to("cuda")
upscaler = StableDiffusionLatentUpscalePipeline.from_pretrained("stabilityai/sd-x2-latent-upscaler", torch_dtype=torch.float16)
upscaler.to("cuda")
prompt = "a photo of an astronaut high resolution, unreal engine, ultra realistic"
generator = torch.manual_seed(33)
# we stay in latent space! Let's make sure that Stable Diffusion returns the image
# in latent space
low_res_latents = pipeline(prompt, generator=generator, output_type="latent").images
upscaled_image = upscaler(
    prompt=prompt,
    image=low_res_latents,
    num_inference_steps=20,
    guidance_scale=0,
    generator=generator,
).images[0]
# Let's save the upscaled image under "upscaled_astronaut.png"
upscaled_image.save("astronaut_1024.png")
# as a comparison: Let's also save the low-res image
with torch.no_grad():
    image = pipeline.decode_latents(low_res_latents)
image = pipeline.numpy_to_pil(image)[0]
image.save("astronaut_512.png")
In addition to new features and an increasing number of pipelines, diffusers cares a lot about performance. This release brings a number of optimizations that you can turn on easily.
Memory efficient attention, as implemented by xFormers, has been available in diffusers for some time. The problem was that installing xFormers could be complicated because there were no official pip wheels (or they were outdated), and you had to resort to installing from source.
From xFormers 0.0.16, official pip wheels are now published with every release, so installing and using xFormers is now as simple as these two steps:
1. Run pip install xformers in your terminal.
2. Call pipe.enable_xformers_memory_efficient_attention() in your code to opt in for your pipelines.

These actions will unlock dramatic memory savings, and usually faster inference too!
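For example (a minimal sketch; the checkpoint and prompt are just examples):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
image = pipe("a photo of an astronaut riding a horse on mars").images[0]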
See more details in the documentation.
Speaking of memory-efficient attention, Accelerated PyTorch 2.0 Transformers now comes with built-in native support for it! When PyTorch 2.0 is released you'll no longer have to install xFormers or any third-party package to take advantage of it. In diffusers we are already preparing for that, and it works out of the box. So, if you happen to be using the latest "nightlies" of PyTorch 2.0 beta, then you're all set – diffusers will use Accelerated PyTorch 2.0 Transformers by default.
In our tests, the built-in PyTorch 2.0 implementation is usually as fast as xFormers', and sometimes even faster. Performance depends on the card you are using and whether you run your code in float16 or float32, so check our documentation for details.
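If you want to verify that your PyTorch build ships the native implementation, a quick check (a hypothetical snippet, not a diffusers API):

import torch

# PyTorch 2.0 exposes memory-efficient attention as a core function
print(hasattr(torch.nn.functional, "scaled_dot_product_attention"))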
Community member @keturn, with whom we have enjoyed thoughtful software design conversations, called our attention to the fact that enabling sequential cpu offloading via enable_sequential_cpu_offload worked great to save a lot of memory, but made inference much slower.
This is because enable_sequential_cpu_offload() is optimized for memory, and it recursively works across all the submodules contained in a model, moving them to GPU when they are needed and back to CPU when another submodule needs to run. These cpu-to-gpu-to-cpu transfers happen hundreds of times during the stable diffusion denoising loops, because the UNet runs multiple times and it consists of several PyTorch modules.
This release of diffusers introduces a coarser enable_model_cpu_offload() pipeline API, which copies whole models (not modules) to GPU and makes sure they stay there until another model needs to run. The consequences are:
Memory savings are lower than with enable_sequential_cpu_offload, but inference is much faster, because the number of device transfers is drastically reduced.
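A minimal sketch of the new API (the checkpoint and prompt are just examples):

import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
# whole models are moved to the GPU on demand and stay there until another model needs to run
pipe.enable_model_cpu_offload()
image = pipe("a photo of an astronaut riding a horse on mars").images[0]

<a name="pix2pix-zero"></a>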
Remember the CycleGAN days where one would turn a horse into a zebra in an image while keeping the rest of the content almost untouched? Well, that day has arrived, but in the context of diffusion models. Pix2Pix Zero allows users to edit a particular image (be it real or generated), targeting a source concept (horse, for example) and replacing it with a target concept (zebra, for example).
(Side-by-side comparison: input image vs. edited image.)
Pix2Pix Zero was proposed in Zero-shot Image-to-Image Translation. The StableDiffusionPix2PixZeroPipeline allows you to edit both synthetic and real images.
For the latter, it uses the newly introduced DDIMInverseScheduler to first obtain the inverted noise from the input image and use that in the subsequent generation process.
Both of the use cases leverage the idea of "edit directions", used for steering the generation toward the target concept gradually from the source concept. To know more, we recommend checking out the official documentation.
<a name="attend-excite"></a>
Attend-and-Excite was proposed in Attend-and-Excite: Attention-Based Semantic Guidance for Text-to-Image Diffusion Models. It guides the generative model to modify the cross-attention values during the image synthesis process, generating images that more faithfully depict the input text prompt. Thanks to community contributor @evinpinar for leading the charge to add this pipeline!
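A minimal sketch of how the pipeline can be used (the checkpoint and token indices are illustrative; the indices point at the prompt tokens to be "excited"):

import torch
from diffusers import StableDiffusionAttendAndExcitePipeline

pipe = StableDiffusionAttendAndExcitePipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
prompt = "a cat and a frog"
# token indices for "cat" and "frog" in the tokenized prompt (illustrative)
image = pipe(prompt, token_indices=[2, 5], guidance_scale=7.5, num_inference_steps=50).images[0]
image.save("cat_and_frog.png")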
<a name="semantic-guidance"></a>
Semantic Guidance for Diffusion Models was proposed in SEGA: Instructing Diffusion using Semantic Dimensions and provides strong semantic control over image generation. Small changes to the text prompt usually result in entirely different output images. However, with SEGA, a variety of changes to the image are enabled that can be controlled easily and intuitively and stay true to the original image composition. Thanks to the lead author of SEGA, Manuel (@manuelbrack), who added the pipeline in #2223.
Here is a simple demo:
import torch
from diffusers import SemanticStableDiffusionPipeline
pipe = SemanticStableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
out = pipe(
    prompt="a photo of the face of a woman",
    num_images_per_prompt=1,
    guidance_scale=7,
    editing_prompt=[
        "smiling, smile",  # Concepts to apply
        "glasses, wearing glasses",
        "curls, wavy hair, curly hair",
        "beard, full beard, mustache",
    ],
    reverse_editing_direction=[False, False, False, False],  # Direction of guidance, i.e. increase all concepts
    edit_warmup_steps=[10, 10, 10, 10],  # Warmup period for each concept
    edit_guidance_scale=[4, 5, 5, 5.4],  # Guidance scale for each concept
    edit_threshold=[
        0.99,
        0.975,
        0.925,
        0.96,
    ],  # Threshold for each concept; equals the percentile of the latent space that will be discarded, i.e. threshold=0.99 uses 1% of the latent dimensions
    edit_momentum_scale=0.3,  # Momentum scale that will be added to the latent guidance
    edit_mom_beta=0.6,  # Momentum beta
    edit_weights=[1, 1, 1, 1],  # Weights of the individual concepts against each other (one per concept)
)
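The returned object carries the edited images, which you can save as usual:

image = out.images[0]
image.save("woman_edited.png")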
<a name="self-attention-guidance"></a>
SAG was proposed in Improving Sample Quality of Diffusion Models Using Self-Attention Guidance. It works by extracting the intermediate attention map from a diffusion model at every iteration, selecting the tokens above a certain attention score, and masking and blurring them to obtain a partially blurred input. The dissimilarity between the noise predictions for the blurred and the original input is then leveraged as guidance. With this guidance, the authors observe apparent improvements in a wide range of diffusion models.
import torch
from diffusers import StableDiffusionSAGPipeline
from accelerate.utils import set_seed
pipe = StableDiffusionSAGPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
seed = 8978
prompt = "."
guidance_scale = 7.5
num_images_per_prompt = 1
sag_scale = 1.0
set_seed(seed)
images = pipe(
    prompt, num_images_per_prompt=num_images_per_prompt, guidance_scale=guidance_scale, sag_scale=sag_scale
).images
images[0].save("example.png")
SAG was contributed by @SusungHong (lead author of SAG) in https://github.com/huggingface/diffusers/pull/2193.
<a name="panorama"></a>
MultiDiffusion was proposed in MultiDiffusion: Fusing Diffusion Paths for Controlled Image Generation. It presents a new generation process based on an optimization task that binds together multiple diffusion generation processes with a shared set of parameters or constraints.
import torch
from diffusers import StableDiffusionPanoramaPipeline, DDIMScheduler
model_ckpt = "stabilityai/stable-diffusion-2-base"
scheduler = DDIMScheduler.from_pretrained(model_ckpt, subfolder="scheduler")
pipe = StableDiffusionPanoramaPipeline.from_pretrained(model_ckpt, scheduler=scheduler, torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "a photo of the dolomites"
image = pipe(prompt).images[0]
image.save("dolomites.png")
The pipeline was contributed by @omerbt (lead author of MultiDiffusion Panorama) and @sayakpaul in #2393.
Diffusers is no stranger to the different opinions and perspectives about the challenges that generative technologies bring. Thanks to @giadilli, we have drafted our first Diffusers' Ethical Guidelines with which we hope to initiate a fruitful conversation with the community.
Many practitioners find it easy to fine-tune the Stable Diffusion models shipped by KerasCV. At the same time, diffusers provides a lot of options for inference, deployment and optimization. We have made it possible to easily import and use KerasCV Stable Diffusion checkpoints in diffusers, read more about the process in our new guide.
UniPC is a new fast scheduler in diffusion town! UniPC is a training-free framework designed for the fast sampling of diffusion models, consisting of a corrector (UniC) and a predictor (UniP) that share a unified analytical form and support arbitrary orders. The original codebase can be found here. Thanks to @wl-zhao for the great work and for integrating UniPC into diffusers!
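A minimal sketch of swapping it into an existing pipeline (the checkpoint, prompt, and step count are just examples):

import torch
from diffusers import StableDiffusionPipeline, UniPCMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
# UniPC reaches good quality in relatively few steps
image = pipe("a photo of an astronaut riding a horse on mars", num_inference_steps=20).images[0]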
As part of 0.13.0 we improved the support for EMA in training. We added a common EMAModel in diffusers.training_utils which can be used by all scripts. The EMAModel now supports distributed training, provides new methods to easily evaluate the EMA model during training, and offers a consistent way to save and load the EMA model, similar to other models in diffusers.
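A minimal sketch of the common EMA workflow (the model below is a placeholder; store/copy_to/restore are the methods mentioned in the changelog):

from diffusers import UNet2DModel
from diffusers.training_utils import EMAModel

model = UNet2DModel()  # placeholder model; use your own
ema = EMAModel(model.parameters())

# inside the training loop, after each optimizer step:
ema.step(model.parameters())

# to evaluate with EMA weights without losing the training weights:
ema.store(model.parameters())
ema.copy_to(model.parameters())
# ... run evaluation ...
ema.restore(model.parameters())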
We have replaced flake8 with ruff (much faster), and updated our version of black. These tools are now in sync with the ones used in transformers, so the contributing experience is now more consistent for people using both codebases :)
- predict_epsilon by @patrickvonplaten in #2109
- UNet2DModel to use arbitrary class embeddings by @pcuenca in #2080
- torwards -> towards by @RahulBhalley in #2134
- local_files_only is specifiied by @patrickvonplaten in #2119
- safetensors docs. by @Narsil in #2122
- [diffusers-cli] Fix typo in accelerate and transformers versions by @pcuenca in #2154
- requests instead of wget in convert_from_ckpt.py by @Abhishek-Varma in #2168
- ~s in favor of full-fledged links. by @sayakpaul in #2229
- key optional so default pipelines don't fail by @pcuenca in #2176
- from_flax by @pcuenca in #2187
- accelerate save & loading hooks to have better checkpoint structure by @patrickvonplaten in #2048
- state_dict() and load_state_dict() & add cur_decay_value by @chenguolin in #2146
- store() and restore() methods to EMAModel by @sayakpaul in #2302
- enable_model_cpu_offload by @pcuenca in #2285

The following contributors have made significant changes to the library over the last release:
Make sure cached models can be loaded in offline mode.

- local_files_only is specifiied by @patrickvonplaten in #2119

Instruct-Pix2Pix is a Stable Diffusion model fine-tuned for editing images from human instructions. Given an input image and a written instruction that tells the model what to do, the model follows these instructions to edit the image.
The model was released with the paper InstructPix2Pix: Learning to Follow Image Editing Instructions, which contains more details about the model.
pip install diffusers transformers safetensors accelerate
import requests
import torch
from PIL import Image, ImageOps

from diffusers import StableDiffusionInstructPix2PixPipeline

model_id = "timbrooks/instruct-pix2pix"
pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

url = "https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/mountain.png"

def download_image(url):
    image = Image.open(requests.get(url, stream=True).raw)
    image = ImageOps.exif_transpose(image)
    image = image.convert("RGB")
    return image
image = download_image(url)
prompt = "make the mountains snowy"
edit = pipe(prompt, image=image, num_inference_steps=20, image_guidance_scale=1.5, guidance_scale=7).images[0]
edit.save("snowy_mountains.png")
Diffusion Transformers (DiT) is a class-conditional latent diffusion model that replaces the commonly used U-Net backbone with a transformer operating on latent patches. The pretrained model is trained on the ImageNet-1K dataset and is able to generate class-conditional images of 256x256 or 512x512 pixels.
The model was released with the paper Scalable Diffusion Models with Transformers.
import torch
from diffusers import DiTPipeline
model_id = "facebook/DiT-XL-2-256"
pipe = DiTPipeline.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
# pick words that exist in ImageNet
words = ["white shark", "umbrella"]
class_ids = pipe.get_label_ids(words)
output = pipe(class_labels=class_ids)
image = output.images[0] # label 'white shark'
LoRA is a technique for performing parameter-efficient fine-tuning for large models. LoRA works by adding so-called "update matrices" to specific blocks of a pre-trained model. During fine-tuning, only these update matrices are updated while the pre-trained model parameters are kept frozen. This allows us to achieve greater memory efficiency as well as easier portability during fine-tuning.
LoRA was proposed in LoRA: Low-Rank Adaptation of Large Language Models. In the original paper, the authors investigated LoRA for fine-tuning large language models like GPT-3. cloneofsimo was the first to try out LoRA training for Stable Diffusion in the popular lora GitHub repository.
Diffusers now supports LoRA! This means you can now fine-tune a model like Stable Diffusion using consumer GPUs like Tesla T4 or RTX 2080 Ti. LoRA support was added to UNet2DConditionModel and DreamBooth training script by @patrickvonplaten in #1884.
By using LoRA, the fine-tuned checkpoints will be just ~3 MB in size. After fine-tuning, you can use the LoRA checkpoints like so:
from diffusers import StableDiffusionPipeline
import torch
model_path = "sayakpaul/sd-model-finetuned-lora-t4"
pipe = StableDiffusionPipeline.from_pretrained("CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16)
pipe.unet.load_attn_procs(model_path)
pipe.to("cuda")
prompt = "A pokemon with blue eyes."
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("pokemon.png")
You can follow these resources to learn more about how to use LoRA in diffusers.
LoRA leverages a new method to customize the cross attention layers deep in the UNet. This can be useful for other creative approaches such as Prompt-to-Prompt, and it makes it easier to apply optimizers like xFormers. This new "attention processor" abstraction was created by @patrickvonplaten in #1639 after discussing the design with the community, and we have used it to rewrite our xFormers and attention slicing implementations!
A long requested feature, prolific community member @camenduru took up the gauntlet in #1900 and created a way to convert Flax model weights for PyTorch. This means that you can train or fine-tune models super fast using Google TPUs, and then convert the weights to PyTorch for everybody to use. Thanks @camenduru!
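One way to do the conversion from Python is the from_flax flag on from_pretrained (a hedged sketch; it assumes the repository hosts Flax weights):

from diffusers import StableDiffusionPipeline

# loads Flax weights and converts them to PyTorch on the fly
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", from_flax=True)
pipe.save_pretrained("sd-v1-5-pytorch")  # saved as PyTorch weights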
Another community member, @dhruvrnaik, ported the image-to-image pipeline to Flax in #1355! Using a TPU v2-8 (available in Colab's free tier), you can generate 8 images at once in a few seconds!
DEIS (Diffusion Exponential Integrator Sampler) is a new fast multistep scheduler that can generate high-quality samples in fewer steps. The scheduler was introduced in the paper Fast Sampling of Diffusion Models with Exponential Integrator, which contains more details.
from diffusers import StableDiffusionPipeline, DEISMultistepScheduler
import torch
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16)
pipe.scheduler = DEISMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(prompt, generator=generator, num_inference_steps=25).images[0]
One can now pass CPU generators to all pipelines even if the pipeline is on GPU. This ensures much better reproducibility across GPU hardware:
import torch
from diffusers import DDIMPipeline
import numpy as np
model_id = "google/ddpm-cifar10-32"
# load model and scheduler
ddim = DDIMPipeline.from_pretrained(model_id)
ddim.to("cuda")
# create a generator for reproducibility
generator = torch.manual_seed(0)
# run pipeline for just two steps and return numpy tensor
image = ddim(num_inference_steps=2, output_type="np", generator=generator).images
print(np.abs(image).sum())
See: #1902 and https://huggingface.co/docs/diffusers/using-diffusers/reproducibility
- save_pretrained(...) doesn't accidentally delete files: #2038
- report_to in training scripts by @pcuenca in #1888
- diffusers-cli env by @anton-l in #1898
- lora tag to the model tags by @apolinario in #2103

The following contributors have made significant changes to the library over the last release:
This patch release fixes a bug with num_images_per_prompt in the UnCLIPPipeline.
Karlo is a text-conditional image generation model based on OpenAI's unCLIP architecture with the improvement over the standard super-resolution model from 64px to 256px, recovering high-frequency details in a small number of denoising steps.
This alpha version of Karlo is trained on 115M image-text pairs, including COYO-100M high-quality subset, CC3M, and CC12M. For more information about the architecture, see the Karlo repository: https://github.com/kakaobrain/karlo
pip install diffusers transformers safetensors accelerate
import torch
from diffusers import UnCLIPPipeline
pipe = UnCLIPPipeline.from_pretrained("kakaobrain/karlo-v1-alpha", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
prompt = "a high-resolution photograph of a big red frog on a green leaf."
image = pipe(prompt).images[0]
The community pipelines hosted in diffusers/examples/community will now follow the installed version of the library.
E.g. if you have diffusers==0.9.0 installed, the pipelines from the v0.9.0 branch will be used: https://github.com/huggingface/diffusers/tree/v0.9.0/examples/community
If you've installed diffusers from source, e.g. with pip install git+https://github.com/huggingface/diffusers then the latest versions of the pipelines will be fetched from the main branch.
To change the custom pipeline version, set the custom_revision variable like so:
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
    "google/ddpm-cifar10-32", custom_pipeline="one_step_unet", custom_revision="0.10.2"
)
Many of the most important checkpoints now have safetensors (https://github.com/huggingface/safetensors) weights available. Upon installing safetensors with:
pip install safetensors
You will see a nice speed-up when loading your model :rocket:
Some of the most important checkpoints now have safetensors weights added.
We fixed a lot of bugs for batched generation. All pipelines should now correctly process batches of prompts and images :hugs: Also we made it much easier to tweak images with reproducible seeds: https://huggingface.co/docs/diffusers/using-diffusers/reusing_seeds
- convert_diffusers_to_original_stable_diffusion.py by @apolinario in #1681

This patch removes the hard requirement for transformers>=4.25.1 in case external libraries were downgrading the library upon startup in a non-controllable way.
🚨🚨🚨 Note that xformers is not automatically enabled anymore 🚨🚨🚨
The reasons for this are given here: https://github.com/huggingface/diffusers/pull/1640#discussion_r1044651551:
We should not automatically enable xformers for three reasons:

1. It's not a PyTorch-like API. PyTorch doesn't enable all the fastest options available by default.
2. We allocate GPU memory before the user even does .to("cuda").
3. This behavior is not consistent with cases where xformers is not installed.
=> This means: if you relied on xformers being enabled automatically, please make sure to add the following now:
import logging

from diffusers.utils.import_utils import is_xformers_available

logger = logging.getLogger(__name__)  # assumes a logger; any logging setup works

unet = ...  # load unet

if is_xformers_available():
    try:
        unet.enable_xformers_memory_efficient_attention(True)
    except Exception as e:
        logger.warning(
            "Could not enable memory efficient attention. Make sure xformers is installed"
            f" correctly and a GPU is available: {e}"
        )
for the UNet (e.g. in dreambooth) or for the pipeline:
from diffusers.utils.import_utils import is_xformers_available

pipe = ...  # load pipeline

if is_xformers_available():
    try:
        pipe.enable_xformers_memory_efficient_attention(True)
    except Exception as e:
        logger.warning(
            "Could not enable memory efficient attention. Make sure xformers is installed"
            f" correctly and a GPU is available: {e}"
        )
This patch returns enable_xformers_memory_efficient_attention() to UNet2DConditionModel to restore backward compatibility.
The new depth-guided stable diffusion model is fully supported in this release. The model is conditioned on monocular depth estimates inferred via MiDaS and can be used for structure-preserving img2img and shape-conditional synthesis.
Installing the transformers library from source is required for the MiDaS model:
pip install --upgrade git+https://github.com/huggingface/transformers/
import torch
import requests
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline
pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth",
    torch_dtype=torch.float16,
).to("cuda")
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
init_image = Image.open(requests.get(url, stream=True).raw)
prompt = "two tigers"
n_prompt = "bad, deformed, ugly, bad anatomy"
image = pipe(prompt=prompt, image=init_image, negative_prompt=n_prompt, strength=0.7).images[0]
The updated Stable Diffusion 2.1 checkpoints are also released and fully supported.
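For example, a minimal sketch loading the 2.1 checkpoint (the prompt is just an example):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
image = pipe("High quality photo of an astronaut riding a horse in space").images[0]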
We now support SafeTensors: a new simple format for storing tensors safely (as opposed to pickle) that is still fast (zero-copy).
| Format | Safe | Zero-copy | Lazy loading | No file size limit | Layout control | Flexibility | Bfloat16 |
|---|---|---|---|---|---|---|---|
| pickle (PyTorch) | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ |
| H5 (Tensorflow) | ✓ | ✗ | ✓ | ✓ | ~ | ~ | ✗ |
| SavedModel (Tensorflow) | ✓ | ✗ | ✗ | ✓ | ✓ | ✗ | ✓ |
| MsgPack (flax) | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ |
| SafeTensors | ✓ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ |
More details about the comparison here: https://github.com/huggingface/safetensors#yet-another-format-
pip install safetensors
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
pipe.save_pretrained("./safe-stable-diffusion-2-1", safe_serialization=True)
# you can also push this checkpoint to the HF Hub and load from there
safe_pipe = StableDiffusionPipeline.from_pretrained("./safe-stable-diffusion-2-1")
An implementation of Paint by Example: Exemplar-based Image Editing with Diffusion Models by Binxin Yang, Shuyang Gu, Bo Zhang, Ting Zhang, Xuejin Chen, Xiaoyan Sun, Dong Chen, and Fang Wen.
import requests
import torch
from io import BytesIO
from PIL import Image

from diffusers import DiffusionPipeline

def download_image(url):
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")
img_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/image/example_1.png"
mask_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/mask/example_1.png"
example_url = "https://raw.githubusercontent.com/Fantasy-Studio/Paint-by-Example/main/examples/reference/example_1.jpg"
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))
example_image = download_image(example_url).resize((512, 512))
pipe = DiffusionPipeline.from_pretrained("Fantasy-Studio/Paint-by-Example", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
image = pipe(image=init_image, mask_image=mask_image, example_image=example_image).images[0]
Audio Diffusion leverages the recent advances in image generation using diffusion models by converting audio samples to and from mel spectrogram images.
from IPython.display import Audio, display
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained("teticio/audio-diffusion-ddim-256").to("cuda")
output = pipe()
display(output.images[0])
display(Audio(output.audios[0], rate=pipe.mel.get_sample_rate()))
This pipeline is added to support the latest schedulers from @crowsonkb's k-diffusion. The purpose of this pipeline is to compare scheduler implementations and updates, so new features from other pipelines are unlikely to be supported!
pip install k-diffusion
from diffusers import StableDiffusionKDiffusionPipeline
import torch
pipe = StableDiffusionKDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1-base")
pipe = pipe.to("cuda")
pipe.set_scheduler("sample_heun")
image = pipe("astronaut riding horse", num_inference_steps=25).images[0]
Algorithm 1 of Karras et al. Scheduler ported from @crowsonkb's k-diffusion.
from diffusers import HeunDiscreteScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
pipe.scheduler = HeunDiscreteScheduler.from_config(pipe.scheduler.config)
The original paper (DPM-Solver) and its improved version (DPM-Solver++) describe the algorithm; the original implementation is also publicly available.
from diffusers import DPMSolverSinglestepScheduler, StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("stabilityai/stable-diffusion-2-1")
pipe.scheduler = DPMSolverSinglestepScheduler.from_config(pipe.scheduler.config)
- from_pt by @pcuenca in #1436
- ort_nightly_directml to the onnxruntime candidates by @anton-l in #1458
- train_unconditional_ort by @anton-l in #1504
- image argument in all pipelines by @fboulnois in #1361
- --image_size to the conversion script by @anton-l in #1509

Install with:

pip install diffusers[torch]==0.9 transformers
Stable Diffusion 2.0 is available in several flavors:
768x768: New stable diffusion model (Stable Diffusion 2.0-v) at 768x768 resolution. Same number of parameters in the U-Net as 1.5, but it uses OpenCLIP-ViT/H as the text encoder and is trained from scratch. SD 2.0-v is a so-called v-prediction model.
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
repo_id = "stabilityai/stable-diffusion-2"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, guidance_scale=9, num_inference_steps=25).images[0]
image.save("astronaut.png")
512x512: The above model is finetuned from SD 2.0-base, which was trained as a standard noise-prediction model on 512x512 images and is also made available.
import torch
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
repo_id = "stabilityai/stable-diffusion-2-base"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
prompt = "High quality photo of an astronaut riding a horse in space"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("astronaut.png")
This model for text-guided inpainting is finetuned from SD 2.0-base. It follows the mask-generation strategy presented in LAMA, which, in combination with the latent VAE representations of the masked image, is used as additional conditioning.
import requests
import torch
from io import BytesIO
from PIL import Image

from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler

def download_image(url):
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
init_image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))
repo_id = "stabilityai/stable-diffusion-2-inpainting"
pipe = DiffusionPipeline.from_pretrained(repo_id, torch_dtype=torch.float16, revision="fp16")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
image = pipe(prompt=prompt, image=init_image, mask_image=mask_image, num_inference_steps=25).images[0]
image.save("yellow_cat.png")
The model was trained on crops of size 512x512 and is a text-guided latent upscaling diffusion model. In addition to the textual input, it receives a noise_level as an input parameter, which can be used to add noise to the low-resolution input according to a predefined diffusion schedule.
import requests
from PIL import Image
from io import BytesIO
from diffusers import StableDiffusionUpscalePipeline
import torch
model_id = "stabilityai/stable-diffusion-x4-upscaler"
pipeline = StableDiffusionUpscalePipeline.from_pretrained(model_id, revision="fp16", torch_dtype=torch.float16)
pipeline = pipeline.to("cuda")
url = "https://huggingface.co/datasets/hf-internal-testing/diffusers-images/resolve/main/sd2-upscale/low_res_cat.png"
response = requests.get(url)
low_res_img = Image.open(BytesIO(response.content)).convert("RGB")
low_res_img = low_res_img.resize((128, 128))
prompt = "a white cat"
upscaled_image = pipeline(prompt=prompt, image=low_res_img).images[0]
upscaled_image.save("upsampled_cat.png")
Previously there was a :bug: when saving & loading Versatile Diffusion; this is now fixed so that memory-efficient saving & loading works as expected.
- predict_epsilon by @pcuenca in #1393

This patch release fixes an error with CLIPVisionModelWithProjection imports on a non-git transformers installation.
:warning: Please upgrade with pip install --upgrade diffusers or pip install diffusers==0.8.1
VersatileDiffusion, released by SHI-Labs, is a unified multi-flow multimodal diffusion model capable of multiple tasks such as text-to-image, image variations, dual-guided (text + image) image generation, and image-to-text.
Install transformers from "main": pip install git+https://github.com/huggingface/transformers
Then you can run:
from diffusers import VersatileDiffusionPipeline
import torch
import requests
from io import BytesIO
from PIL import Image
pipe = VersatileDiffusionPipeline.from_pretrained("shi-labs/versatile-diffusion", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
# initial image
url = "https://huggingface.co/datasets/diffusers/images/resolve/main/benz.jpg"
response = requests.get(url)
image = Image.open(BytesIO(response.content)).convert("RGB")
# prompt
prompt = "a red car"
# text to image
image = pipe.text_to_image(prompt).images[0]
# image variation
image = pipe.image_variation(image).images[0]
# dual-guided (text + image) generation
image = pipe.dual_guided(prompt, image).images[0]
More in-depth details can be found in the documentation.
AltDiffusion is a multilingual latent diffusion model that supports text-to-image generation for 9 different languages: English, Chinese, Spanish, French, Japanese, Korean, Arabic, Russian and Italian.
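A minimal sketch (the checkpoint name is the multilingual m9 variant on the Hub; the Chinese prompt is just an example):

import torch
from diffusers import AltDiffusionPipeline

pipe = AltDiffusionPipeline.from_pretrained("BAAI/AltDiffusion-m9", torch_dtype=torch.float16).to("cuda")
image = pipe("黑暗精灵公主，非常详细，幻想艺术").images[0]  # "dark elf princess, highly detailed, fantasy art"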
StableDiffusionImageVariationPipeline by @justinpinkney wraps a Stable Diffusion model that takes an image as input and generates variations of that image. It is conditioned on CLIP image embeddings instead of text.
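A minimal, hedged sketch (the checkpoint is the one published by the author on the Hub; the input image path is illustrative):

from PIL import Image
from diffusers import StableDiffusionImageVariationPipeline

pipe = StableDiffusionImageVariationPipeline.from_pretrained("lambdalabs/sd-image-variations-diffusers")
pipe = pipe.to("cuda")

init_image = Image.open("input.png").convert("RGB")  # hypothetical input image
image = pipe(init_image).images[0]
image.save("variation.png")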
Safe Latent Diffusion (SLD), released by ml-research@TUDarmstadt group, is a new practical and sophisticated approach to prevent unsolicited content from being generated by diffusion models. One of the authors of the research contributed their implementation to diffusers.
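A minimal sketch using the dedicated checkpoint (safety concepts come preconfigured; the prompt is just an example):

from diffusers import StableDiffusionPipelineSafe

pipe = StableDiffusionPipelineSafe.from_pretrained("AIML-TUDA/stable-diffusion-safe").to("cuda")
image = pipe(prompt="a photo of an astronaut riding a horse on mars").images[0]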
vq diffusion classifier free sampling by @williamberman #1294
LDM super resolution is a latent 4x super-resolution diffusion model released by CompVis.
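A minimal sketch (the checkpoint is the one released by CompVis; the input path and parameters are illustrative):

from PIL import Image
from diffusers import LDMSuperResolutionPipeline

pipe = LDMSuperResolutionPipeline.from_pretrained("CompVis/ldm-super-resolution-4x-openimages")
pipe = pipe.to("cuda")

low_res = Image.open("low_res.png").convert("RGB").resize((128, 128))  # hypothetical input
upscaled = pipe(low_res, num_inference_steps=100, eta=1).images[0]
upscaled.save("upscaled.png")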
CycleDiffusion is a method that uses text-to-image diffusion models for image-to-image editing.
Uses CLIPSeg to automatically generate a mask using segmentation, and then applies Stable Diffusion in-painting.
K-Diffusion Pipeline is a community pipeline that allows using any sampler from K-diffusion with diffusers models.
DPMSolverMultistepScheduler is the 🧨 diffusers implementation of DPM-Solver++, a state-of-the-art scheduler that was contributed by one of the authors of the paper. This scheduler is able to achieve great quality in as few as 20 steps. It's a drop-in replacement for the default Stable Diffusion scheduler, so you can use it to essentially halve generation times. It works so well that we adopted it for the Stable Diffusion demo Spaces: https://huggingface.co/spaces/stabilityai/stable-diffusion, https://huggingface.co/spaces/runwayml/stable-diffusion-v1-5.
You can use it like this:
from diffusers import DiffusionPipeline, DPMSolverMultistepScheduler
repo_id = "runwayml/stable-diffusion-v1-5"
scheduler = DPMSolverMultistepScheduler.from_pretrained(repo_id, subfolder="scheduler")
stable_diffusion = DiffusionPipeline.from_pretrained(repo_id, scheduler=scheduler)
The example above also demonstrates how to load schedulers using a new API that is coherent with model loading and therefore more natural and intuitive.
You can load a scheduler using from_pretrained, as demonstrated above, or you can instantiate one from an existing scheduler configuration. This is a way to replace the scheduler of a pipeline that was previously loaded:
from diffusers import DiffusionPipeline, DDIMScheduler

pipeline = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
Read more about these changes in the documentation. See also the community pipeline that allows using any of the K-diffusion samplers with diffusers, as mentioned above!
We work relentlessly to bring performance optimizations and memory-reduction techniques to 🧨 diffusers. These are two of the most noteworthy incorporations in this release:
- StableDiffusionOnnxPipeline by @pcuenca in #1191
- repeat_interleave fix for mps to stable diffusion image2image pipeline by @jncasey in #1135
- UNet2DModel and UNet2DConditionModel by @xenova in #1275
- pipeline_stable_diffusion_inpaint.py:prepare_mask_and_masked_image by @vict0rsch in #1003
- UNet2DModel by @patil-suraj in #1216

This patch release fixes a bug that broke the Flax Stable Diffusion inference. Thanks a million for spotting it, @camenduru, in https://github.com/huggingface/diffusers/issues/1145, and thanks a lot to @pcuenca and @kashif for fixing it in https://github.com/huggingface/diffusers/pull/1149.
This patch release makes accelerate a soft dependency to avoid an error when installing diffusers with pre-existing torch.
:warning: The PyTorch pipelines now require accelerate for improved model loading times!
Install Diffusers with pip install --upgrade diffusers[torch] to get everything in a single command.
PyTorch and Apple have been working on improving mps support in PyTorch 1.13, so Apple Silicon is now a first-class citizen in diffusers 0.7.0!
Memory management is crucial to achieve fast generation speed. We recommend always using attention slicing on Apple Silicon, as it drastically reduces memory pressure and prevents paging or swapping. This is especially important for computers with less than 64 GB of Unified RAM, and may be the difference between generating an image in seconds rather than in minutes. Use it like this:
from diffusers import StableDiffusionPipeline
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe = pipe.to("mps")
# Recommended if your computer has < 64 GB of RAM
pipe.enable_attention_slicing()
prompt = "a photo of an astronaut riding a horse on mars"
# First-time "warmup" pass
_ = pipe(prompt, num_inference_steps=1)
image = pipe(prompt).images[0]
image.save("astronaut.png")
Our automated tests now include a full battery of tests on the mps device. This will be helpful to identify issues early and ensure the quality on Apple Silicon going forward.
See more details in the documentation.
diffusers goes audio 🎵: Dance Diffusion by Harmonai is the first audio model in 🧨 Diffusers!
Try it out to generate some random music:
from diffusers import DiffusionPipeline
import scipy
model_id = "harmonai/maestro-150k"
pipeline = DiffusionPipeline.from_pretrained(model_id)
pipeline = pipeline.to("cuda")
audio = pipeline(audio_length_in_s=4.0).audios[0]
# To save locally
scipy.io.wavfile.write("maestro_test.wav", pipeline.unet.sample_rate, audio.transpose())
These are the Euler schedulers, from the paper Elucidating the Design Space of Diffusion-Based Generative Models by Karras et al. (2022). The diffusers implementation is based on the original k-diffusion implementation by Katherine Crowson. The Euler schedulers are fast, often generating really good outputs with 20-30 steps.
import torch
from diffusers import StableDiffusionPipeline, EulerDiscreteScheduler

euler_scheduler = EulerDiscreteScheduler.from_config("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", scheduler=euler_scheduler, revision="fp16", torch_dtype=torch.float16
)
pipeline.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipeline(prompt, num_inference_steps=25).images[0]
import torch
from diffusers import StableDiffusionPipeline, EulerAncestralDiscreteScheduler

euler_ancestral_scheduler = EulerAncestralDiscreteScheduler.from_config("runwayml/stable-diffusion-v1-5", subfolder="scheduler")
pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", scheduler=euler_ancestral_scheduler, revision="fp16", torch_dtype=torch.float16
)
pipeline.to("cuda")
prompt = "a photo of an astronaut riding a horse on mars"
image = pipeline(prompt, num_inference_steps=25).images[0]
memory_efficient_attention: Even faster and more memory-efficient stable diffusion using the efficient flash attention implementation from xformers.
To leverage it, just make sure you have xformers installed; then:
from diffusers import StableDiffusionPipeline
import torch
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    revision="fp16",
    torch_dtype=torch.float16,
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()

with torch.inference_mode():
    sample = pipe("a small cat")
# optional: You can disable it via
# pipe.disable_xformers_memory_efficient_attention()
Thanks to accelerate, pipeline loading is much, much faster. There are two parts to it:
- With low_cpu_mem_usage (enabled by default), no initialization will be performed.
- You can pass device_map="auto" to automatically select the best device(s) where the pre-trained weights will be initially sent to.

In our tests, loading time was more than halved on CUDA devices, and went down from 12s to 4s on an Apple M1 computer.
As a side effect, CPU usage will be greatly reduced during loading, because no temporary copies of the weights are necessary.
This feature requires PyTorch 1.9 or better and accelerate 0.8.0 or higher.
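A minimal sketch (device_map requires accelerate; the checkpoint name is just an example):

from diffusers import DiffusionPipeline

# low_cpu_mem_usage is on by default; device_map="auto" places weights on the best available device(s)
pipe = DiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5", device_map="auto")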
RePaint allows reusing any pretrained DDPM model for free-form inpainting by adding restarts to the denoising schedule. Based on the paper RePaint: Inpainting using Denoising Diffusion Probabilistic Models by Andreas Lugmayr et al.
import torch
from diffusers import RePaintPipeline, RePaintScheduler

# original_image and mask_image are PIL images you provide (placeholders here)

# Load the RePaint scheduler and pipeline based on a pretrained DDPM model
scheduler = RePaintScheduler.from_config("google/ddpm-ema-celebahq-256")
pipe = RePaintPipeline.from_pretrained("google/ddpm-ema-celebahq-256", scheduler=scheduler)
pipe = pipe.to("cuda")

generator = torch.Generator(device="cuda").manual_seed(0)
output = pipe(
    original_image=original_image,
    mask_image=mask_image,
    num_inference_steps=250,
    eta=0.0,
    jump_length=10,
    jump_n_sample=10,
    generator=generator,
)
inpainted_image = output.images[0]
The pipeline lets you input prompts without the 77-token length limit. You can increase word weighting by using "()" or decrease it by using "[]". The pipeline also lets you use the main use cases of the stable diffusion pipeline in a single class. For a code example, see Long Prompt Weighting Stable Diffusion.
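A minimal, hedged sketch of loading this community pipeline (the weighted prompt and max_embeddings_multiples value are illustrative):

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    custom_pipeline="lpw_stable_diffusion",
    torch_dtype=torch.float16,
).to("cuda")

prompt = "(masterpiece), a portrait of a [sleepy] cat, best quality"
image = pipe.text2img(prompt, max_embeddings_multiples=3).images[0]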
Generate an image from an audio sample using pre-trained OpenAI whisper-small and Stable Diffusion. For a code example, see Speech to Image
A minimal implementation that allows for users to add "wildcards", denoted by __wildcard__ to prompts that are used as placeholders for randomly sampled values given by either a dictionary or a .txt file.
For a code example, see Wildcard Stable Diffusion
Use logic operators to do compositional generation. For a code example, see Composable Stable Diffusion
Image editing with Stable Diffusion. For a code example, see Imagic Stable Diffusion
Allows generating a larger image while keeping the content of the original image. For a code example, see Seed Resizing
- torch_type -> torch_dtype by @pcuenca in #972
- mps reproducibility issue when running with pytest-xdist by @anton-l in #976
- init_git_repo, refactor train_unconditional.py by @anton-l in #1022
- --dataloader_num_workers to the DDPM training example by @anton-l in #1027
- numpy_to_pil by @anton-l in #1025
- F.interpolate() for large batch sizes by @NouamaneTazi in #1006
- mps by @pcuenca in #961
- safety_checker to be None when using CPU offload by @pcuenca in #1078
- None pipeline components by @anton-l in #1118

The first official stable diffusion checkpoint fine-tuned on inpainting has been released.
You can try it out in the official demo here
or code it up yourself :computer: :
from io import BytesIO

import requests
import torch
from PIL import Image

from diffusers import StableDiffusionInpaintPipeline

def download_image(url):
    response = requests.get(url)
    return Image.open(BytesIO(response.content)).convert("RGB")
img_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png"
mask_url = "https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png"
image = download_image(img_url).resize((512, 512))
mask_image = download_image(mask_url).resize((512, 512))
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting",
    revision="fp16",
    torch_dtype=torch.float16,
)
pipe.to("cuda")
prompt = "Face of a yellow cat, high resolution, sitting on a park bench"
output = pipe(prompt=prompt, image=image, mask_image=mask_image)
image = output.images[0]
gives:
| image | mask_image | prompt | | Output |
|---|---|---|---|---|
| <img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo.png" alt="drawing" width="200"/> | <img src="https://raw.githubusercontent.com/CompVis/latent-diffusion/main/data/inpainting_examples/overture-creations-5sI6fQgYIuo_mask.png" alt="drawing" width="200"/> | Face of a yellow cat, high resolution, sitting on a park bench | => | <img src="https://huggingface.co/datasets/patrickvonplaten/images/resolve/main/test.png" alt="drawing" width="200"/> |
:warning: This release deprecates the unsupervised noising-based inpainting pipeline into StableDiffusionInpaintPipelineLegacy.
The new StableDiffusionInpaintPipeline is based on a Stable Diffusion model finetuned for the inpainting task: https://huggingface.co/runwayml/stable-diffusion-inpainting
Note: when loading StableDiffusionInpaintPipeline with a non-finetuned model (i.e. the one saved with diffusers<=0.5.1), the pipeline will default to StableDiffusionInpaintPipelineLegacy, to maintain backward compatibility :sparkles:
from diffusers import StableDiffusionInpaintPipeline
pipe = StableDiffusionInpaintPipeline.from_pretrained("CompVis/stable-diffusion-v1-4")
assert pipe.__class__.__name__ == "StableDiffusionInpaintPipelineLegacy"
Context:
Why this change? When Stable Diffusion came out ~2 months ago, there were many unofficial in-painting demos using the original v1-4 checkpoint ("CompVis/stable-diffusion-v1-4"). These demos worked reasonably well, so we integrated an experimental StableDiffusionInpaintPipeline class into diffusers. Now that the official inpainting checkpoint has been released (https://github.com/runwayml/stable-diffusion), we decided to make this our official pipeline and move the old / hacky one to StableDiffusionInpaintPipelineLegacy.
Thanks to the contribution by @zledas (#552) this release supports OnnxStableDiffusionImg2ImgPipeline and OnnxStableDiffusionInpaintPipeline optimized for CPU inference:
from diffusers import OnnxStableDiffusionImg2ImgPipeline, OnnxStableDiffusionInpaintPipeline
img_pipeline = OnnxStableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", revision="onnx", provider="CPUExecutionProvider"
)
inpaint_pipeline = OnnxStableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", revision="onnx", provider="CPUExecutionProvider"
)
Two new community pipelines have been added to diffusers :fire:
Interpolate the latent space of Stable Diffusion between different prompts/seeds. For more info see stable-diffusion-videos.
For a code example, see Stable Diffusion Interpolation
One Stable Diffusion Pipeline with all functionalities of Text2Image, Image2Image and Inpainting
For a code example, see Stable Diffusion Mega
This patch release fixes a bug with Flax's NSFW safety checker in the pipeline.
https://github.com/huggingface/diffusers/pull/832 by @patil-suraj
We added JAX support for Stable Diffusion! You can now run Stable Diffusion on Colab TPUs (and GPUs too!) for faster inference.
Check out this TPU-ready Colab for a Stable Diffusion pipeline, and the detailed blog post on Stable Diffusion and parallelism in JAX / Flax: https://huggingface.co/blog/stable_diffusion_jax :hugs:
The most used models, schedulers and pipelines have been ported to JAX/Flax, namely:
- Models: FlaxAutoencoderKL, FlaxUNet2DConditionModel
- Schedulers: FlaxDDIMScheduler, FlaxPNDMScheduler
- Pipelines: FlaxStableDiffusionPipeline
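A minimal, hedged sketch of parallel inference with the Flax pipeline (the checkpoint revision and helper usage follow the JAX blog post linked above):

import jax
import jax.numpy as jnp
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline

pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", revision="bf16", dtype=jnp.bfloat16
)
prompt = "a photo of an astronaut riding a horse on mars"
num_devices = jax.device_count()
prompt_ids = pipeline.prepare_inputs([prompt] * num_devices)

# replicate the parameters and shard the inputs across devices
params = replicate(params)
prompt_ids = shard(prompt_ids)
rng = jax.random.split(jax.random.PRNGKey(0), num_devices)

images = pipeline(prompt_ids, params, rng, jit=True).images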
Thanks to the :hugs: accelerate integration with DeepSpeed, a few of our training examples became even more optimized in terms of VRAM and speed:
- norm_num_groups by @akash5474 in #789

This patch release allows the img2img pipeline to be run in fp16 and fixes a bug with the "mps" device.