Modular Diffusers introduces a new way to build diffusion pipelines by composing reusable blocks. Instead of writing entire pipelines from scratch, you can now mix and match building blocks to create custom workflows tailored to your specific needs! This complements the existing DiffusionPipeline class, providing a more flexible way to create custom diffusion pipelines.
Find more details on how to get started with Modular Diffusers here, and also check out the announcement post.
- @apply_lora_scale decorator for simplifying model definitions (#12994)
- device_map (#12811)

A lot of the above features/improvements came as part of the MVP program we have been running. Immense thanks to the contributors!
- T5Tokenizer for Transformers v5.0+ compatibility (#12877)
- num_videos_per_prompt > 1 and CFG (#13121)
- txt_seq_lens handling (#12702)
- prefix_token_len bug (#12845)
- is_fsdp determination (#12960)
- get_image_features API (#13052)
- aiter availability check (#13059)
- prompt and prior_token_ids simultaneously in GlmImagePipeline (#13092)
- OvisImagePipeline in AUTO_TEXT2IMAGE_PIPELINES_MAPPING by @alvarobartt in #12876
- T5Tokenizer instead of MT5Tokenizer (removed in Transformers v5.0+) by @alvarobartt in #12877
- AutoencoderMixin by @sayakpaul in #12873
- enable_auto_cpu_offload by @sayakpaul in #12578
- is_fsdp is determined by @sayakpaul in #12960
- PeftLoraLoaderMixinTests to Enable/Disable Text Encoder LoRA Tests by @dg845 in #12962
- disable_mmap in pipeline from_pretrained by @hlky in #12854
- ChromaInpaintPipeline by @hameerabbasi in #12848
- *pooled_* mentions from Chroma inpaint by @hameerabbasi in #13026
- from_single_file method for WanAnimateTransformer3DModel by @samadwar in #12691
- get_image_features API by @JaredforReal in #13052
- ModularPipeline.blocks attribute by @yiyixuxu in #13014
- encode_video by Accepting More Input Types by @dg845 in #13057
- prompt and prior_token_ids to be provided simultaneously in GlmImagePipeline by @JaredforReal in #13092
- num_videos_per_prompt > 1 and CFG is Enabled by @dg845 in #13121
- setuptools pkg_resources Errors by @dg845 in #13129
- setuptools pkg_resources Bug for PR GPU Tests by @dg845 in #13132
- typing exports where possible by @sayakpaul in #12524
- ModularPipeline by @yiyixuxu in #13100
- setuptools CI Fix as the Failing Pipelines are Deprecated by @dg845 in #13149
- ftfy import for PRX Pipeline by @dg845 in #13154
- from_config with custom code by @DN6 in #13123
- typing Import Error by @dg845 in #13178
- transformers v5 by @sayakpaul in #12976
- do_classifier_free_guidance threshold in ZImagePipeline by @kirillsst in #13183
- auto_map to model config by @DN6 in #13186
- do_classifier_free_guidance thresholds by @asomoza in #13212

The following contributors have made significant changes to the library over the last release:
- ModularPipeline.blocks attribute (#13014)
- ModularPipeline (#13100)
- AutoencoderMixin (#12873)
- enable_auto_cpu_offload (#12578)
- is_fsdp is determined (#12960)
- typing exports where possible (#12524)
- transformers v5 (#12976)
- from_config with custom code (#13123)
- auto_map to model config (#13186)
- disable_mmap in pipeline from_pretrained (#12854)
- PeftLoraLoaderMixinTests to Enable/Disable Text Encoder LoRA Tests (#12962)
- encode_video by Accepting More Input Types (#13057)
- num_videos_per_prompt > 1 and CFG is Enabled (#13121)
- setuptools pkg_resources Errors (#13129)
- setuptools pkg_resources Bug for PR GPU Tests (#13132)
- setuptools CI Fix as the Failing Pipelines are Deprecated (#13149)
- ftfy import for PRX Pipeline (#13154)
- typing Import Error (#13178)
- ChromaInpaintPipeline (#12848)
- *pooled_* mentions from Chroma inpaint (#13026)
- get_image_features API (#13052)
- prompt and prior_token_ids to be provided simultaneously in GlmImagePipeline (#13092)

The release features a number of new image and video pipelines, a new caching method, a new training script, new kernels-powered attention backends, and more. It is quite packed with a lot of new stuff, so make sure you read the release notes fully 🚀
kernels-powered attention backends

The kernels library helps you save a lot of time by providing pre-built kernel interfaces for various environments and accelerators. This release features three new kernels-powered attention backends:
If any of the above backends is supported by your development environment, you can skip the manual process of building the corresponding kernels and just use:
# Make sure you have `kernels` installed: `pip install kernels`.
# You can choose `flash_hub` or `sage_hub`, too.
pipe.transformer.set_attention_backend("_flash_3_hub")
For more details, check out the documentation.
TaylorSeer is now supported in Diffusers, delivering up to 3x speedups with little to no quality compromise. Thanks to @toilaluan for contributing this in https://github.com/huggingface/diffusers/pull/12648. Check out the documentation here.
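At its core, TaylorSeer treats a module's output as a smooth function of the timestep and extrapolates it with a truncated Taylor series instead of recomputing it at every step. A minimal first-order sketch of that idea (conceptual only, not the actual Diffusers implementation):

```python
def taylor_extrapolate(value, derivative, dt):
    """First-order Taylor prediction: f(t + dt) ≈ f(t) + f'(t) * dt."""
    return value + derivative * dt

# If a block's output was 1.0 and has been changing by 0.1 per step,
# predict the next step's output instead of recomputing the block.
predicted = taylor_extrapolate(1.0, 0.1, 1.0)
```

Higher-order terms tighten the approximation, and cached values are refreshed periodically, which is where the speed/quality trade-off comes from.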
Our Flux.2 integration features a LoRA fine-tuning script that you can check out here. We provide a number of optimizations to help make it run on consumer GPUs.
AttentionMixin: Making certain compatible models subclass from the AttentionMixin class helped us get rid of 2K LoC. Going forward, users can expect more such refactorings that will help make the library leaner and simpler. Check out https://github.com/huggingface/diffusers/pull/12463 for more details.

- VAETesterMixin to consolidate tests for slicing and tiling by @sayakpaul in #12374
- AutoencoderMixin to abstract common methods by @sayakpaul in #12473
- upper() by @sayakpaul in #12479
- lodestones/Chroma1-HD by @josephrocca in #12508
- local_dir by @DN6 in #12381
- testing_utils.py by @DN6 in #12621
- test_save_load_float16 by @kaixuanliu in #12500
- SanaImageToVideoPipeline support by @lawrence-cj in #12634
- AutoencoderKLWan's dim_mult default value back to list by @dg845 in #12640
- kernels by @sayakpaul in #12439
- record_stream in group offloading is not working properly by @KimbingNg in #12721
- AttentionMixin for compatible classes by @sayakpaul in #12463
- upcast_vae in SDXL based pipelines by @DN6 in #12619
- from_single_file by @hlky in #12756

The following contributors have made significant changes to the library over the last release:
- AutoencoderKLWan's dim_mult default value back to list (#12640)
- local_dir (#12381)
- testing_utils.py (#12621)
- upcast_vae in SDXL based pipelines (#12619)
- SanaImageToVideoPipeline support (#12634)

Thanks to @naykun for the following PRs that improve Qwen-Image Edit:
This release comes packed with new image generation and editing pipelines, a new video pipeline, new training scripts, quality-of-life improvements, and much more. Read the rest of the release notes fully to not miss out on the fun stuff.
We welcomed new pipelines in this release:
This update to Wan provides significant improvements in video fidelity, prompt adherence, and style. Please check out the official doc to learn more.
Flux-Kontext is a 12-billion-parameter rectified flow transformer capable of editing images based on text instructions. Please check out the official doc to learn more about it.
After a successful run of delivering language models and vision-language models, the Qwen team is back with an image generation model, which is Apache-2.0 licensed! It achieves significant advances in complex text rendering and precise image editing. To learn more about this powerful model, refer to our docs.
Thanks to @naykun for contributing both Qwen-Image and Qwen-Image-Edit via this PR and this PR.
Make these newly added models your own with our training scripts:
Following the 🤗 Transformers’ philosophy of single-file modeling implementations, we have started implementing modeling code in single and self-contained files. The Flux Transformer code is one example of this.
We have massively refactored how we do attention in the models. This allows us to provide support for different attention backends (such as PyTorch native scaled_dot_product_attention, Flash Attention 3, SAGE attention, etc.) in the library seamlessly.
Having attention supported this way also allows us to integrate different parallelization mechanisms, which we’re actively working on. Follow this PR if you’re interested.
Users shouldn’t be affected at all by these changes. Please open an issue if you face any problems.
Regional compilation trims cold-start latency by only compiling the small and frequently-repeated block(s) of a model - typically a transformer layer - and enables reusing compiled artifacts for every subsequent occurrence. For many diffusion architectures, this delivers the same runtime speedups as full-graph compilation and reduces compile time by 8–10x. Refer to this doc to learn more.
Thanks to @anijain2305 for contributing this feature in this PR.
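The reason this works: a diffusion transformer is mostly one block type repeated many times, so compiling that block once and reusing the compiled artifact for every occurrence avoids almost all of the redundant work. A torch-free sketch of the caching idea (illustrative only; the real machinery lives inside torch.compile):

```python
compile_cache = {}
compile_count = 0

def compile_block(block_type):
    """Pretend to compile a block, reusing the artifact for repeated block types."""
    global compile_count
    if block_type not in compile_cache:
        compile_count += 1  # the expensive step runs once per unique block type
        compile_cache[block_type] = f"compiled<{block_type}>"
    return compile_cache[block_type]

# A model with 28 identical transformer layers triggers a single compilation.
artifacts = [compile_block("TransformerBlock") for _ in range(28)]
```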
We have also authored a number of posts that center around the use of torch.compile. You can check them out at the links below:
Users can now load pipelines directly on an accelerator device, leading to significantly faster load times. This is particularly evident when loading large pipelines like Wan and Qwen-Image.
from diffusers import DiffusionPipeline
import torch
ckpt_id = "Qwen/Qwen-Image"
pipe = DiffusionPipeline.from_pretrained(
- ckpt_id, torch_dtype=torch.bfloat16
- ).to("cuda")
+ ckpt_id, torch_dtype=torch.bfloat16, device_map="cuda"
+ )
You can speed up loading even more by enabling parallelized loading of state dict shards. This is particularly helpful when you’re working with large models like Wan and Qwen-Image, where the model state dicts are typically sharded across multiple files.
import os
os.environ["HF_ENABLE_PARALLEL_LOADING"] = "yes"
# rest of the loading code
....
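What the flag enables, conceptually, is reading the state-dict shards concurrently instead of one after another. A plain-Python sketch of that pattern (illustrative only; the actual shard loading happens inside from_pretrained):

```python
from concurrent.futures import ThreadPoolExecutor

def load_shard(shard_name):
    # Stand-in for reading one safetensors shard from disk.
    return {f"{shard_name}.weight": 0}

shard_names = ["model-00001-of-00003", "model-00002-of-00003", "model-00003-of-00003"]

# With parallel loading, the I/O-bound shard reads overlap in time.
state_dict = {}
with ThreadPoolExecutor() as pool:
    for shard in pool.map(load_shard, shard_names):
        state_dict.update(shard)
```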
@Isotr0py contributed support for native GGUF CUDA kernels in this PR. This should provide an approximately 10% improvement in inference speed.
We have also worked on a tool for converting regular checkpoints to GGUF, letting the community easily share their GGUF checkpoints. Learn more here.
We now support loading of Diffusers format GGUF checkpoints.
You can learn more about all of this in our GGUF official docs.
Modular Diffusers is a system for building diffusion pipelines from individual pipeline blocks. It is highly customisable, with blocks that can be mixed and matched to adapt or create a pipeline for a specific workflow or multiple workflows.
The API is currently in active development and is being released as an experimental feature. Learn more in our docs.
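The block-composition idea can be illustrated with a plain-Python sketch (conceptual only; this is not the actual ModularPipeline API, see the docs for the real interface):

```python
class EncodePrompt:
    """A block that reads and updates a shared pipeline state."""
    def __call__(self, state):
        state["embeds"] = f"embeds({state['prompt']})"
        return state

class Denoise:
    """A block that consumes the previous block's output."""
    def __call__(self, state):
        state["latents"] = f"denoised({state['embeds']})"
        return state

def run_blocks(blocks, state):
    # Blocks can be re-ordered, swapped, or reused across workflows.
    for block in blocks:
        state = block(state)
    return state

out = run_blocks([EncodePrompt(), Denoise()], {"prompt": "a cat"})
```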
- test_float16_inference in unit test by @kaixuanliu in #11809
- from_single_file method for WanVACE3DTransformer by @J4BEZ in #11807
- exclude_modules with Wan VACE by @sayakpaul in #11843
- _keep_in_fp32_modules by @a-r-r-o-w in #11851
- Transformer2DModel and finegrained variants by @sayakpaul in #11947
- guidance_scale docstring for guidance_distilled models by @sayakpaul in #11935
- prompt_2 optional in Flux Pipelines by @DN6 in #12073
- transformer. in key by @Beinsezii in #12101
- lightx2v/Qwen-Image-Lightning by @sayakpaul in #12119
- local_files_only=True when using sharded checkpoints by @sayakpaul in #12005
- hf_quantizer in cache warmup by @sayakpaul in #12043

The following contributors have made significant changes to the library over the last release:
Wan VACE supports various generation techniques that achieve controllable video generation. It comes in two variants: a 1.3B model for fast iteration and prototyping, and a 14B model for high-quality generation. Some of the capabilities include:
The code snippets available in this pull request demonstrate some examples of how videos can be generated with controllability signals.
Check out the docs to learn more.
Cosmos-Predict2 is a key branch of the Cosmos World Foundation Models (WFMs) ecosystem for Physical AI, specializing in future state prediction through advanced world modeling. It offers two powerful capabilities: text-to-image generation for creating high-quality images from text descriptions, and video-to-world generation for producing visual simulations from video inputs.
The Video2World model comes in a 2B and 14B variant. Check out the docs to learn more.
LTX 0.9.7 and its distilled variants are the latest in the family of models released by Lightricks.
Check out the docs to learn more.
Framepack is a novel method for enabling long video generation. There are two released variants of Hunyuan Video trained using this technique. Check out the docs to learn more.
The FusionX family of models and LoRAs, built on top of Wan2.1-14B, should already be supported. To load the model, use from_single_file():
import torch
from diffusers import WanTransformer3DModel

transformer = WanTransformer3DModel.from_single_file(
    "https://huggingface.co/vrgamedevgirl84/Wan14BT2VFusioniX/blob/main/Wan14Bi2vFusioniX_fp16.safetensors",
    torch_dtype=torch.bfloat16
)
To load the LoRAs, use load_lora_weights():
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "Wan-AI/Wan2.1-T2V-14B-Diffusers",
    torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights(
"vrgamedevgirl84/Wan14BT2VFusioniX", weight_name="FusionX_LoRa/Wan2.1_T2V_14B_FusionX_LoRA.safetensors"
)
AccVideo and CausVid are two novel distillation techniques that speed up the generation time of video diffusion models while preserving quality. Diffusers supports loading their extracted LoRAs with their respective models.
Text-to-image models from the Cosmos-Predict2 release. The models come in 2B and 14B variants. Check out the docs to learn more.
Chroma is an 8.9B parameter model based on FLUX.1-schnell. It’s fully Apache 2.0 licensed, ensuring that anyone can use, modify, and build on top of it. Check out the docs to learn more.
Thanks to @Ednaordinary for contributing it in this PR!
VisualCloze: A Universal Image Generation Framework via Visual In-Context Learning is a universal image generation framework based on visual in-context learning that offers key capabilities:
Check out the docs to learn more. Thanks to @lzyhha for contributing this in this PR!
torch.compile support

We have worked with the PyTorch team to improve how we provide torch.compile() compatibility throughout the library. More specifically, we now test widely used models like Flux for recompilation and graph-break issues, which can get in the way of fully realizing the benefits of torch.compile(). Refer to the following links to learn more:
Additionally, users can combine offloading with compilation to get a better speed-memory trade-off. Below is an example:
<details> <summary>Code</summary>

import torch
from diffusers import DiffusionPipeline
torch._dynamo.config.cache_size_limit = 10000
pipeline = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
# Compile.
pipeline.transformer.compile()
image = pipeline(
prompt="An astronaut riding a horse on Mars",
guidance_scale=0.,
height=768,
width=1360,
num_inference_steps=4,
max_sequence_length=256,
).images[0]
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
</details>
This is compatible with group offloading, too. Interested readers can check out the concerned PRs below:
You can substantially reduce memory requirements by combining quantization with offloading and then improving speed with torch.compile(). Below is an example:
from diffusers import BitsAndBytesConfig as DiffusersBitsAndBytesConfig
from transformers import BitsAndBytesConfig as TransformersBitsAndBytesConfig
from diffusers import AutoModel, FluxPipeline
from transformers import T5EncoderModel
import torch
torch._dynamo.config.recompile_limit = 1000
torch_dtype = torch.bfloat16
quant_kwargs = {"load_in_4bit": True, "bnb_4bit_compute_dtype": torch_dtype, "bnb_4bit_quant_type": "nf4"}
text_encoder_2_quant_config = TransformersBitsAndBytesConfig(**quant_kwargs)
dit_quant_config = DiffusersBitsAndBytesConfig(**quant_kwargs)
ckpt_id = "black-forest-labs/FLUX.1-dev"
text_encoder_2 = T5EncoderModel.from_pretrained(
ckpt_id,
subfolder="text_encoder_2",
quantization_config=text_encoder_2_quant_config,
torch_dtype=torch_dtype,
)
transformer = AutoModel.from_pretrained(
ckpt_id,
subfolder="transformer",
quantization_config=dit_quant_config,
torch_dtype=torch_dtype,
)
pipe = FluxPipeline.from_pretrained(
ckpt_id,
transformer=transformer,
text_encoder_2=text_encoder_2,
torch_dtype=torch_dtype,
)
pipe.enable_model_cpu_offload()
pipe.transformer.compile()
image = pipe(
prompt="An astronaut riding a horse on Mars",
guidance_scale=3.5,
height=768,
width=1360,
num_inference_steps=28,
max_sequence_length=512,
).images[0]
Starting from bitsandbytes==0.46.0 onwards, bnb-quantized models should be fully compatible with torch.compile() without graph-breaks. This means that when compiling a bnb-quantized model, users can do: model.compile(fullgraph=True). This can significantly improve speed while still providing memory benefits. The figure below provides a comparison with Flux.1-Dev. Refer to this benchmarking script to learn more.
Note that for 4-bit bnb models, you currently need to install a PyTorch nightly if fullgraph=True is specified during compilation.
Huge shoutout to @anijain2305 and @StrongerXi from the PyTorch team for the incredible support.
Users can now provide a quantization config while initializing a pipeline:
import torch
from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
pipeline_quant_config = PipelineQuantizationConfig(
quant_backend="bitsandbytes_4bit",
quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
components_to_quantize=["transformer", "text_encoder_2"],
)
pipe = DiffusionPipeline.from_pretrained(
"black-forest-labs/FLUX.1-dev",
quantization_config=pipeline_quant_config,
torch_dtype=torch.bfloat16,
).to("cuda")
image = pipe("photo of a cute dog").images[0]
This lowers the barrier to entry for users who want to use quantization without having to write much code. Refer to the documentation to learn more about the different configurations allowed through PipelineQuantizationConfig.
In the previous release, we shipped “group offloading” which lets you offload blocks/nodes within a model, optimizing its memory consumption. It also lets you overlap this offloading with computation, providing a good speed-memory trade-off, especially in low VRAM environments.
However, you still need a considerable amount of system RAM to make offloading work effectively. So, low VRAM and low RAM environments would still not work.
Starting with this release, users additionally have the option to offload to disk instead of RAM, further lowering memory consumption. Set offload_to_disk_path to enable this feature.
pipeline.transformer.enable_group_offload(
onload_device="cuda",
offload_device="cpu",
offload_type="leaf_level",
offload_to_disk_path="path/to/disk"
)
Refer to these two tables to compare the speed and memory trade-offs.
It is beneficial to include, in a LoRA state dict, the LoraConfig that was used to train the LoRA. In its absence, users were restricted to using the same LoRA alpha as the LoRA rank. We have modified the most popular training scripts to allow passing a custom lora_alpha through the CLI. Refer to this thread for more updates and to this comment for some extended clarifications.
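For context, LoRA applies its learned update scaled by alpha divided by rank, i.e. W' = W + (alpha / r) * B @ A, so when the alpha is missing from a checkpoint, loaders had to assume alpha == rank (a scale of 1.0). A quick illustration of the arithmetic:

```python
def lora_scale(lora_alpha: float, rank: int) -> float:
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / rank

# The old implicit assumption, alpha == rank, leaves the update unscaled:
default_scale = lora_scale(16, 16)   # 1.0
# A custom alpha (now settable in the training scripts) changes the
# strength of the learned update:
stronger = lora_scale(32, 16)        # 2.0
```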
We have worked on a two-part series discussing the support of quantization in Diffusers. Check them out:
- DIFFUSERS_REQUEST_TIMEOUT for notification bot by @sayakpaul in #11273
- KolorsPipelineFastTests::test_inference_batch_single_identical pass on XPU by @faaany in #11313
- skrample section to community_projects.md by @Beinsezii in #11319
- AutoModel usage by @sayakpaul in #11300
- transformers>4.47.1 by @DN6 in #11293
- hotswap better by @sayakpaul in #11333
- StableDiffusionXLControlNetAdapterInpaintPipeline incorrectly inherited StableDiffusionLoraLoaderMixin by @Kazuki-Yoda in #11357
- torch.compile fullgraph compatibility for Hunyuan Video by @a-r-r-o-w in #11457
- peft by @sayakpaul in #11502
- removeprefix to preserve sanity by @sayakpaul in #11493
- torch.compile() with LoRA hotswapping by @sayakpaul in #11322
- torch_dtype="auto" option from docstrings by @johannaSommer in #11513
- load_lora_weights() for Flux and a test by @sayakpaul in #11595
- variant and safetensor file does not match by @kaixuanliu in #11587
- torch.compile() CI and tests by @sayakpaul in #11508
- [Wan] Fix VAE sampling mode in WanVideoToVideoPipeline by @tolgacangoz in #11639
- device_map clarifications by @sayakpaul in #11681
- AutoencoderKLWan.clear_cache by 886% by @misrasaurabh1 in #11665
- is_compileable property to quantizers by @sayakpaul in #11736
- PipelineQuantizationConfig by @sayakpaul in #11750
- apply_rotary_emb functions' comments by @tolgacangoz in #11717
- return by @sayakpaul in #11771

The following contributors have made significant changes to the library over the last release:
- transformers>4.47.1 (#11293)

Wan2.1 is a comprehensive and open suite of video foundation models that pushes the boundaries of video generation. The release includes four model variants and three pipelines for Text-to-Video, Image-to-Video, and Video-to-Video.
- Wan-AI/Wan2.1-T2V-1.3B-Diffusers
- Wan-AI/Wan2.1-T2V-14B-Diffusers
- Wan-AI/Wan2.1-I2V-14B-480P-Diffusers
- Wan-AI/Wan2.1-I2V-14B-720P-Diffusers

Check out the docs here to learn more.
LTX Video 0.9.5 is the updated version of the super-fast LTX Video model series. The latest model introduces additional conditioning options, such as keyframe-based animation and video extension (both forward and backward).
To support these additional conditioning inputs, we’ve introduced the LTXConditionPipeline and LTXVideoCondition object.
To learn more about the usage, check out the docs here.
Hunyuan utilizes a pre-trained Multimodal Large Language Model (MLLM) with a Decoder-Only architecture as the text encoder. The input image is processed by the MLLM to generate semantic image tokens. These tokens are then concatenated with the video latent tokens, enabling comprehensive full-attention computation across the combined data and seamlessly integrating information from both the image and its associated caption.
To learn more, check out the docs here.
SANA-Sprint is an efficient diffusion model for ultra-fast text-to-image generation. SANA-Sprint is built on a pre-trained foundation model and augmented with hybrid distillation, dramatically reducing inference steps from 20 to 1-4, rivaling the quality of models like Flux.
Shoutout to @lawrence-cj for their help and guidance on this PR.
Check out the pipeline docs of SANA-Sprint to learn more.
Lumina-Image-2.0 is a 2B parameter flow-based diffusion transformer for text-to-image generation released under the Apache 2.0 license.
Check out the docs to learn more. Thanks to @zhuole1025 for contributing this through this PR.
One can also LoRA fine-tune Lumina2, taking advantage of its Apache 2.0 licensing. Check out the guide for more details.
OmniGen is a unified image generation model that can handle multiple tasks including text-to-image, image editing, subject-driven generation, and various computer vision tasks within a single framework. The model consists of a VAE, and a single transformer based on Phi-3 that handles text and image encoding as well as the diffusion process.
Check out the docs to learn more about OmniGen. Thanks to @staoxiao for contributing OmniGen in this PR.
PyTorch supports torch.float8_e4m3fn and torch.float8_e5m2 as weight storage dtypes, but they can’t be used for computation on many devices due to unimplemented kernel support.
However, you can still use these dtypes to store model weights in FP8 precision and upcast them to a widely supported dtype such as torch.float16 or torch.bfloat16 on-the-fly when the layers are used in the forward pass. This is known as layerwise weight-casting. This can potentially cut down the VRAM requirements of a model by 50%.
import torch
from diffusers import CogVideoXPipeline, CogVideoXTransformer3DModel
from diffusers.utils import export_to_video
model_id = "THUDM/CogVideoX-5b"
# Load the model in bfloat16 and enable layerwise casting
transformer = CogVideoXTransformer3DModel.from_pretrained(model_id, subfolder="transformer", torch_dtype=torch.bfloat16)
transformer.enable_layerwise_casting(storage_dtype=torch.float8_e4m3fn, compute_dtype=torch.bfloat16)
# Load the pipeline
pipe = CogVideoXPipeline.from_pretrained(model_id, transformer=transformer, torch_dtype=torch.bfloat16)
pipe.to("cuda")
prompt = (
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
"atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
export_to_video(video, "output.mp4", fps=8)
Group offloading is the middle ground between sequential and model offloading. It works by offloading groups of internal layers (either torch.nn.ModuleList or torch.nn.Sequential), which uses less memory than model-level offloading. It is also faster than sequential-level offloading because the number of device synchronizations is reduced.
On CUDA devices, we also have the option to enable using layer prefetching with CUDA Streams. The next layer to be executed is loaded onto the accelerator device while the current layer is being executed which makes inference substantially faster while still keeping VRAM requirements very low. With this, we introduce the idea of overlapping computation with data transfer.
One thing to note is that using CUDA streams can cause a considerable spike in CPU RAM usage. Please ensure that the available CPU RAM is 2 times the size of the model if you choose to set use_stream=True. You can reduce CPU RAM usage by setting low_cpu_mem_usage=True. This should limit the CPU RAM used to be roughly the same as the size of the model, but will introduce slight latency in the inference process.
You can also use record_stream=True when using use_stream=True to obtain more speedups at the expense of slightly increased memory usage.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# We can utilize the enable_group_offload method for Diffusers model implementations
pipe.transformer.enable_group_offload(
onload_device=onload_device,
offload_device=offload_device,
offload_type="leaf_level",
use_stream=True
)
prompt = (
"A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. "
"The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other "
"pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, "
"casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. "
"The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical "
"atmosphere of this unique musical performance."
)
video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
# This utilized about 14.79 GB. It can be further reduced by using tiling and leaf_level offloading throughout the pipeline.
print(f"Max memory allocated: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
export_to_video(video, "output.mp4", fps=8)
Group offloading can also be applied to non-Diffusers models such as text encoders from the transformers library.
import torch
from diffusers import CogVideoXPipeline
from diffusers.hooks import apply_group_offloading
from diffusers.utils import export_to_video
# Load the pipeline
onload_device = torch.device("cuda")
offload_device = torch.device("cpu")
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
# For any other model implementations, the apply_group_offloading function can be used
apply_group_offloading(pipe.text_encoder, onload_device=onload_device, offload_type="block_level", num_blocks_per_group=2)
Remote components are an experimental feature designed to offload memory-intensive steps of the inference pipeline to remote endpoints. The initial implementation focuses primarily on VAE decoding operations. Below are the currently supported model endpoints:
| Pipeline | Endpoint | VAE |
|---|---|---|
| Stable Diffusion v1 | https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud | stabilityai/sd-vae-ft-mse |
| Stable Diffusion XL | https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud | madebyollin/sdxl-vae-fp16-fix |
| Flux | https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud | black-forest-labs/FLUX.1-schnell |
| HunyuanVideo | https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud | hunyuanvideo-community/HunyuanVideo |
This is an example of using remote decoding with the Hunyuan Video pipeline:
<details> <summary>Code</summary>

import torch
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
from diffusers.utils.remote_utils import remote_decode

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
model_id, transformer=transformer, vae=None, torch_dtype=torch.float16
).to("cuda")
latent = pipe(
prompt="A cat walks on the grass, realistic",
height=320,
width=512,
num_frames=61,
num_inference_steps=30,
output_type="latent",
).frames
video = remote_decode(
endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
tensor=latent,
output_type="mp4",
)
if isinstance(video, bytes):
with open("video.mp4", "wb") as f:
f.write(video)
</details>
Check out the docs to learn more.
Cached Inference for Diffusion Transformer models is a performance optimization that significantly accelerates the denoising process by caching intermediate values. This technique reduces redundant computations across timesteps, resulting in faster generation with a slight dip in output quality.
Check out the docs to learn more about the available caching methods.
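The common mechanism behind these caching methods can be sketched simply: fully recompute an expensive block only every few timesteps and reuse the cached output in between (a conceptual sketch, not the actual implementation; the hypothetical skip_range parameter mirrors the *_skip_range options in the configs below):

```python
class CachedBlock:
    """Recompute only every `skip_range` steps; otherwise reuse the cache."""
    def __init__(self, skip_range):
        self.skip_range = skip_range
        self.cache = None
        self.compute_count = 0

    def __call__(self, step, x):
        if self.cache is None or step % self.skip_range == 0:
            self.compute_count += 1
            self.cache = x * 2  # stand-in for the real attention computation
        return self.cache

block = CachedBlock(skip_range=2)
outputs = [block(step, step) for step in range(8)]  # half the compute is skipped
```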
Pyramid Attention Broadcast
<details> <summary>Code</summary>

import torch
from diffusers import CogVideoXPipeline, PyramidAttentionBroadcastConfig
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")
config = PyramidAttentionBroadcastConfig(
spatial_attention_block_skip_range=2,
spatial_attention_timestep_skip_range=(100, 800),
current_timestep_callback=lambda: pipe.current_timestep,
)
pipe.transformer.enable_cache(config)
</details>
FasterCache
<details> <summary>Code</summary>

import torch
from diffusers import CogVideoXPipeline, FasterCacheConfig
pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")
config = FasterCacheConfig(
spatial_attention_block_skip_range=2,
spatial_attention_timestep_skip_range=(-1, 901),
unconditional_batch_skip_range=2,
attention_weight_callback=lambda _: 0.5,
is_guidance_distilled=True,
)
pipe.transformer.enable_cache(config)
</details>
Diffusers now supports the Quanto quantization backend, which provides float8, int8, int4, and int2 quantization dtypes.
import torch
from diffusers import FluxTransformer2DModel, QuantoConfig
model_id = "black-forest-labs/FLUX.1-dev"
quantization_config = QuantoConfig(weights_dtype="float8")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
Quanto int8 models are also compatible with torch.compile:
import torch
from diffusers import FluxTransformer2DModel, QuantoConfig
model_id = "black-forest-labs/FLUX.1-dev"
quantization_config = QuantoConfig(weights_dtype="int8")
transformer = FluxTransformer2DModel.from_pretrained(
    model_id,
    subfolder="transformer",
    quantization_config=quantization_config,
    torch_dtype=torch.bfloat16,
)
transformer.compile()
uintx TorchAO checkpoints with torch>=2.6
TorchAO checkpoints currently have to be serialized using pickle. For some quantization dtypes using the uintx format, such as uint4wo, this involves saving subclassed TorchAO Tensor objects in the model file. This made loading the models directly with Diffusers tricky, since we do not allow deserializing arbitrary Python objects from pickle files.
Torch 2.6 allows adding expected Tensors to torch safe globals, which lets us directly load TorchAO checkpoints with these objects.
- state_dict = torch.load("/path/to/flux_uint4wo/diffusion_pytorch_model.bin", weights_only=False, map_location="cpu")
- with init_empty_weights():
- transformer = FluxTransformer2DModel.from_config("/path/to/flux_uint4wo/config.json")
- transformer.load_state_dict(state_dict, strict=True, assign=True)
+ transformer = FluxTransformer2DModel.from_pretrained("/path/to/flux_uint4wo/")
We have shipped a couple of improvements on the LoRA front in this release.
🚨 Improved coverage for loading non-diffusers LoRA checkpoints for Flux
Take note of the breaking change introduced in this PR 🚨 We suggest upgrading your peft installation to the latest version (pip install -U peft), especially when dealing with Flux LoRAs.
torch.compile() support when hotswapping LoRAs without triggering recompilation
A common use case when serving multiple adapters is to load one adapter first, generate images, load another adapter, generate more images, load another adapter, etc. This workflow normally requires calling load_lora_weights(), set_adapters(), and possibly delete_adapters() to save memory. Moreover, if the model is compiled using torch.compile, performing these steps requires recompilation, which takes time.
To better support this common workflow, you can "hotswap" a LoRA adapter to avoid accumulating memory and, in some cases, recompilation. It requires an adapter to already be loaded; the new adapter's weights are then swapped in-place for the existing adapter's.
Check out the docs to learn more about this feature.
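The mechanics can be pictured with a toy sketch in plain Python (this is not the Diffusers or PEFT API): hotswapping writes the new adapter's weights into the existing buffers in place, so nothing the compiler specialized on, such as object identity or shape, changes.

```python
# Toy illustration of in-place adapter swapping (names and structure invented).
adapter_a = {"lora_down": [1.0, 2.0], "lora_up": [3.0, 4.0]}
adapter_b = {"lora_down": [9.0, 8.0], "lora_up": [7.0, 6.0]}

# "Load" adapter A into the model's buffers.
active = {name: list(vals) for name, vals in adapter_a.items()}
ids_before = {name: id(buf) for name, buf in active.items()}

def hotswap(active, new_weights):
    for name, buf in active.items():
        buf[:] = new_weights[name]  # overwrite contents without rebinding the buffer

hotswap(active, adapter_b)
# Buffer identities are unchanged, but the weights are adapter B's.
assert {name: id(buf) for name, buf in active.items()} == ids_before
assert active["lora_down"] == [9.0, 8.0]
```

Because the buffers themselves are reused, memory does not grow with each swapped adapter, and a compiled graph keyed on those buffers need not be rebuilt.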
The other major change is support for dtype maps for pipelines. Since various pipelines require their components to run in different compute dtypes, we now support passing a dtype map when initializing a pipeline:
from diffusers import HunyuanVideoPipeline
import torch
pipe = HunyuanVideoPipeline.from_pretrained(
"hunyuanvideo-community/HunyuanVideo",
torch_dtype={"transformer": torch.bfloat16, "default": torch.float16},
)
print(pipe.transformer.dtype, pipe.vae.dtype) # (torch.bfloat16, torch.float16)
This release includes an AutoModel object similar to the one found in transformers that automatically fetches the appropriate model class for the provided repo.
from diffusers import AutoModel
unet = AutoModel.from_pretrained("runwayml/stable-diffusion-v1-5", subfolder="unet")
- StableDiffusion3Img2ImgPipeline by @guiyrt in #10589
- requires_grad False. by @sayakpaul in #10607
- scheduling_ddpm.py docs by @JacobHelwig in #10648
- fp8_e4m3_bf16_max_memory < fp8_e4m3_fp32_max_memory by @sayakpaul in #10669
- Model Search by @suzukimain in #10417
- Self type hint to ModelMixin's from_pretrained by @hlky in #10742
- use_lu_lambdas and use_karras_sigmas with beta_schedule=squaredcos_cap_v2 in DPMSolverMultistepScheduler by @hlky in #10740
- MultiControlNetUnionModel on SDXL by @guiyrt in #10747
- to+FromOriginalModelMixin/FromSingleFileMixin from_single_file type hint by @hlky in #10811
- set_adapters() robust on silent failures. by @sayakpaul in #9618
- isinstance for arg checks in GGUFParameter by @AstraliteHeart in #10834
- encode_prompt() in isolation by @sayakpaul in #10438
- generation_config in pipeline by @JeffersonQin in #10779
- main by @sayakpaul in #10289
- device_map in load_model_dict_into_meta by @hlky in #10851
- from_pretrained kwargs by @guiyrt in #10758
- torch_dtype in Kolors text encoder with transformers v4.49 by @hlky in #10816
- remote_decode to remote_utils by @hlky in #10898
- huggingface_hub by @hanouticelina in #10970
- num_train_epochs is passed in a distributed training env by @flyxiv in #10973
- [Research Project] Add AnyText: Multilingual Visual Text Generation And Editing by @tolgacangoz in #8998
- output_size in repeat_interleave by @hlky in #11030
- formatted_images initialization compact by @YanivDorGalron in #10801
- export_to_video by @hlky in #11090
- torch_dtype and don't use when quantization_config is set by @hlky in #11039
- load_lora_adapter in PeftAdapterMixin class by @kentdan3msu in #11155
- installation.md by @remarkablemark in #11179
- latents_mean and latents_std to SDXLLongPromptWeightingPipeline by @hlky in #11034
- F.pad by @bm-synth in #10620
- save_model in ModelMixin save_pretrained and use safe_serialization=False in test by @hlky in #11196
- torch_dtype map by @hlky in #11194
- LTXConditionPipeline for text-only conditioning by @tolgacangoz in #11174
- record_stream when using CUDA streams during group offloading by @sayakpaul in #11081

The following contributors have made significant changes to the library over the last release:
- StableDiffusion3Img2ImgPipeline (#10589)
- MultiControlNetUnionModel on SDXL (#10747)
- from_pretrained kwargs (#10758)
- Model Search (#10417)
- [Research Project] Add AnyText: Multilingual Visual Text Generation And Editing (#8998)
- LTXConditionPipeline for text-only conditioning (#11174)

This patch release:
- unload_lora_weights for Flux Control
- load_lora_into_text_encoder() and fuse_lora() copied from by @sayakpaul in #10495
- unload_lora_weights() for Flux Control. by @sayakpaul in #10206

This patch release fixes a few bugs related to the TorchAO Quantizer introduced in v0.32.0.
Refer to our documentation to learn more about how to use different quantization backends.
https://github.com/user-attachments/assets/34d5f7ca-8e33-4401-8109-5c245ce7595f
This release took a while, but it has many exciting updates. It contains several new pipelines for image and video generation, new quantization backends, and more.
Going forward, to provide more transparency to the community about ongoing developments and releases in Diffusers, we will be making use of a roadmap tracker.
Open video generation models are on the rise, and we’re pleased to provide comprehensive integration support for all of them. The following video pipelines are bundled in this release:
Check out this section to learn more about the fine-tuning options available for these new video models.
Important Note about the new Flux Models
We can combine the regular Flux.1 Dev LoRAs with Flux Control LoRAs, Flux Control, and Flux Fill. For example, you can enable few-steps inference with Flux Fill using:
from diffusers import FluxFillPipeline
from diffusers.utils import load_image
import torch
pipe = FluxFillPipeline.from_pretrained(
"black-forest-labs/FLUX.1-Fill-dev", torch_dtype=torch.bfloat16
).to("cuda")
adapter_id = "alimama-creative/FLUX.1-Turbo-Alpha"
pipe.load_lora_weights(adapter_id)
image = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup.png")
mask = load_image("https://huggingface.co/datasets/diffusers/diffusers-images-docs/resolve/main/cup_mask.png")
image = pipe(
prompt="a white paper cup",
image=image,
mask_image=mask,
height=1632,
width=1232,
guidance_scale=30,
num_inference_steps=8,
max_sequence_length=512,
generator=torch.Generator("cpu").manual_seed(0)
).images[0]
image.save("flux-fill-dev.png")
To learn more, check out the documentation.
[!NOTE]
SANA is a small model compared to others like Flux; Sana-0.6B can be deployed on a 16GB laptop GPU and takes less than 1 second to generate a 1024×1024 image. We support LoRA fine-tuning of SANA. Check out this section for more details.
Please be aware of the following caveats:
safetensors currently. This may change in the future.

This release features many new training scripts for the community to play with:
- require_accelerate_version_greater by @faaany in #9746
- load_lora_adapter() for compatible models by @sayakpaul in #9712
- controlnet module by @sayakpaul in #8768
- AttentionProcessor type by @Prgckwb in #9909
- save_lora_adapter() by @sayakpaul in #9862
- pipelines tests device-agnostic (part1) by @faaany in #9399
- beta, exponential and karras sigmas to FlowMatchEulerDiscreteScheduler by @hlky in #10001
- pipelines tests device-agnostic (part2) by @faaany in #9400
- sigmas to Flux pipelines by @hlky in #10081
- num_images_per_prompt>1 with Skip Guidance Layers in StableDiffusion3Pipeline by @hlky in #10086
- sigmas to np.array in FlowMatch set_timesteps by @hlky in #10088
- skip_guidance_layers in SD3 pipeline by @hlky in #10102
- pipeline_stable_audio formating by @hlky in #10114
- sigmas to pipelines using FlowMatch by @hlky in #10116
- bnb by @ariG23498 in #10012
- Civitai into Existing Pipelines by @suzukimain in #9986
- revision argument when loading single file config by @a-r-r-o-w in #10168
- torch in get_3d_rotary_pos_embed/_allegro by @hlky in #10161
- set_adapters() and attn kwargs outs match by @sayakpaul in #10110
- negative_* from SDXL callback by @hlky in #10203
- torch in get_2d_sincos_pos_embed and get_3d_sincos_pos_embed by @hlky in #10156
- SanaPipeline, SanaPAGPipeline, LinearAttentionProcessor, Flow-based DPM-sovler and so on. by @lawrence-cj in #9982
- t instead of timestep in _apply_perturbed_attention_guidance by @hlky in #10243
- dynamic_shifting to SD3 by @hlky in #10236
- use_flow_sigmas by @hlky in #10242
- set_shift to FlowMatchEulerDiscreteScheduler by @hlky in #10269
- torch in get_2d_rotary_pos_embed by @hlky in #10155
- time_embed_dim of UNet2DModel changeable by @Bichidian in #10262
- from_pretrained by @hlky in #10189
- local_files_only for checkpoints with shards by @hlky in #10294
- sample_size attribute is to accept tuple(h, w) in StableDiffusionPipeline by @Foundsheep in #10181
- from_pretrained() by @sayakpaul in #10316
- get_parameter_dtype by @yiyixuxu in #10342
- .from_single_file() - Add missing .shape by @gau-nernst in #10332

The following contributors have made significant changes to the library over the last release:
- require_accelerate_version_greater (#9746)
- pipelines tests device-agnostic (part1) (#9399)
- pipelines tests device-agnostic (part2) (#9400)
- get_parameter_dtype (#10342)
- beta, exponential and karras sigmas to FlowMatchEulerDiscreteScheduler (#10001)
- sigmas to Flux pipelines (#10081)
- num_images_per_prompt>1 with Skip Guidance Layers in StableDiffusion3Pipeline (#10086)
- sigmas to np.array in FlowMatch set_timesteps (#10088)
- skip_guidance_layers in SD3 pipeline (#10102)
- pipeline_stable_audio formating (#10114)
- sigmas to pipelines using FlowMatch (#10116)
- torch in get_3d_rotary_pos_embed/_allegro (#10161)
- negative_* from SDXL callback (#10203)
- torch in get_2d_sincos_pos_embed and get_3d_sincos_pos_embed (#10156)
- t instead of timestep in _apply_perturbed_attention_guidance (#10243)
- dynamic_shifting to SD3 (#10236)
- use_flow_sigmas (#10242)
- set_shift to FlowMatchEulerDiscreteScheduler (#10269)
- torch in get_2d_rotary_pos_embed (#10155)
- from_pretrained (#10189)
- local_files_only for checkpoints with shards (#10294)
- Civitai into Existing Pipelines (#9986)
- SanaPipeline, SanaPAGPipeline, LinearAttentionProcessor, Flow-based DPM-sovler and so on. (#9982)

Stability AI's latest text-to-image generation model is Stable Diffusion 3.5 Large. SD3.5 Large is the next iteration of Stable Diffusion 3. It comes with two checkpoints (both of which have 8B params):
Make sure to fill out the form on the model page, then run huggingface-cli login before running the code below.
# make sure to update diffusers
# pip install -U diffusers
import torch
from diffusers import StableDiffusion3Pipeline
pipe = StableDiffusion3Pipeline.from_pretrained(
"stabilityai/stable-diffusion-3.5-large", torch_dtype=torch.bfloat16
).to("cuda")
image = pipe(
prompt="a photo of a cat holding a sign that says hello world",
negative_prompt="",
num_inference_steps=40,
height=1024,
width=1024,
guidance_scale=4.5,
).images[0]
image.save("sd3_hello_world.png")
Follow the documentation to learn more.
We added a new text-to-image model, Cogview3-plus, from the THUDM team! The model is DiT-based and supports image generation from 512 to 2048px. Thanks to @zRzRzRzRzRzRzR for contributing it!
from diffusers import CogView3PlusPipeline
import torch
pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3-Plus-3B", torch_dtype=torch.float16).to("cuda")
# Enable it to reduce GPU memory usage
pipe.enable_model_cpu_offload()
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
prompt = "A vibrant cherry red sports car sits proudly under the gleaming sun, its polished exterior smooth and flawless, casting a mirror-like reflection. The car features a low, aerodynamic body, angular headlights that gaze forward like predatory eyes, and a set of black, high-gloss racing rims that contrast starkly with the red. A subtle hint of chrome embellishes the grille and exhaust, while the tinted windows suggest a luxurious and private interior. The scene conveys a sense of speed and elegance, the car appearing as if it's about to burst into a sprint along a coastal road, with the ocean's azure waves crashing in the background."
image = pipe(
    prompt=prompt,
    guidance_scale=7.0,
    num_images_per_prompt=1,
    num_inference_steps=50,
    width=1024,
    height=1024,
).images[0]
image.save("cogview3.png")
Refer to the documentation to learn more.
We have landed native quantization support in Diffusers, starting with bitsandbytes as its first quantization backend. With this, we hope to see large diffusion models becoming much more accessible to run on consumer hardware.
The example below shows how to run Flux.1 Dev with the NF4 data-type. Make sure you install the libraries:
pip install -Uq git+https://github.com/huggingface/transformers@main
pip install -Uq bitsandbytes
pip install -Uq diffusers
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel
import torch
ckpt_id = "black-forest-labs/FLUX.1-dev"
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)
model_nf4 = FluxTransformer2DModel.from_pretrained(
    ckpt_id,
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16
)
Then, we use model_nf4 to instantiate the FluxPipeline:
from diffusers import FluxPipeline
pipeline = FluxPipeline.from_pretrained(
    ckpt_id,
    transformer=model_nf4,
    torch_dtype=torch.bfloat16
)
pipeline.enable_model_cpu_offload()
prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus, basking in a river of melted butter amidst a breakfast-themed landscape. It features the distinctive, bulky body shape of a hippo. However, instead of the usual grey skin, the creature's body resembles a golden-brown, crispy waffle fresh off the griddle. The skin is textured with the familiar grid pattern of a waffle, each square filled with a glistening sheen of syrup. The environment combines the natural habitat of a hippo with elements of a breakfast table setting, a river of warm, melted butter, with oversized utensils or plates peeking out from the lush, pancake-like foliage in the background, a towering pepper mill standing in for a tree. As the sun rises in this fantastical world, it casts a warm, buttery glow over the scene. The creature, content in its butter river, lets out a yawn. Nearby, a flock of birds take flight"
image = pipeline(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=50,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
image.save("whimsical.png")
Follow the documentation here to learn more. Additionally, check out this Colab Notebook that runs Flux.1 Dev in an end-to-end manner with NF4 quantization.
We have a fresh bucket of training scripts with this release:
Video model fine-tuning can be quite expensive. So, we have worked on a repository, cogvideox-factory, which provides memory-optimized scripts to fine-tune the Cog family of models.
- num_train_epochs is passed in a distributed training env by @AnandK27 in #9316
- 16 by @yiyixuxu in #9573
- transformer.device_map by @sayakpaul in #9553
- joint_attention_kwargs is not passed to the FLUX's transformer attention processors by @HorizonWind2004 in #9517
- test_low_cpu_mem_usage_with_loading by @sayakpaul in #9662
- [Community Pipeline] Add 🪆Matryoshka Diffusion Models by @tolgacangoz in #9157
- if not return_dict path by @hlky in #9649
- SD3ControlNetModel to SD3MultiControlNetModel by @hlky in #9652
- HunyuanDiT2DControlNetModel to HunyuanDiT2DMultiControlNetModel by @hlky in #9651
- DPMSolverSDE, Heun, KDPM2Ancestral and KDPM2 by @hlky in #9650
- Euler, EDMEuler, FlowMatchHeun, KDPM2Ancestral by @hlky in #9616
- src/diffusers/training_utils.py by @mreraser in #9606
- community/hd_painter.py by @Jwaminju in #9593
- models/embeddings_flax.py by @Jwaminju in #9592
- make deps_table_update to fix CI tests by @a-r-r-o-w in #9720
- bitsandbytes by @sayakpaul in #9213
- pipline_stable_diffusion.py by @jeongiin in #9590
- schedule_shifted_power usage in 🪆Matryoshka Diffusion Models by @tolgacangoz in #9723

The following contributors have made significant changes to the library over the last release:
- if not return_dict path (#9649)
- SD3ControlNetModel to SD3MultiControlNetModel (#9652)
- HunyuanDiT2DControlNetModel to HunyuanDiT2DMultiControlNetModel (#9651)
- DPMSolverSDE, Heun, KDPM2Ancestral and KDPM2 (#9650)
- Euler, EDMEuler, FlowMatchHeun, KDPM2Ancestral (#9616)
- 16 (#9573)
- [Community Pipeline] Add 🪆Matryoshka Diffusion Models (#9157)
- schedule_shifted_power usage in 🪆Matryoshka Diffusion Models (#9723)

This patch release adds Diffusers support for the upcoming CogVideoX-5B-I2V release (an Image-to-Video generation model)! The model weights will be available by the end of the week on the HF Hub at THUDM/CogVideoX-5b-I2V (Link). Stay tuned for the release!
This release features two new pipelines:
Additionally, we now have support for tiled encoding in the CogVideoX VAE. This can be enabled by calling the vae.enable_tiling() method, and it is used in the new Video-to-Video pipeline to encode sample videos to latents in a memory-efficient manner.
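The core trick behind tiled encoding can be sketched in a few lines (illustrative only; the helper and the tile/overlap sizes below are made up, not the values CogVideoX actually uses): each frame is split into overlapping tiles whose start offsets cover the full extent, so only one tile needs to be resident in memory at a time.

```python
def tile_starts(length, tile, overlap):
    # Start offsets of overlapping tiles along one spatial dimension,
    # ensuring the final tile reaches the end of the frame.
    stride = tile - overlap
    starts = list(range(0, max(length - tile, 0) + 1, stride))
    if starts[-1] + tile < length:
        starts.append(length - tile)
    return starts

# e.g. a 10-pixel dimension covered by 4-pixel tiles overlapping by 1
print(tile_starts(10, 4, 1))  # [0, 3, 6]
```

The overlap regions are later blended so that tile boundaries do not produce visible seams in the decoded latents.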
The code below demonstrates how to use the new image-to-video pipeline:
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Optionally, enable memory optimizations.
# If enabling CPU offloading, remember to remove `pipe.to("cuda")` above
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
prompt = "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
image = load_image(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
video = pipe(image, prompt, use_dynamic_cfg=True)
export_to_video(video.frames[0], "output.mp4", fps=8)
<table align=center>
<tr>
<td align=center colspan=1><img src="https://github.com/user-attachments/assets/1c7c1d86-f97e-44dd-9b17-4fec2bbc2b1a" /></td>
<td align=center colspan=1><video src="https://github.com/user-attachments/assets/a115372e-c539-4ca0-b0d0-770d62862257"> Your browser does not support the video tag. </video></td>
</tr>
</table>
The code below demonstrates how to use the new video-to-video pipeline:
import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video
# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
input_video = load_video(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
)
prompt = (
    "An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
    "valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
    "the scene. The sky looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
    "moons, but the remainder of the scene is mostly realistic."
)
video = pipe(
    video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
<table align=center>
<tr>
<td align=center><video src="https://github.com/user-attachments/assets/bc9273ff-e459-42f9-af1e-c9b084b28f4d"> Your browser does not support the video tag. </video></td>
</tr>
</table>
Shoutout to @tin2tin for the awesome demonstration!
Refer to our documentation to learn more about it.
This patch release adds Diffusers support for the upcoming CogVideoX-5B release! The model weights will be available next week on the Hugging Face Hub at THUDM/CogVideoX-5b. Stay tuned for the release!
Additionally, we have implemented a VAE tiling feature, which reduces the memory requirement for CogVideoX models. With this update, the total memory requirement is now 12GB for CogVideoX-2B and 21GB for CogVideoX-5B (with CPU offloading). To enable this feature, simply call enable_tiling() on the VAE.
The code below shows how to generate a video with CogVideoX-5B
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
prompt = "Tracking shot,late afternoon light casting long shadows,a cyclist in athletic gear pedaling down a scenic mountain road,winding path with trees and a lake in the background,invigorating and adventurous atmosphere."
pipe = CogVideoXPipeline.from_pretrained(
"THUDM/CogVideoX-5b",
torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
https://github.com/user-attachments/assets/c2d4f7e8-ef86-4da6-8085-cb9f83f47f34
Refer to our documentation to learn more about it.
- imageio by @DN6 in #9094

Image taken from Lumina's GitHub.
This release features many new pipelines. Below, we provide a list:
Audio pipelines 🎼
Video pipelines 📹
Image pipelines 🎇
Be sure to check out the respective docs to know more about these pipelines. Some additional pointers are below for curious minds:
optimum.quanto. Check it out here.

| Without PAG | With PAG |
|---|---|
We already had community pipelines for PAG, but given its usefulness, we decided to make it a first-class citizen of the library. We have a central usage guide for PAG here, which should be the entry point for a user interested in understanding and using PAG for their use cases. We currently support the following pipelines with PAG:
- StableDiffusionPAGPipeline
- StableDiffusion3PAGPipeline
- StableDiffusionControlNetPAGPipeline
- StableDiffusionXLPAGPipeline
- StableDiffusionXLPAGImg2ImgPipeline
- StableDiffusionXLPAGInpaintPipeline
- StableDiffusionXLControlNetPAGPipeline
- PixArtSigmaPAGPipeline
- HunyuanDiTPAGPipeline
- AnimateDiffPAGPipeline
- KolorsPAGPipeline

If you're interested in helping us extend our PAG support for other pipelines, please check out this thread. Special thanks to Ahn Donghoon (@sunovivid), the author of PAG, for helping us with the integration and adding PAG support to SD3.
SparseCtrl introduces methods of controllability into text-to-video diffusion models leveraging signals such as line/edge sketches, depth maps, and RGB images by incorporating an additional condition encoder, inspired by ControlNet, to process these signals in the AnimateDiff framework. It can be applied to a diverse set of applications such as interpolation or video prediction (filling in the gaps between sequence of images for animation), personalized image animation, sketch-to-video, depth-to-video, and more. It was introduced in SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models.
There are two SparseCtrl-specific checkpoints and a Motion LoRA made available by the authors namely:
Scribble Interpolation Example:
<table> <tr> <td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png" alt="Image 1"></td> <td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png" alt="Image 2"></td> <td><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png" alt="Image 3"></td> </tr> <tr> <td colspan="3" style="text-align: center; vertical-align: middle;"><img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-sparsectrl-scribble-results.gif" alt="Image 4"></td> </tr> </table>

import torch
from diffusers import AnimateDiffSparseControlNetPipeline, AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image
device = "cuda"
motion_adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-scribble", torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to(device)
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
).to(device)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, beta_schedule="linear", algorithm_type="dpmsolver++", use_karras_sigmas=True)
pipe.load_lora_weights("guoyww/animatediff-motion-lora-v1-5-3", adapter_name="motion_lora")
pipe.fuse_lora(lora_scale=1.0)
prompt = "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality"
negative_prompt = "low quality, worst quality, letterboxed"
image_files = [
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png",
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png"
]
condition_frame_indices = [0, 8, 15]
conditioning_frames = [load_image(img_file) for img_file in image_files]
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    conditioning_frames=conditioning_frames,
    controlnet_conditioning_scale=1.0,
    controlnet_frame_indices=condition_frame_indices,
    generator=torch.Generator().manual_seed(1337),
).frames[0]
export_to_gif(video, "output.gif")
📜 Check out the docs here.
FreeNoise is a training-free method that allows extending the generative capabilities of pretrained video diffusion models beyond their existing context/frame limits.
Instead of initializing noises for all frames, FreeNoise reschedules a sequence of noises for long-range correlation and performs temporal attention over them using a window-based function. We have added FreeNoise to the AnimateDiff family of models in Diffusers, allowing them to generate videos beyond their default 32 frame limit.
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, EulerAncestralDiscreteScheduler
from diffusers.utils import export_to_gif
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16)
pipe.scheduler = EulerAncestralDiscreteScheduler(
beta_schedule="linear",
beta_start=0.00085,
beta_end=0.012,
)
pipe.enable_free_noise()
pipe.vae.enable_slicing()
pipe.enable_model_cpu_offload()
frames = pipe(
"An astronaut riding a horse on Mars.",
num_frames=64,
num_inference_steps=20,
guidance_scale=7.0,
decode_chunk_size=2,
).frames[0]
export_to_gif(frames, "freenoise-64.gif")
We have significantly refactored the loader classes associated with LoRA. Going forward, this will help in adding LoRA support for new pipelines and models. We now have a LoraBaseMixin class which is subclassed by the different pipeline-level LoRA loading classes such as StableDiffusionXLLoraLoaderMixin. This document provides an overview of the available classes.
Additionally, we have increased the coverage of methods within the PeftAdapterMixin class. This refactoring allows all the supported models to share common LoRA functionalities such as set_adapter(), add_adapter(), and so on.
To learn more details, please follow this PR. If you see any LoRA-related issues stemming from these refactors, please open an issue.
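The shape of the refactor can be pictured with a toy mock-up (method bodies and signatures here are invented for illustration; only the class names mirror the ones mentioned above): shared behavior lives in the base mixin, and pipeline-specific classes only add what differs.

```python
class LoraBaseMixin:
    # Shared, pipeline-agnostic LoRA bookkeeping (illustrative, not the real API).
    def __init__(self):
        self.adapters = {}
        self.active = []

    def set_adapters(self, names):
        missing = [n for n in names if n not in self.adapters]
        if missing:
            raise ValueError(f"unknown adapters: {missing}")
        self.active = list(names)

class StableDiffusionXLLoraLoaderMixin(LoraBaseMixin):
    # A pipeline-specific subclass only adds its loading logic;
    # activation/bookkeeping is inherited from the base mixin.
    def load_lora_weights(self, name, weights):
        self.adapters[name] = weights

loader = StableDiffusionXLLoraLoaderMixin()
loader.load_lora_weights("style", {"rank": 4})
loader.set_adapters(["style"])
```

The payoff is that adding LoRA support for a new pipeline only requires implementing the loading-specific parts, not re-implementing adapter management.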
We discovered that the implementation of fuse_qkv_projections() was broken. This was fixed in this PR. Additionally, this PR added the fusion support to AuraFlow and PixArt Sigma. A reasoning as to where this kind of fusion might be useful is available here.
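Why fusing the QKV projections helps can be seen in a tiny numeric sketch (plain Python, not the Diffusers code): the Q, K, and V projections all consume the same input, so their weight matrices can be stacked and executed as a single larger matmul, trading three small kernel launches for one.

```python
def matvec(w, x):
    # naive matrix-vector product
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

# Small made-up projection weights for illustration.
wq = [[1.0, 0.0], [0.0, 1.0]]
wk = [[2.0, 0.0], [0.0, 2.0]]
wv = [[0.5, 0.5], [0.5, -0.5]]
x = [3.0, 4.0]

# One fused projection: stack the rows of Wq, Wk, Wv into a single matrix.
fused = matvec(wq + wk + wv, x)
q, k, v = fused[:2], fused[2:4], fused[4:]
assert q == matvec(wq, x) and k == matvec(wk, x) and v == matvec(wv, x)
```

The fused result is sliced back into q, k, and v, so the attention math downstream is unchanged.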
- dreambooth_lora by @WenheLI in #8510
- HunyuanCombinedTimestepTextSizeStyleEmbedding by @yiyixuxu in #8761
- philosophy.md that were not reflected on #8294 by @mreraser in #8690
- attention_head_dim for JointTransformerBlock by @yiyixuxu in #8608
- LoraBaseMixin to promote reusability. by @sayakpaul in #8670
- LoraBaseMixin to promote reusability." by @sayakpaul in #8773
- push_to_hub trainers by @apolinario in #8697
- get_timestep_embedding by @alanhdu in #8811
- [Cont'd] Add the SDE variant of train_cm_ct_unconditional.py by @tolgacangoz in #8653
- resume_download from Hub related stuff by @sayakpaul in #8648
- ethical_guidelines.md that were not reflected on #8294 by @mreraser in #8914
- diffusers-cli env by @tolgacangoz in #8408
- LoraLoaderMixin to the inits by @sayakpaul in #8981
- accelerator based gradient accumulation for basic_example by @RandomGamingDev in #8966
- \s+$ by @tolgacangoz in #9008
- get_attention_scores as optional in get_attention_scores by @psychedelicious in #9075
- CLIPFeatureExtractor to CLIPImageProcessor and DPTFeatureExtractor to DPTImageProcessor by @tolgacangoz in #9002
- vae_batch_size to decode_chunk_size by @DN6 in #9110

The following contributors have made significant changes to the library over the last release:
- vae_batch_size to decode_chunk_size (#9110)
- HunyuanCombinedTimestepTextSizeStyleEmbedding (#8761)
- attention_head_dim for JointTransformerBlock (#8608)

import torch
from diffusers import StableDiffusion3ControlNetPipeline
from diffusers.models import SD3ControlNetModel, SD3MultiControlNetModel
from diffusers.utils import load_image
controlnet = SD3ControlNetModel.from_pretrained("InstantX/SD3-Controlnet-Canny", torch_dtype=torch.float16)
pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
"stabilityai/stable-diffusion-3-medium-diffusers", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.to("cuda")
control_image = load_image("https://huggingface.co/InstantX/SD3-Controlnet-Canny/resolve/main/canny.jpg")
prompt = "A girl holding a sign that says InstantX"
image = pipe(prompt, control_image=control_image, controlnet_conditioning_scale=0.7).images[0]
image.save("sd3.png")
📜 Refer to the official docs here to learn more about it.
Thanks to @haofanwang @wangqixun from the @ResearcherXman team for contributing this pipeline!
We now support all available single-file checkpoints for SD3 in Diffusers! To load the single-file checkpoint with T5:
```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()

image = pipe("a picture of a cat holding a sign that says hello world").images[0]
image.save("sd3-single-file-t5-fp8.png")
```
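Checkpoints that ship without the T5 encoder can reportedly be loaded by dropping the third text encoder. This is a sketch rather than a verified recipe: it assumes `from_single_file` accepts `text_encoder_3=None`, and the caller supplies whichever non-T5 checkpoint URL they are using:

```python
def load_sd3_single_file_no_t5(checkpoint_url):
    """Sketch: load an SD3 single-file checkpoint without the T5 encoder.

    Skipping text_encoder_3 trades some prompt fidelity for a much
    smaller memory footprint.
    """
    import torch
    from diffusers import StableDiffusion3Pipeline

    pipe = StableDiffusion3Pipeline.from_single_file(
        checkpoint_url,
        torch_dtype=torch.float16,
        text_encoder_3=None,
    )
    pipe.enable_model_cpu_offload()
    return pipe
```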
We increased the default sequence length for the T5 text encoder from a maximum of 77 tokens to 256! It can be adjusted to accept fewer or more tokens by setting `max_sequence_length`, up to a maximum of 512. Keep in mind that longer sequences require additional resources and result in longer generation times; this effect is particularly noticeable during batch inference.
```python
prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. This imaginative creature features the distinctive, bulky body of a hippo, but with a texture and appearance resembling a golden-brown, crispy waffle. The creature might have elements like waffle squares across its skin and a syrup-like sheen. It's set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, possibly including oversized utensils or plates in the background. The image should evoke a sense of playful absurdity and culinary fantasy."

image = pipe(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
```
(Image comparison omitted: default vs. `max_sequence_length=256` vs. `max_sequence_length=512`.)
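One hypothetical way to reproduce such a comparison is to sweep `max_sequence_length` while holding every other argument fixed. The helper below only assembles the call kwargs; the actual `pipe(**kwargs)` calls are left to the reader:

```python
def sweep_kwargs(prompt, lengths=(77, 256, 512)):
    """Build one generation-kwargs dict per max_sequence_length setting.

    512 is the documented cap for SD3, so larger values are rejected.
    """
    assert all(n <= 512 for n in lengths), "SD3 caps max_sequence_length at 512"
    return [
        {
            "prompt": prompt,
            "negative_prompt": "",
            "num_inference_steps": 28,
            "guidance_scale": 4.5,
            "max_sequence_length": n,
        }
        for n in lengths
    ]
```

Running `pipe(**kw)` for each entry then yields one image per setting for a side-by-side comparison.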
- accelerate isn't installed. by @DN6 in #8462
This release emphasizes Stable Diffusion 3, Stability AI’s latest iteration of the Stable Diffusion family of models. It was introduced in Scaling Rectified Flow Transformers for High-Resolution Image Synthesis by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach.
As the model is gated, before using it with diffusers, you first need to go to the Stable Diffusion 3 Medium Hugging Face page, fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate.
```shell
huggingface-cli login
```
The code below shows how to perform text-to-image generation with SD3:
```python
import torch
from diffusers import StableDiffusion3Pipeline

pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

image = pipe(
    "A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image
```
Refer to our documentation to learn about all the optimizations you can apply to SD3, as well as the image-to-image pipeline.
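For the image-to-image pipeline, a minimal sketch could look like the following. It assumes the `StableDiffusion3Img2ImgPipeline` class and a `strength` parameter that behaves as in earlier Stable Diffusion image-to-image pipelines; model loading is kept inside a function so nothing is downloaded at import time:

```python
def sd3_img2img(prompt, init_image_url, strength=0.6):
    """Sketch: SD3 image-to-image.

    Lower strength preserves more of the init image; higher strength
    gives the prompt more influence.
    """
    import torch
    from diffusers import StableDiffusion3Img2ImgPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusion3Img2ImgPipeline.from_pretrained(
        "stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16
    ).to("cuda")
    init_image = load_image(init_image_url)
    return pipe(prompt, image=init_image, strength=strength).images[0]
```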
Additionally, we support DreamBooth + LoRA fine-tuning of Stable Diffusion 3 through rectified flow. Check out this directory for more details.
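Once DreamBooth + LoRA fine-tuning has finished, the resulting weights can be attached to a pipeline with `load_lora_weights`. The path below is a placeholder for wherever your training run saved its weights (a local directory or a Hub repo id):

```python
def load_finetuned_lora(pipe, lora_path):
    """Sketch: attach DreamBooth-LoRA weights to an SD3 pipeline.

    `lora_path` is a placeholder; any loaded StableDiffusion3Pipeline
    that supports LoRA loading can be passed as `pipe`.
    """
    pipe.load_lora_weights(lora_path)
    return pipe
```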