Initial release of 🧨 Diffusers
Introducing Hugging Face's new library for diffusion models.
Diffusion models have proved very effective at generative synthesis, even beating GANs on image quality. As a result, they have gained traction in the machine learning community and play an important role in systems like DALL-E 2 or Imagen, which generate photorealistic images from text prompts.
While the most prominent successes of diffusion models have been in the computer vision community, these models have also achieved remarkable results in other domains.
The goals of 🧨 Diffusers are outlined in the sections below.
Quickstart:
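As a minimal sketch of the quickstart (the model identifier below is illustrative, and the exact output format of the pipeline call has changed across versions), after running `pip install diffusers` you can load a pretrained system from the Hub and sample from it:

```python
# Hedged sketch of library usage; model id "google/ddpm-celebahq-256" is an example.
from diffusers import DiffusionPipeline

# Download a pretrained end-to-end diffusion system from the Hugging Face Hub.
ddpm = DiffusionPipeline.from_pretrained("google/ddpm-celebahq-256")

# Calling the pipeline runs the full denoising loop and returns generated images.
output = ddpm()
```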
Diffusers aims to be a modular toolbox for diffusion techniques, with a focus on the following categories:
Inference pipelines are a collection of end-to-end diffusion systems that can be used out of the box. The goal is for them to stay as close as possible to their original implementations, and they can include components from other libraries (such as text encoders).
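To make the pipeline abstraction concrete, here is a toy, framework-free sketch (not the library's actual code) of how a pipeline composes a denoising model with a scheduler into one end-to-end system:

```python
class ToyScheduler:
    """Toy scheduler: exposes the timesteps and a step() update rule."""
    def __init__(self, num_steps):
        self.timesteps = list(range(num_steps - 1, -1, -1))

    def step(self, model_output, sample):
        # Nudge the sample a small step using the model's prediction.
        return sample - 0.1 * model_output

class ToyPipeline:
    """An end-to-end system: a model plus a scheduler, usable out of the box."""
    def __init__(self, model, scheduler):
        self.model = model
        self.scheduler = scheduler

    def __call__(self, sample):
        # Unroll the diffusion loop: one model call + one scheduler step per timestep.
        for t in self.scheduler.timesteps:
            model_output = self.model(sample, t)
            sample = self.scheduler.step(model_output, sample)
        return sample

# A stand-in "model" whose prediction is just the sample itself.
pipeline = ToyPipeline(model=lambda x, t: x, scheduler=ToyScheduler(num_steps=10))
result = pipeline(1.0)  # the sample decays toward 0 over 10 steps
```

The point of the sketch is the separation of concerns: the model predicts, the scheduler decides how to use that prediction, and the pipeline only wires the loop together.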
The original release contains the following pipelines:
We are currently working on enabling other pipelines for different modalities. The following pipelines are expected to land in a subsequent release:
The goal is for each scheduler to provide one or more step() functions that are called iteratively to unroll the diffusion loop during the forward pass. The schedulers are framework-agnostic, but offer conversion methods for easy use with PyTorch utilities.
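To illustrate what a step() function computes, here is a simplified, noise-free sketch of a DDPM-style reverse step in plain Python (the math follows the DDPM posterior mean; this is not the library's implementation, and the variance schedule below is made up for the example):

```python
import math

def ddpm_step(x_t, eps_pred, t, betas):
    """One noise-free reverse-diffusion step of a DDPM-style sampler.

    x_t:      current noisy sample (a float here; a tensor in practice)
    eps_pred: the model's noise prediction at timestep t
    t:        current timestep index
    betas:    the forward-process variance schedule
    """
    alpha_t = 1.0 - betas[t]
    # alpha-bar: cumulative product of alphas up to and including t
    alpha_bar_t = math.prod(1.0 - b for b in betas[: t + 1])
    # posterior mean of x_{t-1} given x_t and the predicted noise
    coeff = betas[t] / math.sqrt(1.0 - alpha_bar_t)
    return (x_t - coeff * eps_pred) / math.sqrt(alpha_t)

# Unrolling the diffusion loop: call step() iteratively from t = T-1 down to 0.
betas = [0.01 * (i + 1) for i in range(10)]  # toy linear schedule
x = 1.0
for t in reversed(range(10)):
    x = ddpm_step(x, eps_pred=0.0, t=t, betas=betas)
```

In the real samplers, step() also adds scaled noise at each timestep and the model's eps_pred comes from a trained UNet; the loop structure, however, is exactly this iterative unrolling.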
The initial release contains the following schedulers:
Models are hosted in the src/diffusers/models folder.
For the initial release, you'll get to see a few building blocks, as well as some resulting models:
UNet2DModel is a version of the UNet architecture used in recent diffusion papers. It is the unconditional version of the UNet model, in contrast to the conditional version that follows below.

UNet2DConditionModel is similar to UNet2DModel, but is conditional: it adds cross-attention blocks to its downsample and upsample layers, and these cross-attention blocks can be fed by other models. An example of a pipeline using a conditional UNet model is the latent diffusion pipeline.

AutoencoderKL and VQModel are still experimental models that are prone to breaking changes in the near future. However, they can already be used as part of the latent diffusion pipelines.

The first release contains a dataset-agnostic unconditional example and a training notebook:

The train_unconditional.py example trains a DDPM UNet model on a dataset of your choice.

This library concretizes previous work by many different authors and would not have been possible without their great research and implementations. We'd like to thank, in particular, the following implementations, which have helped us in our development and without which the API could not have been as polished as it is today:
We also want to thank @heejkoo for the very helpful overview of papers, code and resources on diffusion models, available here.