---
name: Accelerate
slug: accelerate
type: github
source_url: https://github.com/huggingface/accelerate
organization: Hugging Face
organization_slug: hugging-face
total_releases: 71
latest_version: v1.13.0
latest_date: 2026-03-04
last_updated: 2026-04-18
tracking_since: 2021-03-05
canonical: https://releases.sh/hugging-face/accelerate
organization_url: https://releases.sh/hugging-face
---

<Summary type="rolling" window-days="90" release-count="1">
Accelerate expanded hardware support and cleaned up its device abstraction layer. AWS Neuron (Trainium/Inferentia) integration shipped alongside removal of the IPEX dependency, spawn-based process launching for XPU, and more device-agnostic XPU code. FSDP2 picked up fixes for gradient upcasting, tied embedding errors, and optimizer state loading.
</Summary>

<Summary type="monthly" period="March 2026" release-count="1">
Accelerate expanded hardware support and cleaned up dependencies in March. AWS Neuron integration arrived for Trainium/Inferentia devices, while IPEX was removed and XPU code was refactored to be device-agnostic. FSDP2 received multiple stability fixes including gradient upcasting improvements, tied embedding error handling, and optimizer state loading.
</Summary>

<Release version="v1.13.0" date="March 4, 2026" published="2026-03-04T20:15:55.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.13.0">
## v1.13.0: Neuron support, IPEX removal, and distributed training fixes

## AWS Neuron support
We now have support for AWS Neuron (Trainium/Inferentia) devices. Thanks to @michaelbenayoun for adding this.
- Neuron integration by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3935

## XPU Improvements
We've removed the IPEX dependency and improved device-agnostic code for XPU.
- using spawn instead of fork for XPU device by @kaixuanliu in https://github.com/huggingface/accelerate/pull/3884
- Remove ipex by @yao-matrix in https://github.com/huggingface/accelerate/pull/3883
- enhance new codes to XPU, and make them be device agnostic by @yao-matrix in https://github.com/huggingface/accelerate/pull/3890
- Fix KMP_AFFINITY incorrectly set for non-CPU training by @hexfaker in https://github.com/huggingface/accelerate/pull/3912

## FSDP2 Improvements
We've added a bunch of important fixes for FSDP2 users: upcasting only grad-requiring params, better tied embedding errors, DCP optimizer loading, bf16 optimizer step crash fix, and torch < 2.7.0 compatibility.

- Upcast FSDP2 parameters only if requires_grad by @ojh31 in https://github.com/huggingface/accelerate/pull/3848
- Fix FSDP2 tied embedding errors with targeted ValueError guidance by @amanzoni1 in https://github.com/huggingface/accelerate/pull/3878
- bug: fsdp cannot load optimizer state using dcp by @flymin in https://github.com/huggingface/accelerate/pull/3904
- fix crash in optimizer.step when fsdp2 is enabled and model is bfloat16 by @sywangyi in https://github.com/huggingface/accelerate/pull/3905
- Fix FSDP2 crash with ignored_params on torch < 2.7.0 by @Mr-Neutr0n in https://github.com/huggingface/accelerate/pull/3924
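Among these, the upcasting fix is the easiest to picture: only parameters that require gradients are promoted to full precision, while frozen weights keep their lower-precision dtype. A torch-free sketch of that filtering (the `Param` stand-in and helper name are ours, not Accelerate's API):

```python
from dataclasses import dataclass

@dataclass
class Param:
    """Minimal stand-in for a model parameter's dtype/requires_grad flags."""
    dtype: str
    requires_grad: bool

def upcast_trainable(params, target="float32"):
    # Promote only grad-requiring params; frozen params keep their dtype,
    # which is the behavior the FSDP2 fix restores.
    for p in params:
        if p.requires_grad and p.dtype != target:
            p.dtype = target
    return params
```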

## DeepSpeed Sequence Parallelism
We've added several fixes to the DeepSpeed + Sequence Parallelism integration introduced in v1.12.0, including evaluation support during SP training and proper process group handling.
- [SP] fix loss computation example by @kashif in https://github.com/huggingface/accelerate/pull/3858
- [SP and CP] error out if both CP and SP enabled by @kashif in https://github.com/huggingface/accelerate/pull/3862
- DeepSpeed has its own process group by @kashif in https://github.com/huggingface/accelerate/pull/3916
- [Deepspeed] skip device mesh creation when deepspeed and sp_size >1 by @kashif in https://github.com/huggingface/accelerate/pull/3914
- Enable evaluation during deepspeed Sequence Parallel by @jp1924 in https://github.com/huggingface/accelerate/pull/3917

## FP8
We've enhanced FP8 training. Thanks @shimizust for fixing torchao support.
- Fix FP8 torchao default config with padding and FSDP2 all-gather support by @shimizust in https://github.com/huggingface/accelerate/pull/3831
- Fix execution with Transformer Engine by @ksivaman in https://github.com/huggingface/accelerate/pull/3852
- add MS-AMP deprecation warnings by @neha222222 in https://github.com/huggingface/accelerate/pull/3857

## Performance
Accelerate now imports faster by deferring heavy dependencies, and torch.compile hooks are disabled lazily.
- Faster import by @SunMarc in https://github.com/huggingface/accelerate/pull/3953
- lazy compile disable by @SunMarc in https://github.com/huggingface/accelerate/pull/3947
- Disable hook compile by @SunMarc in https://github.com/huggingface/accelerate/pull/3888
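For context on the faster-import change: a common way to defer heavy dependencies is a module-level `__getattr__` (PEP 562) that only imports a submodule on first attribute access. A generic sketch of the pattern, not Accelerate's actual implementation (here `json` stands in for a heavy dependency):

```python
import importlib
import sys
import types

def make_lazy_module(name, lazy_map):
    """Build a module whose listed attributes import their target lazily (PEP 562 style)."""
    mod = types.ModuleType(name)

    def __getattr__(attr):
        if attr in lazy_map:
            target = importlib.import_module(lazy_map[attr])
            setattr(mod, attr, target)  # cache so later lookups bypass this hook
            return target
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    mod.__getattr__ = __getattr__
    sys.modules[name] = mod
    return mod

lazy = make_lazy_module("lazy_demo", {"js": "json"})  # nothing heavy imported yet
```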

## Minor fixes
- Allow non-Tensor values in a batch with dispatch_batches=True by @tomaarsen in https://github.com/huggingface/accelerate/pull/3850
- fix module and optimizer parameter mismatch before prepare_tp_ by @naomili0924 in https://github.com/huggingface/accelerate/pull/3845
- Fix KeyError in extract_model_from_parallel for partial torch.compile by @amanzoni1 in https://github.com/huggingface/accelerate/pull/3881
- Fix hf_device_map device index comparison in prepare_model by @rezaqorbani in https://github.com/huggingface/accelerate/pull/3895
- Fix StatefulDataLoader KeyError with num_workers > 0 by @veeceey in https://github.com/huggingface/accelerate/pull/3931
- Fix stateful dataloader DDP by @SunMarc in https://github.com/huggingface/accelerate/pull/3952
- Fix: Remove duplicate W&B initialization in offline mode by @shantanugupta2004 in https://github.com/huggingface/accelerate/pull/3886
- Avoid using nvidia-smi on a CPU-only Colab instance by @FlorianVal in https://github.com/huggingface/accelerate/pull/3872
- Fix logging logic when in_order is set to True by @yuxinyuan in https://github.com/huggingface/accelerate/pull/3280
- Fix cpu offload check by @SunMarc in https://github.com/huggingface/accelerate/pull/3946
- fix bug when both cpu_ram_efficient_loading and cpu_offload are enabled by @kaixuanliu in https://github.com/huggingface/accelerate/pull/3910
- Fix async compatibility across python versions by @SunMarc in https://github.com/huggingface/accelerate/pull/3901
- fix tp only bug by @sywangyi in https://github.com/huggingface/accelerate/pull/3908
- fix parallelism_config None error by @jp1924 in https://github.com/huggingface/accelerate/pull/3927
- Np parall fix by @sywangyi in https://github.com/huggingface/accelerate/pull/3900
- change the default value of fsdp_min_num_params to int by @CodeMan62 in https://github.com/huggingface/accelerate/pull/3902
- Fix mutable default in Megatron init and IndexError on empty ModuleList by @jashshah999 in https://github.com/huggingface/accelerate/pull/3944
- Prepare TP fix by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3945
- feat: added fine tuning example focused on TPUs by @tengomucho in https://github.com/huggingface/accelerate/pull/3847
- Remove 8bit force hook for bnb by @SunMarc in https://github.com/huggingface/accelerate/pull/3907
- docs: flag MS-AMP as deprecated in low-precision training guides by @ManasVardhan in https://github.com/huggingface/accelerate/pull/3929
- fix: correct typo 'guarentee' to 'guarantee' by @thecaptain789 in https://github.com/huggingface/accelerate/pull/3922
- Updating support of Megatron-LM by @pengdurice in https://github.com/huggingface/accelerate/pull/3842
- Update support of Megatron-LM PR 2 by @pengdurice in https://github.com/huggingface/accelerate/pull/3887
- Fix RNG state setting for HPU by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3936
- fix: load the HPU RNG state by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3937
</Release>

<Release version="v1.12.0" date="November 21, 2025" published="2025-11-21T12:47:55.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.12.0">
## v1.12.0: Deepspeed Ulysses/ALST

## Deepspeed Ulysses/ALST integration

Deepspeed Ulysses/ALST is an efficient way of training on long sequences by employing sequence parallelism and attention head parallelism. You can learn more about this technology in this paper https://arxiv.org/abs/2506.13996 or this deepspeed tutorial https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/.

<img width="2368" height="1250" alt="0d8bd9e0" src="https://github.com/user-attachments/assets/b94e90c9-4368-4711-ad57-58de3c714ebc" />


To enable Deepspeed Ulysses, you first need to create a `ParallelismConfig` and set the `sp`-related arguments:

```python
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)
```
Then, you need to make sure to compute the loss correctly, as described in our [docs](https://huggingface.co/docs/accelerate/main/en/concept_guides/sequence_parallelism):

```python
...
losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
good_tokens = (shift_labels != -100).view(-1).sum()
good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
total_loss = sum(
    losses_per_rank[rank] * good_tokens_per_rank[rank]
    for rank in range(sp_world_size)
    if good_tokens_per_rank[rank] > 0
)
total_good_tokens = sum(good_tokens_per_rank)
loss = total_loss / max(total_good_tokens, 1)
```
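The gather-and-weight step above is just a token-weighted average of per-rank losses, skipping ranks that contributed no valid tokens. A torch-free sketch of the arithmetic (helper name is ours):

```python
def sp_weighted_loss(losses_per_rank, good_tokens_per_rank):
    """Token-weighted average of per-rank losses, skipping ranks with no valid tokens."""
    total_loss = sum(
        loss * n for loss, n in zip(losses_per_rank, good_tokens_per_rank) if n > 0
    )
    total_good_tokens = sum(good_tokens_per_rank)
    return total_loss / max(total_good_tokens, 1)
```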

Thanks to @S1ro1 for starting this work and to @stas00 for finishing it. Also thanks to @kashif for adding docs and reviewing/testing this PR!

This feature will also be available in the HF Trainer thanks to this PR from @stas00: https://github.com/huggingface/transformers/pull/41832


## Minor changes

* Remove warning for `cpu_ram_efficient_loading` by @SunMarc in https://github.com/huggingface/accelerate/pull/3816
* update typo in bnb quantisation 4bit flag docstring by @hbraith in https://github.com/huggingface/accelerate/pull/3828
* ArXiv -> HF Papers by @qgallouedec in https://github.com/huggingface/accelerate/pull/3834
* Fix typo in broadcast_object_list docstring by @wsntxxn in https://github.com/huggingface/accelerate/pull/3823
* [Bug] Update torch.optim.Optimizer parameter states after tensor parallelism by @naomili0924 in https://github.com/huggingface/accelerate/pull/3835
* use self hosted runner by @SunMarc in https://github.com/huggingface/accelerate/pull/3841
* device type helper by @kashif in https://github.com/huggingface/accelerate/pull/3843

## New Contributors
* @hbraith made their first contribution in https://github.com/huggingface/accelerate/pull/3828
* @wsntxxn made their first contribution in https://github.com/huggingface/accelerate/pull/3823
* @naomili0924 made their first contribution in https://github.com/huggingface/accelerate/pull/3835

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.11.0...v1.12.0
</Release>

<Release version="v1.11.0" date="October 20, 2025" published="2025-10-20T16:08:58.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.11.0">
## v1.11.0: TE MXFP8, FP16/BF16 with MPS, Python 3.10 

## TE MXFP8 support

We've added support for MXFP8 in our TransformerEngine integration. To use it, you need to set `use_mxfp8_block_scaling` in `fp8_config`. See the NVIDIA docs [here](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling).

* Add support for TE MXFP8 recipe in accelerate by @pstjohn in https://github.com/huggingface/accelerate/pull/3688
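For illustration only, a hypothetical `fp8_config` fragment: the `use_mxfp8_block_scaling` key comes from these notes, but the surrounding keys and values are our assumptions, not a documented schema.

```python
# hypothetical fp8_config fragment; only use_mxfp8_block_scaling is from the notes
fp8_config = {
    "backend": "TE",                  # TransformerEngine backend (assumed key)
    "use_mxfp8_block_scaling": True,  # enables the MXFP8 block-scaling recipe
}
```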

## FP16/BF16 Training for MPS devices

BF16 and FP16 support for MPS devices is finally here. You can now pass `mixed_precision="fp16"` or `mixed_precision="bf16"` when training on a Mac (`fp16` requires torch 2.8 and `bf16` requires torch 2.6).
* Add bf16/fp16 support for amp with mps device by @SunMarc in https://github.com/huggingface/accelerate/pull/3373
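The version requirements above can be encoded in a small helper; this function is ours, purely illustrative of the constraints stated in the notes:

```python
def mps_mixed_precision_supported(torch_version: str, mode: str) -> bool:
    """Minimum torch version per MPS mixed-precision mode, per the release notes:
    fp16 needs torch >= 2.8, bf16 needs torch >= 2.6."""
    minimums = {"fp16": (2, 8), "bf16": (2, 6)}
    major, minor = (int(part) for part in torch_version.split(".")[:2])
    return (major, minor) >= minimums[mode]
```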

## FSDP updates

The following PRs add support for `ignored_params` and `no_sync()` in FSDPv2, respectively:
* feat: add ignored_params support for fsdp2 by @kmehant in https://github.com/huggingface/accelerate/pull/3731
* fix: model.set_requires_gradient_sync(False) should be called to turn off gradient synchronization in FSDP2 by @EquationWalker in https://github.com/huggingface/accelerate/pull/3762

Mixed precision can now be passed as a dtype string via the accelerate CLI flag or `fsdp_config` in the accelerate config file:
* feat: allow mixed precision policy as dtype by @kmehant in https://github.com/huggingface/accelerate/pull/3751

## Nd-parallel updates

Some minor updates concerning nd-parallelism. 

* Context Parallelism docs typos fixed by @sergiopaniego in https://github.com/huggingface/accelerate/pull/3761
* Feat: add to_json by @S1ro1 in https://github.com/huggingface/accelerate/pull/3743
* make torch_native_parallelism examples device agnostic by @yao-matrix in https://github.com/huggingface/accelerate/pull/3759
* [ND Parallel] Update examples, cleanup by @S1ro1 in https://github.com/huggingface/accelerate/pull/3737

## Bump to Python 3.10 

We've dropped support for Python 3.9 as it reached end of life in October.
* Bump to python3.10 + update linter by @SunMarc in https://github.com/huggingface/accelerate/pull/3809

### Lots of minor fixes: 
* fix: CPU RAM efficient loading for nd or HSDP parallelisms by @kmehant in https://github.com/huggingface/accelerate/pull/3740
* xpu INT64 all_gather issue fixed in 2.9 by @yao-matrix in https://github.com/huggingface/accelerate/pull/3756
* Specify device_ids in torch.distributed.barrier for PartialState by @qgallouedec in https://github.com/huggingface/accelerate/pull/3744
* fix: specify device for process_tensor in example usage by @qgallouedec in https://github.com/huggingface/accelerate/pull/3755
* Lower complexity of get_balanced_memory by adding a set by @SamuelBarryCS in https://github.com/huggingface/accelerate/pull/3776
* Fix (skip) cuda cache flush when origin device is `cpu` and offloaded to `meta` by @Qubitium in https://github.com/huggingface/accelerate/pull/3796
* Fix convert LayerNorm without bias to fp8 by @mjun0812 in https://github.com/huggingface/accelerate/pull/3725
* Add optional typing by @cyyever in https://github.com/huggingface/accelerate/pull/3769
* refactor: Use `with` in `Accelerator.autocast()` instead of `__enter__()` and `__exit__()` for more elegant style by @EquationWalker in https://github.com/huggingface/accelerate/pull/3767
* switch XPU ccl backend to torch-builtin xccl in test_zero3_integration by @yao-matrix in https://github.com/huggingface/accelerate/pull/3773
* fix FSDP2 test case failure on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3771
* Fix tests by @SunMarc in https://github.com/huggingface/accelerate/pull/3722
* Protect import for device_mesh  by @SunMarc in https://github.com/huggingface/accelerate/pull/3742
* Fix `SWANLAB_MODE` by @SunMarc in https://github.com/huggingface/accelerate/pull/3808
* Fix tracking swanlab by @SunMarc in https://github.com/huggingface/accelerate/pull/3810
* refactor: nit change for get_parameters_from_modules (code debt) by @kmehant in https://github.com/huggingface/accelerate/pull/3815
* Remove deprecated FindTiedParametersResult by @cyyever in https://github.com/huggingface/accelerate/pull/3786
* remove mlflow from testing  by @SunMarc in https://github.com/huggingface/accelerate/pull/3783
* enable 2 model hook ut cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3774
* Added Tip for better rendering by @sergiopaniego in https://github.com/huggingface/accelerate/pull/3781
* Fix typos by @cyyever in https://github.com/huggingface/accelerate/pull/3753
* fix: torch_npu import error in some envs by @yanyongyu in https://github.com/huggingface/accelerate/pull/3764
* Fix: typo makes tests fail by @S1ro1 in https://github.com/huggingface/accelerate/pull/3765
* fix Muti node CUDA error: invalid device ordinal #3775 by @RicardoDominguez in https://github.com/huggingface/accelerate/pull/3779
* use reset_peak_memory_stats on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3772


## New Contributors
* @mjun0812 made their first contribution in https://github.com/huggingface/accelerate/pull/3725
* @sergiopaniego made their first contribution in https://github.com/huggingface/accelerate/pull/3761
* @EquationWalker made their first contribution in https://github.com/huggingface/accelerate/pull/3762
* @yanyongyu made their first contribution in https://github.com/huggingface/accelerate/pull/3764
* @RicardoDominguez made their first contribution in https://github.com/huggingface/accelerate/pull/3779
* @SamuelBarryCS made their first contribution in https://github.com/huggingface/accelerate/pull/3776
* @Qubitium made their first contribution in https://github.com/huggingface/accelerate/pull/3796

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.10.1...v1.11.0
</Release>

<Release version="v1.10.1" date="August 25, 2025" published="2025-08-25T13:57:15.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.10.1">
## v1.10.1: Patchfix

- Feat: add to_json by @S1ro1 in https://github.com/huggingface/accelerate/pull/3743
- Protect import for device_mesh by @SunMarc in https://github.com/huggingface/accelerate/pull/3742

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.10.0...v1.10.1
</Release>

<Release version="v1.10.0" date="August 7, 2025" published="2025-08-07T13:39:44.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.10.0">
## v1.10.0: N-D Parallelism

# N-D Parallelism

Training large models across multiple GPUs can be complex, especially when combining [different parallelism strategies](https://huggingface.co/spaces/nanotron/ultrascale-playbook) (e.g., TP, CP, DP). To simplify this process, we've collaborated with [Axolotl](https://github.com/axolotl-ai-cloud/axolotl/) to introduce an easy-to-use integration that allows you to apply any combination of parallelism strategies directly in your training script. Just pass a `ParallelismConfig` specifying the size of each parallelism type—it's that simple.
Learn more about how it works in our latest [blogpost](https://github.com/huggingface/blog/pull/3006).

```python
parallelism_config = ParallelismConfig(
    dp_shard_size=2,
    dp_replicate_size=2,
    cp_size=2,
    tp_size=2,
)
accelerator = Accelerator(
    parallelism_config=parallelism_config,
    ...
)
model = AutoModelForCausalLM.from_pretrained("your-model-name", device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)
```
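One practical constraint worth remembering: the number of launched processes must equal the product of all parallel dimensions, so the config above needs 2 × 2 × 2 × 2 = 16 processes. A tiny helper (ours, not an Accelerate API) makes that check explicit:

```python
from math import prod

def required_world_size(dp_shard=1, dp_replicate=1, cp=1, tp=1):
    """An N-D parallel layout flattens to world_size == product of all parallel dims."""
    return prod((dp_shard, dp_replicate, cp, tp))
```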

* Parallelism config + TP + HSDP +  BYODM (Bring Your Own Device Mesh) by @SalmanMohammadi in https://github.com/huggingface/accelerate/pull/3682
* Feat: context parallel v2.0 by @S1ro1 in https://github.com/huggingface/accelerate/pull/3700
* set default submesh_tp_size to prevent unset local variable error by @winglian in https://github.com/huggingface/accelerate/pull/3687
* Add Parallelism getter property to Accelerator class by @WoosungMyung in https://github.com/huggingface/accelerate/pull/3703
* Fix: prepare works even if nothing except tp specified (rare) by @S1ro1 in https://github.com/huggingface/accelerate/pull/3707
* Set parallelism_config in constructor due to Trainer reset of State by @winglian in https://github.com/huggingface/accelerate/pull/3713
* Fix: tp size wouldn't read from env by @S1ro1 in https://github.com/huggingface/accelerate/pull/3716
* Remove `ParallelismConfig` from `PartialState` by @SunMarc in https://github.com/huggingface/accelerate/pull/3720


# FSDP improvements

We've fixed the ignored modules attribute. With this, it is now possible to train PEFT models whose MoE layers contain `q_proj` and `v_proj` parameters. This is especially important for fine-tuning the `gpt-oss` model.

* ENH: Allow FSDP ignored modules to be regex by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3698
* TST Add test for FSDP ignored_modules as str by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3719

# Minor improvements
* feature: CpuOffload pre_forward don't attempt to move if already on device by @JoeGaffney in https://github.com/huggingface/accelerate/pull/3695
* Fix: Ensure environment variable values are case-insensitive in Accelerate by @jp1924 in https://github.com/huggingface/accelerate/pull/3712
* remove use_ipex by @SunMarc in https://github.com/huggingface/accelerate/pull/3721

# New Contributors
* @SalmanMohammadi made their first contribution in https://github.com/huggingface/accelerate/pull/3682
* @WoosungMyung made their first contribution in https://github.com/huggingface/accelerate/pull/3703
* @jp1924 made their first contribution in https://github.com/huggingface/accelerate/pull/3712
* @JoeGaffney made their first contribution in https://github.com/huggingface/accelerate/pull/3695

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.9.0...v1.10.0
</Release>

<Release version="v1.9.0" date="July 16, 2025" published="2025-07-16T16:35:54.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.9.0">
## v1.9.0: Trackio support, Model loading speedup, Minor distributed improvements

# Trackio tracker support
We've added support for trackio, a lightweight, 💯 free experiment tracking Python library built on top of 🤗 Datasets and Spaces.

![Screen Recording 2025-06-11 at 5 39 32 PM](https://github.com/user-attachments/assets/5cf12286-54e7-4119-8a20-88c2cbd37ab6)

Main features are: 
- *Local-first* design: dashboard runs locally by default. You can also host it on Spaces by specifying a `space_id`.
- Persists logs locally (or in a private Hugging Face Dataset)
- Visualize experiments with a Gradio dashboard locally (or on Hugging Face Spaces)
- Everything here, including hosting on Hugging Face Spaces, is **free**!

To use it with accelerate, you need to set `log_with` and initialize the trackers:
```python
accelerator = Accelerator(log_with="trackio")
config = {"learning_rate": 0.001, "batch_size": 32}
# init_kwargs in order to host the dashboard on Spaces
init_kwargs = {"trackio": {"space_id": "hf_username/space_name"}}
accelerator.init_trackers("example_project", config=config, init_kwargs=init_kwargs)
```
Thanks @pcuenca for the integration ! 
* trackio by @pcuenca in https://github.com/huggingface/accelerate/pull/3669

## Model loading speedup when relying on `set_module_tensor_to_device`
Setting tensors while clearing the cache is very slow, so we added a `clear_device` option to disable it.
Another small optimization is using `non_blocking` everywhere and syncing just before returning control to the user. This makes loading slightly faster.
* Speedup model loading by 4-5x in Diffusers ⚡ by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3674

## FSDP, DeepSpeed, FP8 minor improvements

* Add support for e5e2 and default to hybrid when launcher is used by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3640
* Fix FP8 tests, enable FP8 to be used without direct `Accelerator()` configuring by @pstjohn in https://github.com/huggingface/accelerate/pull/3677
* Bunch of FSDP improvements by @S1ro1 in https://github.com/huggingface/accelerate/pull/3671
* Fix: properly error when DDP + Dtensor model by @S1ro1 in https://github.com/huggingface/accelerate/pull/3629
* Fix fsdp2 example typo by @shimizust in https://github.com/huggingface/accelerate/pull/3657
* Added a check in no_sync() to avoid errors when using deepspeed zero2/3 by @xliu0105 in https://github.com/huggingface/accelerate/pull/3656

## 🚨🚨🚨 Breaking changes 🚨🚨🚨
`find_executable_batch_size()` no longer halves the batch size after every OOM. Instead, we multiply the batch size by 0.9. This should help users avoid wasting GPU capacity.

* “Stop Halving My Batch!” · Default back-off 0.5 → 0.9 by @SunMarc in https://github.com/huggingface/accelerate/pull/3684
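The new back-off behaves like the retry loop sketched below. This is a simplified stand-in, not the real API: Accelerate's `find_executable_batch_size` is a decorator and catches CUDA OOM errors, which `MemoryError` models here.

```python
def find_executable_batch_size_sketch(train_step, starting_batch_size=128, factor=0.9):
    """Retry train_step with progressively smaller batches until it stops OOM-ing."""
    batch_size = starting_batch_size
    while True:
        try:
            return train_step(batch_size)
        except MemoryError:  # stand-in for torch.cuda.OutOfMemoryError
            if batch_size <= 1:
                raise
            batch_size = max(1, int(batch_size * factor))  # 0.9 back-off (was 0.5)
```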

## What's Changed

* [typo] shards instead of shard  by @SunMarc in https://github.com/huggingface/accelerate/pull/3645
* Docs: Fix typos in gradient accumulation guide by @kilavvy in https://github.com/huggingface/accelerate/pull/3649
* xpu enablement on left cases  by @yao-matrix in https://github.com/huggingface/accelerate/pull/3654
* unpin datasets in examples requirements by @SunMarc in https://github.com/huggingface/accelerate/pull/3681
* fix: wandb config not saved in offline mode by @ved1beta in https://github.com/huggingface/accelerate/pull/3648
* accelerate/data_loader.py: do not yield if the base_dataloader is empty by @0xnightwind in https://github.com/huggingface/accelerate/pull/3659
* warn for invalid keys by @ved1beta in https://github.com/huggingface/accelerate/pull/3613
* Update Gaudi runner image to latest SynapseAI and enable previously disabled tests by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3653

## New Contributors
* @kilavvy made their first contribution in https://github.com/huggingface/accelerate/pull/3649
* @shimizust made their first contribution in https://github.com/huggingface/accelerate/pull/3657
* @xliu0105 made their first contribution in https://github.com/huggingface/accelerate/pull/3656
* @0xnightwind made their first contribution in https://github.com/huggingface/accelerate/pull/3659

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.8.1...v1.9.0
</Release>

<Release version="v1.8.1" date="June 20, 2025" published="2025-06-20T15:43:33.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.8.1">
## v1.8.1: Patchfix

- Add support for e5e2 and default to hybrid when launcher is used by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3640
- shards by @SunMarc in https://github.com/huggingface/accelerate/pull/3645

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.8.0...v1.8.1
</Release>

<Release version="v1.8.0" date="June 19, 2025" published="2025-06-19T15:37:06.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.8.0">
## v1.8.0: FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation 

# FSDPv2 refactor + FP8 support

We've simplified how to prepare FSDPv2 models, as there were too many ways to compose FSDP2 with other features (e.g., FP8, torch.compile, activation checkpointing, etc.). Although the setup is now more restrictive, it leads to fewer errors and a more performant user experience. We’ve also added support for FP8. You can read about the results [here](https://github.com/huggingface/accelerate/tree/main/examples/fsdp2). Thanks to @S1ro1 for this contribution!

* [FSDP2] Refactor + FP8 by @S1ro1 in https://github.com/huggingface/accelerate/pull/3585

# Faster Distributed Training on Intel CPUs

We updated the `CCL_WORKER_COUNT` variable and added `KMP` parameters for Intel CPU users. This significantly improves distributed training performance (e.g., Tensor Parallelism), with up to a 40% speed-up on Intel 4th Gen Xeon when training transformer TP models.

* Set ccl and KMP param in simple launch by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3575
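In spirit, the launcher change amounts to exporting CCL/KMP environment variables before spawning workers. The notes name `CCL_WORKER_COUNT` and "KMP parameters" (and `KMP_AFFINITY` appears elsewhere in this changelog); the specific variables and values below are our illustrative assumptions, not the launcher's actual defaults.

```python
import os

# Illustrative only: set CPU-distributed-training env vars before launching workers.
# The exact values the launcher picks are its own concern.
os.environ.setdefault("CCL_WORKER_COUNT", "1")
os.environ.setdefault("KMP_AFFINITY", "granularity=fine,compact,1,0")
os.environ.setdefault("KMP_BLOCKTIME", "1")
```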

# Regional Compilation for DeepSpeed

We added support for regional compilation with the DeepSpeed engine. DeepSpeed's `.compile()` modifies models in-place using `torch.nn.Module.compile(...)`, rather than the out-of-place `torch.compile(...)`, so we had to account for that. Thanks @IlyasMoutawwakil for this feature!

* Fix deepspeed regional compilation by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3609

# ipex.optimize deprecation
`ipex.optimize` is being deprecated. Most optimizations have been upstreamed to PyTorch, and future improvements will land there directly. For users without PyTorch 2.8, we’ll continue to rely on IPEX for now.

* remove ipex.optimize in accelerate by @yao-matrix in https://github.com/huggingface/accelerate/pull/3608

# Better XPU Support
We've greatly expanded and stabilized support for Intel XPUs:

* enable fsdp2 benchmark on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3590
* enable big_model_inference on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3595
* enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU by @yao-matrix in 
* enable test_cli & test_example cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3578
* enable torchao and pippy test cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3599 
* enable regional_compilation benchmark on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3592
* fix xpu 8bit value loading by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3623
* add device-agnostic GradScaler by @yao-matrix in https://github.com/huggingface/accelerate/pull/3588
* add xpu support in TorchTensorParallelPlugin by @yao-matrix in https://github.com/huggingface/accelerate/pull/3627

# Trackers

We've added support for [SwanLab](https://github.com/SwanHubX/SwanLab) as an experiment tracking backend. Huge thanks to @ShaohonChen for this contribution ! We also deferred all tracker initializations to prevent premature setup of distributed environments.

* Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in https://github.com/huggingface/accelerate/pull/3605
* Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup by @yuanjua in https://github.com/huggingface/accelerate/pull/3581


## What's Changed
* Fix bf16 training with TP  by @SunMarc in https://github.com/huggingface/accelerate/pull/3610
* better handle FP8 with and without deepspeed by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3611
* Update Gaudi Runners by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3593
* goodbye torch_ccl by @yao-matrix in https://github.com/huggingface/accelerate/pull/3580
* Add support for standalone mode when default port is occupied on single node by @laitifranz in https://github.com/huggingface/accelerate/pull/3576
* Resolve logger warnings by @emmanuel-ferdman in https://github.com/huggingface/accelerate/pull/3582
* Add kwargs to optimizer, scheduler and dataloader using function `accelerator().load_state()` by @luiz0992 in https://github.com/huggingface/accelerate/pull/3540
* [docs] no hard-coded cuda in the ddp documentation by @faaany in https://github.com/huggingface/accelerate/pull/3589
* change to use torch.device by @yao-matrix in https://github.com/huggingface/accelerate/pull/3594
* Fix: list object has no attribute keys by @S1ro1 in https://github.com/huggingface/accelerate/pull/3603
* Remove device_count for TPU launcher to avoid initializing runtime by @sorgfresser in https://github.com/huggingface/accelerate/pull/3587
* Fix missing te.LayerNorm in intel_transformer_engine by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3619
* Add fp8_e5m2 support in `dtype_byte_size` by @SunMarc in https://github.com/huggingface/accelerate/pull/3625
* [Deepspeed] deepspeed auto grad accum by @kashif in https://github.com/huggingface/accelerate/pull/3630
* Remove hardcoded cuda from fsdpv2 by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3631
* Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in https://github.com/huggingface/accelerate/pull/3605
* Fix Typos in Documentation and Comments by @leopardracer in https://github.com/huggingface/accelerate/pull/3621
* feat: use datasets.IterableDataset shard if possible by @SunMarc in https://github.com/huggingface/accelerate/pull/3635
* [DeepSpeed] sync gradient accum steps from deepspeed plugin by @kashif in https://github.com/huggingface/accelerate/pull/3632
* Feat: add cpu offload by @S1ro1 in https://github.com/huggingface/accelerate/pull/3636
* Fix: correct labels for fsdp2 examples by @S1ro1 in https://github.com/huggingface/accelerate/pull/3637
* fix grad acc deepspeed by @SunMarc in https://github.com/huggingface/accelerate/pull/3638

## New Contributors
* @laitifranz made their first contribution in https://github.com/huggingface/accelerate/pull/3576
* @emmanuel-ferdman made their first contribution in https://github.com/huggingface/accelerate/pull/3582
* @yuanjua made their first contribution in https://github.com/huggingface/accelerate/pull/3581
* @sorgfresser made their first contribution in https://github.com/huggingface/accelerate/pull/3587
* @ShaohonChen made their first contribution in https://github.com/huggingface/accelerate/pull/3605
* @leopardracer made their first contribution in https://github.com/huggingface/accelerate/pull/3621

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.7.0...v1.8.0
</Release>

<Release version="v1.7.0" date="May 15, 2025" published="2025-05-15T12:33:48.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.7.0">
## v1.7.0: Regional compilation, Layerwise casting hook, FSDPv2 + QLoRA

# Regional compilation

Instead of compiling the entire model at once, regional compilation targets repeated blocks (such as decoder layers) first. This lets the compiler cache and reuse optimized code for subsequent blocks, significantly reducing the cold-start compilation time typically seen during the first inference. Thanks @IlyasMoutawwakil for the feature! You can view the full benchmark [here](https://github.com/huggingface/accelerate/tree/main/benchmarks/torch.compile), and check out our updated [compilation guide](https://huggingface.co/docs/accelerate/en/usage_guides/compilation) for more details!

![compilation_time-1](https://github.com/user-attachments/assets/38795d12-6ee7-4a10-84c6-d29a0877e36c)

To enable this feature, set `use_regional_compilation=True` in the `TorchDynamoPlugin` configuration.

```python
from accelerate import Accelerator
from accelerate.utils import TorchDynamoPlugin

# Configure the compilation backend
dynamo_plugin = TorchDynamoPlugin(
    use_regional_compilation=True,
    # ... other parameters
)
# Initialize the accelerator with the plugin
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)
# This will apply compile_regions to your model
model = accelerator.prepare(model)
```

# Layerwise casting hook

We've introduced a new hook that enables per-layer upcasting and downcasting (e.g., for Linear layers) during inference. This allows users to run models with separate storage and compute dtypes, resulting in memory savings. The concept was first implemented in [diffusers](https://huggingface.co/docs/diffusers/main/en/optimization/memory#layerwise-casting), where downcasting models to FP8 proved effective without major quality degradation. Contributed by @sayakpaul in https://github.com/huggingface/accelerate/pull/3427

```python
import torch
from torch import nn
from accelerate.hooks import attach_layerwise_casting_hooks

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))  # any nn.Module
storage_dtype = torch.float8_e4m3fn  # weights stored in FP8 between forward passes
compute_dtype = torch.bfloat16       # layers upcast to BF16 for the forward pass
attach_layerwise_casting_hooks(
    model,
    storage_dtype=storage_dtype,
    compute_dtype=compute_dtype,
)
```


# Better FSDP2 support 

This release includes numerous new features and bug fixes. Notably, we’ve added support for `FULL_STATE_DICT`, a widely used option in FSDP, now enabling `.save_pretrained()` in transformers to work with FSDP2 wrapped models. QLoRA training is now supported as well but more testing is needed. We have also resolved a backend issue related to parameter offloading to CPU. Additionally, a significant memory spike that occurred when `cpu_ram_efficient_loading=True` was enabled has been fixed. Several other minor improvements and fixes are also included—see the **What’s Changed** section for full details.

- `FULL_STATE_DICT` has been enabled by @S1ro1 in https://github.com/huggingface/accelerate/pull/3527
- QLoRA support by @winglian in https://github.com/huggingface/accelerate/pull/3546
- Set backend correctly for CUDA+FSDP2+cpu-offload by @SunMarc in https://github.com/huggingface/accelerate/pull/3574
- Memory spike fixed when using `cpu_ram_efficient_loading=True` by @S1ro1 in https://github.com/huggingface/accelerate/pull/3482
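With `FULL_STATE_DICT` enabled, saving an FSDP2-wrapped model gathers a full, unsharded state dict, which is what lets `.save_pretrained()` work. A minimal YAML sketch of such a config (key names assumed to follow Accelerate's FSDP config format, not taken verbatim from this release):

```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
  fsdp_state_dict_type: FULL_STATE_DICT  # gather full weights on save
  fsdp_cpu_ram_efficient_loading: true   # the memory spike with this flag was fixed in #3482
```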

# Better HPU support

We have added [documentation](https://huggingface.co/docs/accelerate/en/usage_guides/gaudi) for Intel Gaudi hardware!
Support has been available since v1.5.0 through this [PR](https://github.com/huggingface/accelerate/pull/3378).

- Add the HPU into accelerate config by @yuanwu2017 in https://github.com/huggingface/accelerate/pull/3495
- Add Gaudi doc by @regisss in https://github.com/huggingface/accelerate/pull/3537

# Torch.compile breaking change for `dynamic` argument

We've updated the logic for setting `self.dynamic` to explicitly preserve `None` rather than defaulting to `False` when the `USE_DYNAMIC` environment variable is unset. This change aligns the behavior with the PyTorch documentation for [torch.compile](https://docs.pytorch.org/stable/generated/torch.compile.html). Thanks to @yafshar for contributing this improvement in [#3567](https://github.com/huggingface/accelerate/pull/3567).
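The change can be illustrated with a small sketch (the helper below is hypothetical, and the exact environment-variable name is assumed; Accelerate reads it internally):

```python
import os

def parse_dynamic(environ=os.environ):
    """Hypothetical sketch of the None-preserving lookup. An unset
    ACCELERATE_DYNAMO_USE_DYNAMIC stays None so torch.compile can
    auto-detect dynamic shapes; a set value is parsed as a boolean."""
    raw = environ.get("ACCELERATE_DYNAMO_USE_DYNAMIC")
    if raw is None:
        return None  # preserved, no longer coerced to False
    return raw.lower() in ("1", "true", "yes")
```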

## What's Changed
* use device agnostic torch.OutOfMemoryError from pytorch 2.5.0 by @yao-matrix in https://github.com/huggingface/accelerate/pull/3475
* Adds style bot by @zach-huggingface in https://github.com/huggingface/accelerate/pull/3478
* Fix a tiny typo in `low_precision_training` guide by @sadra-barikbin in https://github.com/huggingface/accelerate/pull/3488
* Fix check_tied_parameters_in_config for multimodal models by @SunMarc in https://github.com/huggingface/accelerate/pull/3479
* Don't create new param for TorchAO sequential offloading due to weak BC guarantees by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3444
* add support for custom function for reducing the batch size by @winglian in https://github.com/huggingface/accelerate/pull/3071
* Fix fp8 deepspeed config by @SunMarc in https://github.com/huggingface/accelerate/pull/3492
* fix warning error by @faaany in https://github.com/huggingface/accelerate/pull/3491
* [bug] unsafe_serialization option in "merge-weights" doesn't work by @cyr0930 in https://github.com/huggingface/accelerate/pull/3496
* Add the HPU into accelerate config by @yuanwu2017 in https://github.com/huggingface/accelerate/pull/3495
* Use `torch.distributed.checkpoint.state_dict.set_model_state_dict` in `load_checkpoint_in_model` by @ringohoffman in https://github.com/huggingface/accelerate/pull/3432
* nit: needed sanity checks for fsdp2 by @kmehant in https://github.com/huggingface/accelerate/pull/3499
* (Part 1) fix: make TP training compatible with new transformers by @kmehant in https://github.com/huggingface/accelerate/pull/3457
* Fix deepspeed tests by @S1ro1 in https://github.com/huggingface/accelerate/pull/3503
* Add FP8 runners + tweak building FP8 image by @zach-huggingface in https://github.com/huggingface/accelerate/pull/3493
* fix: apply torchfix to set `weights_only=True` by @bzhong-solink in https://github.com/huggingface/accelerate/pull/3497
* Fix: require transformers version for tp tests by @S1ro1 in https://github.com/huggingface/accelerate/pull/3504
* Remove deprecated PyTorch/XLA APIs by @zpcore in https://github.com/huggingface/accelerate/pull/3484
* Fix cache issue by upgrading github actions version  by @SunMarc in https://github.com/huggingface/accelerate/pull/3513
* [Feat] Layerwise casting hook by @sayakpaul in https://github.com/huggingface/accelerate/pull/3427
* Add torchao to FP8 error message by @jphme in https://github.com/huggingface/accelerate/pull/3514
* Fix unwanted cuda init due to torchao by @SunMarc in https://github.com/huggingface/accelerate/pull/3530
* Solve link error in internal_mechanism documentation (#3506) by @alvaro-mazcu in https://github.com/huggingface/accelerate/pull/3507
* [FSDP2] Enable FULL_STATE_DICT by @S1ro1 in https://github.com/huggingface/accelerate/pull/3527
* [FSDP2] Fix memory spike with `cpu_ram_efficient_loading=True` by @S1ro1 in https://github.com/huggingface/accelerate/pull/3482
* [FSDP2] Issues in Wrap Policy and Mixed Precision by @jhliu17 in https://github.com/huggingface/accelerate/pull/3528
* Fix logic in `accelerator.prepare` + IPEX for 2+ `nn.Models` and/or `optim.Optimizers` by @mariusarvinte in https://github.com/huggingface/accelerate/pull/3517
* Update Docker builds to align with CI requirements by @matthewdouglas in https://github.com/huggingface/accelerate/pull/3532
* Fix CI due to missing package  by @SunMarc in https://github.com/huggingface/accelerate/pull/3535
* Update big_modeling.md for layerwise casting by @sayakpaul in https://github.com/huggingface/accelerate/pull/3548
* [FSDP2] Fix: "..." is not a buffer or a paremeter by @S1ro1 in
* fix notebook_launcher for Colab TPU compatibility. by @BogdanDidenko in https://github.com/huggingface/accelerate/pull/3541
* Fix typos by @omahs in https://github.com/huggingface/accelerate/pull/3549
* Dynamo regional compilation by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3529
* add support for port 0 auto-selection in multi-GPU environments by @hellobiondi in https://github.com/huggingface/accelerate/pull/3501
* Fix the issue where `set_epoch` does not take effect. by @hongjx175 in https://github.com/huggingface/accelerate/pull/3556
* [FSDP2] Fix casting in `_cast_and_contiguous` by @dlvp in https://github.com/huggingface/accelerate/pull/3559
* [FSDP] Make env var and dataclass flag consistent for `cpu_ram_efficient_loading`  by @SumanthRH in https://github.com/huggingface/accelerate/pull/3307
* canonicalize optimized names before fixing optimizer in fdsp2 by @pstjohn in https://github.com/huggingface/accelerate/pull/3560
* [docs] update deepspeed config path by @faaany in https://github.com/huggingface/accelerate/pull/3561
* preserve parameter keys when removing  prefix by @mjkvaak-amd in https://github.com/huggingface/accelerate/pull/3564
* Add Gaudi doc by @regisss in https://github.com/huggingface/accelerate/pull/3537
* Update dynamic env handling to preserve None when USE_DYNAMIC is unset by @yafshar in https://github.com/huggingface/accelerate/pull/3567
* add a `synchronize` call for xpu in `_gpu_gather` by @faaany in https://github.com/huggingface/accelerate/pull/3563
* simplify model.to logic by @yao-matrix in https://github.com/huggingface/accelerate/pull/3562
* tune env command output by @yao-matrix in https://github.com/huggingface/accelerate/pull/3570
* Add regional compilation to cli tools and env vars by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3572
* reenable FSDP2+qlora support by @winglian in https://github.com/huggingface/accelerate/pull/3546
* Fix prevent duplicate GPU usage in distributed processing by @ved1beta in https://github.com/huggingface/accelerate/pull/3526
* set backend correctly for CUDA+FSDP2+cpu-offload by @SunMarc in https://github.com/huggingface/accelerate/pull/3574
* enable test_dispatch_model_tied_weights_memory_with_nested_offload_cpu on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3569


## New Contributors
* @zach-huggingface made their first contribution in https://github.com/huggingface/accelerate/pull/3478
* @sadra-barikbin made their first contribution in https://github.com/huggingface/accelerate/pull/3488
* @ringohoffman made their first contribution in https://github.com/huggingface/accelerate/pull/3432
* @bzhong-solink made their first contribution in https://github.com/huggingface/accelerate/pull/3497
* @zpcore made their first contribution in https://github.com/huggingface/accelerate/pull/3484
* @jphme made their first contribution in https://github.com/huggingface/accelerate/pull/3514
* @alvaro-mazcu made their first contribution in https://github.com/huggingface/accelerate/pull/3507
* @jhliu17 made their first contribution in https://github.com/huggingface/accelerate/pull/3528
* @BogdanDidenko made their first contribution in https://github.com/huggingface/accelerate/pull/3541
* @hellobiondi made their first contribution in https://github.com/huggingface/accelerate/pull/3501
* @hongjx175 made their first contribution in https://github.com/huggingface/accelerate/pull/3556
* @dlvp made their first contribution in https://github.com/huggingface/accelerate/pull/3559
* @pstjohn made their first contribution in https://github.com/huggingface/accelerate/pull/3560
* @mjkvaak-amd made their first contribution in https://github.com/huggingface/accelerate/pull/3564
* @yafshar made their first contribution in https://github.com/huggingface/accelerate/pull/3567
* @ved1beta made their first contribution in https://github.com/huggingface/accelerate/pull/3526

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.6.0...v1.7.0
</Release>

<Release version="v1.6.0" date="April 1, 2025" published="2025-04-01T13:48:21.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.6.0">
## v1.6.0: FSDPv2, DeepSpeed TP and XCCL backend support

# FSDPv2 support
This release introduces support for [FSDPv2](https://pytorch.org/docs/stable/distributed.fsdp.fully_shard.html) thanks to @S1ro1.

If you are using Python code, you need to set `fsdp_version=2` in `FullyShardedDataParallelPlugin`:
```python
from accelerate import FullyShardedDataParallelPlugin, Accelerator

fsdp_plugin = FullyShardedDataParallelPlugin(
    fsdp_version=2,
    # other options...
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```

If you want to convert a YAML config from FSDPv1 to FSDPv2, use our conversion tool:
```
accelerate to-fsdp2 --config_file config.yaml --output_file new_config.yaml
```

To learn more about the difference between FSDPv1 and FSDPv2, read the following [documentation](https://huggingface.co/docs/accelerate/main/en/concept_guides/fsdp1_vs_fsdp2). 
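As a rough illustration of what the converter rewrites (the exact key mapping here is assumed, not taken from this release), FSDPv2 drops `fsdp_sharding_strategy` in favor of `fsdp_reshard_after_forward`, so a v1 config like:

```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
```

would become, after `accelerate to-fsdp2`:

```yaml
distributed_type: FSDP
fsdp_config:
  fsdp_version: 2
  fsdp_reshard_after_forward: true  # FULL_SHARD equivalent
  fsdp_state_dict_type: SHARDED_STATE_DICT
```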

# DeepSpeed TP support

We have added initial support for DeepSpeed + TP. Not many changes were required, as the DeepSpeed APIs were already compatible. We only needed to make sure that the dataloader was compatible with TP and that we were able to save the TP weights. Thanks @inkcherry for the work! https://github.com/huggingface/accelerate/pull/3390

To use TP with DeepSpeed, update the DeepSpeed config file to include the `tensor_parallel` key:
```
    ...
    "tensor_parallel": {
        "autotp_size": ${autotp_size}
    },
    ...
```
More details in this deepspeed [PR](https://github.com/deepspeedai/DeepSpeed/pull/6922). 


# Support for XCCL distributed backend

We've added support for XCCL, an Intel distributed backend for XPU devices. More details in this torch [PR](https://github.com/pytorch/pytorch/issues/141741). Thanks @dvrogozh for the [integration](https://github.com/huggingface/accelerate/pull/3401)!

## What's Changed
* Add `log_artifact`, `log_artifacts` and `log_figure` capabilities to the MLflowTracker. by @luiz0992 in https://github.com/huggingface/accelerate/pull/3419
* tensor parallel dataloder for deepspeed accelerator by @inkcherry in https://github.com/huggingface/accelerate/pull/3390
* Fix prod issues by @muellerzr in https://github.com/huggingface/accelerate/pull/3441
* Fix attribute issue with deepspeed tp by @SunMarc in https://github.com/huggingface/accelerate/pull/3443
* Fixed typo in the multi node FSDP slurm example script by @JacobB33 in https://github.com/huggingface/accelerate/pull/3447
* feat: Add no_ssh and slurm multinode launcher options for deepspeed by @hsmallbone in https://github.com/huggingface/accelerate/pull/3329
* Fixup ao module filter func by @muellerzr in https://github.com/huggingface/accelerate/pull/3450
* remove device index workaround on xpu since xpu supports integer device index as cuda now by @yao-matrix in https://github.com/huggingface/accelerate/pull/3448
* enable 2 UT cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3445
* Fix AMD GPU support with should_reduce_batch_size() by @cameronshinn in https://github.com/huggingface/accelerate/pull/3405
* Fix device KeyError in tied_params_map by @dvrogozh in https://github.com/huggingface/accelerate/pull/3403
* Initial FSDP2 support by @S1ro1 in https://github.com/huggingface/accelerate/pull/3394
* Fix: clip grad norm in fsdp2 by @S1ro1 in https://github.com/huggingface/accelerate/pull/3465
* Update @ by @muellerzr in https://github.com/huggingface/accelerate/pull/3466
* Fix seeding of new generator for multi GPU by @albertcthomas in https://github.com/huggingface/accelerate/pull/3459
* Fix get_balanced_memory for MPS by @booxter in https://github.com/huggingface/accelerate/pull/3464
* Update CometMLTracker to allow re-using experiment by @Lothiraldan in https://github.com/huggingface/accelerate/pull/3328
* Apply ruff py39 fixes by @cyyever in https://github.com/huggingface/accelerate/pull/3461
* xpu: enable xccl distributed backend by @dvrogozh in https://github.com/huggingface/accelerate/pull/3401
* Update ruff target-version to py39 and apply more fixes by @cyyever in https://github.com/huggingface/accelerate/pull/3470
* [MLU] fix deepspeed dependency  by @huismiling in https://github.com/huggingface/accelerate/pull/3472
* remove use_xpu to fix ut issues, we don't need this since XPU is OOB … by @yao-matrix in https://github.com/huggingface/accelerate/pull/3460
* Bump ruff to 0.11.2 by @cyyever in https://github.com/huggingface/accelerate/pull/3471

## New Contributors
* @luiz0992 made their first contribution in https://github.com/huggingface/accelerate/pull/3419
* @inkcherry made their first contribution in https://github.com/huggingface/accelerate/pull/3390
* @JacobB33 made their first contribution in https://github.com/huggingface/accelerate/pull/3447
* @hsmallbone made their first contribution in https://github.com/huggingface/accelerate/pull/3329
* @yao-matrix made their first contribution in https://github.com/huggingface/accelerate/pull/3448
* @cameronshinn made their first contribution in https://github.com/huggingface/accelerate/pull/3405
* @S1ro1 made their first contribution in https://github.com/huggingface/accelerate/pull/3394
* @albertcthomas made their first contribution in https://github.com/huggingface/accelerate/pull/3459
* @booxter made their first contribution in https://github.com/huggingface/accelerate/pull/3464
* @Lothiraldan made their first contribution in https://github.com/huggingface/accelerate/pull/3328
* @cyyever made their first contribution in https://github.com/huggingface/accelerate/pull/3461

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.5.2...v1.6.0
</Release>

<Release version="v1.5.2" date="March 14, 2025" published="2025-03-14T14:16:16.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.5.2">
## Patch: v1.5.2

**Bug Fixes**:
* Fixed an issue with `torch.get_default_device()` requiring a higher version than what we support
* Fixed a broken `pytest` import in prod

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.5.0...v1.5.2
</Release>

<Release version="v1.5.0" date="March 12, 2025" published="2025-03-12T14:18:54.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.5.0">
## v1.5.0: HPU support

## HPU Support
* Adds HPU accelerator support for 🤗 Accelerate


## What's Changed
* [bug] fix device index bug for model training loaded with bitsandbytes by @faaany in https://github.com/huggingface/accelerate/pull/3408
* [docs] add the missing `import torch` by @faaany in https://github.com/huggingface/accelerate/pull/3396
* minor doc fixes by @nbroad1881 in https://github.com/huggingface/accelerate/pull/3365
* fix: ensure CLI args take precedence over config file. by @cyr0930 in https://github.com/huggingface/accelerate/pull/3409
* fix: Add `device=torch.get_default_device()` in `torch.Generator`s by @saforem2 in https://github.com/huggingface/accelerate/pull/3420
* Add Tecorigin SDAA accelerator support by @siqi654321 in https://github.com/huggingface/accelerate/pull/3330
* fix typo : thier -> their by @hackty in https://github.com/huggingface/accelerate/pull/3423
* Fix quality by @muellerzr in https://github.com/huggingface/accelerate/pull/3424
* Distributed inference example for llava_next by @VladOS95-cyber in https://github.com/huggingface/accelerate/pull/3417
* HPU support by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3378

## New Contributors
* @cyr0930 made their first contribution in https://github.com/huggingface/accelerate/pull/3409
* @saforem2 made their first contribution in https://github.com/huggingface/accelerate/pull/3420
* @siqi654321 made their first contribution in https://github.com/huggingface/accelerate/pull/3330
* @hackty made their first contribution in https://github.com/huggingface/accelerate/pull/3423
* @VladOS95-cyber made their first contribution in https://github.com/huggingface/accelerate/pull/3417
* @IlyasMoutawwakil made their first contribution in https://github.com/huggingface/accelerate/pull/3378

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.4.0...v1.5.0
</Release>

<Release version="v1.4.0" date="February 17, 2025" published="2025-02-17T17:18:10.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.4.0">
## v1.4.0: `torchao` FP8, TP & DataLoader support, memory leak fix

## `torchao` FP8, initial Tensor Parallel support, and memory leak fixes

### `torchao` FP8

This release introduces a new FP8 API and brings in a new backend: [`torchao`](https://github.com/pytorch/ao/tree/main/torchao/float8). To use it, pass `AORecipeKwargs` to the `Accelerator` while setting `mixed_precision="fp8"`. This is initial support; as it matures, we will incorporate more into it (such as `accelerate config`/YAML support) in future releases. See our benchmark examples [here](https://github.com/huggingface/accelerate/tree/main/benchmarks/fp8/torchao).

## TensorParallel

We have initial support for an in-house solution to TP when working with accelerate dataloaders. Check out the PR [here](https://github.com/huggingface/accelerate/pull/3173).

## Bug fixes
* fix triton version check by @faaany in https://github.com/huggingface/accelerate/pull/3345
* fix torch_dtype in estimate memory by @SunMarc in https://github.com/huggingface/accelerate/pull/3383
* works for fp8 with deepspeed by @XiaobingSuper in https://github.com/huggingface/accelerate/pull/3361
* [`memory leak`] Replace GradientState -> DataLoader reference with weakrefs by @tomaarsen in https://github.com/huggingface/accelerate/pull/3391

## What's Changed
* fix triton version check by @faaany in https://github.com/huggingface/accelerate/pull/3345
* [tests] enable BNB test cases in `tests/test_quantization.py` on XPU by @faaany in https://github.com/huggingface/accelerate/pull/3349
* [Dev] Update release directions by @muellerzr in https://github.com/huggingface/accelerate/pull/3352
* [tests] make cuda-only test work on other hardware accelerators by @faaany in https://github.com/huggingface/accelerate/pull/3302
* [tests] remove `require_non_xpu` test markers by @faaany in https://github.com/huggingface/accelerate/pull/3301
* Support more functionalities for MUSA backend by @fmo-mt in https://github.com/huggingface/accelerate/pull/3359
* [tests] enable more bnb tests on XPU   by @faaany in https://github.com/huggingface/accelerate/pull/3350
* feat: support tensor parallel & Data loader by @kmehant in https://github.com/huggingface/accelerate/pull/3173
* DeepSpeed github repo move sync by @stas00 in https://github.com/huggingface/accelerate/pull/3376
* [tests] Fix bnb cpu error by @faaany in https://github.com/huggingface/accelerate/pull/3351
* fix torch_dtype in estimate memory by @SunMarc in https://github.com/huggingface/accelerate/pull/3383
* works for fp8 with deepspeed by @XiaobingSuper in https://github.com/huggingface/accelerate/pull/3361
* fix: typos in documentation files by @maximevtush in https://github.com/huggingface/accelerate/pull/3388
* [examples] upgrade code for seed setting   by @faaany in https://github.com/huggingface/accelerate/pull/3387
* [`memory leak`] Replace GradientState -> DataLoader reference with weakrefs by @tomaarsen in https://github.com/huggingface/accelerate/pull/3391
* add xpu check in `get_quantized_model_device_map` by @faaany in https://github.com/huggingface/accelerate/pull/3397
* Torchao float8 training by @muellerzr in https://github.com/huggingface/accelerate/pull/3348

## New Contributors
* @kmehant made their first contribution in https://github.com/huggingface/accelerate/pull/3173
* @XiaobingSuper made their first contribution in https://github.com/huggingface/accelerate/pull/3361
* @maximevtush made their first contribution in https://github.com/huggingface/accelerate/pull/3388

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.3.0...v1.4.0
</Release>

<Release version="v1.3.0" date="January 17, 2025" published="2025-01-17T15:56:18.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.3.0">
## v1.3.0: Bug fixes + require torch 2.0

## Torch 2.0
As it has been ~2 years since torch 2.0 was first released, we now require it as the **minimum version for Accelerate**, as `transformers` similarly did in its last release.

## Core
* [docs] no hard-coding cuda by @faaany in https://github.com/huggingface/accelerate/pull/3270
* fix load_state_dict for npu by @ji-huazhong in https://github.com/huggingface/accelerate/pull/3211
* Add `keep_torch_compile` param to `unwrap_model` and `extract_model_from_parallel` for distributed compiled model. by @ggoggam in https://github.com/huggingface/accelerate/pull/3282
* [tests] make cuda-only test case device-agnostic by @faaany in https://github.com/huggingface/accelerate/pull/3340
* latest bnb no longer has optim_args attribute on optimizer by @winglian in https://github.com/huggingface/accelerate/pull/3311
* add torchdata version check to avoid "in_order" error by @faaany in https://github.com/huggingface/accelerate/pull/3344
* [docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in https://github.com/huggingface/accelerate/pull/3296
* dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in https://github.com/huggingface/accelerate/pull/3346
* feat(tpu): remove nprocs from xla.spawn by @tengomucho in https://github.com/huggingface/accelerate/pull/3324

## Big Modeling
* Fix test_nested_hook by @SunMarc in https://github.com/huggingface/accelerate/pull/3289
* correct the return statement of _init_infer_auto_device_map by @Nech-C in https://github.com/huggingface/accelerate/pull/3279
* Use torch.xpu.mem_get_info for XPU by @dvrogozh in https://github.com/huggingface/accelerate/pull/3275
* Ensure that tied parameter is children of module by @pablomlago in https://github.com/huggingface/accelerate/pull/3327
* Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3332
* Fix offload generate tests by @SunMarc in https://github.com/huggingface/accelerate/pull/3334

## Examples
* Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in https://github.com/huggingface/accelerate/pull/3193

## What's Changed
* [docs] no hard-coding cuda by @faaany in https://github.com/huggingface/accelerate/pull/3270
* fix load_state_dict for npu by @ji-huazhong in https://github.com/huggingface/accelerate/pull/3211
* Fix test_nested_hook by @SunMarc in https://github.com/huggingface/accelerate/pull/3289
* correct the return statement of _init_infer_auto_device_map by @Nech-C in https://github.com/huggingface/accelerate/pull/3279
* Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in https://github.com/huggingface/accelerate/pull/3193
* Use torch.xpu.mem_get_info for XPU by @dvrogozh in https://github.com/huggingface/accelerate/pull/3275
* Add `keep_torch_compile` param to `unwrap_model` and `extract_model_from_parallel` for distributed compiled model. by @ggoggam in https://github.com/huggingface/accelerate/pull/3282
* Ensure that tied parameter is children of module by @pablomlago in https://github.com/huggingface/accelerate/pull/3327
* Bye bye torch <2 by @muellerzr in https://github.com/huggingface/accelerate/pull/3331
* Fixup docker build err by @muellerzr in https://github.com/huggingface/accelerate/pull/3333
* feat(tpu): remove nprocs from xla.spawn by @tengomucho in https://github.com/huggingface/accelerate/pull/3324
* Fix offload generate tests by @SunMarc in https://github.com/huggingface/accelerate/pull/3334
* [tests] make cuda-only test case device-agnostic by @faaany in https://github.com/huggingface/accelerate/pull/3340
* latest bnb no longer has optim_args attribute on optimizer by @winglian in https://github.com/huggingface/accelerate/pull/3311
* Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3332
* add torchdata version check to avoid "in_order" error by @faaany in https://github.com/huggingface/accelerate/pull/3344
* [docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in https://github.com/huggingface/accelerate/pull/3296
* dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in https://github.com/huggingface/accelerate/pull/3346

## New Contributors
* @ylacombe made their first contribution in https://github.com/huggingface/accelerate/pull/3193
* @ggoggam made their first contribution in https://github.com/huggingface/accelerate/pull/3282
* @pablomlago made their first contribution in https://github.com/huggingface/accelerate/pull/3327
* @tengomucho made their first contribution in https://github.com/huggingface/accelerate/pull/3324
* @suchot made their first contribution in https://github.com/huggingface/accelerate/pull/3296

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.2.1...v1.3.0
</Release>

<Release version="v1.2.1" date="December 13, 2024" published="2024-12-13T18:56:09.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.2.1">
## v1.2.1: Patchfix

* fix: add max_memory to _init_infer_auto_device_map's return statement in https://github.com/huggingface/accelerate/pull/3279 by @Nech-C 
* fix load_state_dict for npu in https://github.com/huggingface/accelerate/pull/3211 by @statelesshz 

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.2.0...v1.2.1
</Release>

<Release version="v1.2.0" date="December 13, 2024" published="2024-12-13T18:47:06.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.2.0">
## v1.2.0: Bug Squashing & Fixes across the board

## Core
* enable `find_executable_batch_size` on XPU by @faaany in https://github.com/huggingface/accelerate/pull/3236
* Use `numpy._core` instead of `numpy.core` by @qgallouedec in https://github.com/huggingface/accelerate/pull/3247
* Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in https://github.com/huggingface/accelerate/pull/3066
* Allow for full dynamo config passed to Accelerator by @muellerzr in https://github.com/huggingface/accelerate/pull/3251
* [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3252
* [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in https://github.com/huggingface/accelerate/pull/3246
* use XPU instead of GPU in the `accelerate config` prompt text by @faaany in https://github.com/huggingface/accelerate/pull/3268

## Big Modeling
* Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in https://github.com/huggingface/accelerate/pull/3217
* Remove hook for bnb 4-bit  by @SunMarc in https://github.com/huggingface/accelerate/pull/3223
* [docs] add instruction to install bnb on non-cuda devices by @faaany in https://github.com/huggingface/accelerate/pull/3227
* Take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in https://github.com/huggingface/accelerate/pull/3226
* Update deferring_execution.md by @max-yue in https://github.com/huggingface/accelerate/pull/3262
* Revert default behavior of `get_state_dict_from_offload` by @kylesayrs in https://github.com/huggingface/accelerate/pull/3253
* Fix: Resolve #3060, `preload_module_classes` is lost for nested modules by @wejoncy in https://github.com/huggingface/accelerate/pull/3248

## DeepSpeed
* Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in https://github.com/huggingface/accelerate/pull/3255
* support for wrapped schedulefree optimizer when using deepspeed by @winglian in https://github.com/huggingface/accelerate/pull/3266

## Documentation

* Update code in tracking documentation  by @faaany in https://github.com/huggingface/accelerate/pull/3235
* Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in https://github.com/huggingface/accelerate/pull/3259
* Update set-seed by @faaany in https://github.com/huggingface/accelerate/pull/3228
* Fix typo by @faaany in https://github.com/huggingface/accelerate/pull/3221
* Use real path for `checkpoint` by @faaany in https://github.com/huggingface/accelerate/pull/3220
* Fixed multiple typos for Tutorials and Guides docs by @henryhmko in https://github.com/huggingface/accelerate/pull/3274

## New Contributors
* @winglian made their first contribution in https://github.com/huggingface/accelerate/pull/3266
* @max-yue made their first contribution in https://github.com/huggingface/accelerate/pull/3262
* @as12138 made their first contribution in https://github.com/huggingface/accelerate/pull/3261
* @relh made their first contribution in https://github.com/huggingface/accelerate/pull/3259
* @wejoncy made their first contribution in https://github.com/huggingface/accelerate/pull/3248
* @henryhmko made their first contribution in https://github.com/huggingface/accelerate/pull/3274


## Full Changelog
* Fix `align_module_device`, ensure only cpu tensors for `get_state_dict_offloaded_model` by @kylesayrs in https://github.com/huggingface/accelerate/pull/3217
* remove hook for bnb 4-bit  by @SunMarc in https://github.com/huggingface/accelerate/pull/3223
* enable `find_executable_batch_size` on XPU by @faaany in https://github.com/huggingface/accelerate/pull/3236
* take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in https://github.com/huggingface/accelerate/pull/3226
* [docs] update code in tracking documentation  by @faaany in https://github.com/huggingface/accelerate/pull/3235
* Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in https://github.com/huggingface/accelerate/pull/3066
* [`data_loader`] Optionally also propagate set_epoch to batch sampler by @tomaarsen in https://github.com/huggingface/accelerate/pull/3246
* [docs] add instruction to install bnb on non-cuda devices by @faaany in https://github.com/huggingface/accelerate/pull/3227
* Use `numpy._core` instead of `numpy.core` by @qgallouedec in https://github.com/huggingface/accelerate/pull/3247
* Allow for full dynamo config passed to Accelerator by @muellerzr in https://github.com/huggingface/accelerate/pull/3251
* [WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3252
* use XPU instead of GPU in the `accelerate config` prompt text by @faaany in https://github.com/huggingface/accelerate/pull/3268
* support for wrapped schedulefree optimizer when using deepspeed by @winglian in https://github.com/huggingface/accelerate/pull/3266
* Update deferring_execution.md by @max-yue in https://github.com/huggingface/accelerate/pull/3262
* Fix: Resolve #3257 by @as12138 in https://github.com/huggingface/accelerate/pull/3261
* Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in https://github.com/huggingface/accelerate/pull/3259
* Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in https://github.com/huggingface/accelerate/pull/3255
* Revert default behavior of `get_state_dict_from_offload` by @kylesayrs in https://github.com/huggingface/accelerate/pull/3253
* Fix: Resolve #3060, `preload_module_classes` is lost for nested modules by @wejoncy in https://github.com/huggingface/accelerate/pull/3248
* [docs] update set-seed by @faaany in https://github.com/huggingface/accelerate/pull/3228
* [docs] fix typo by @faaany in https://github.com/huggingface/accelerate/pull/3221
* [docs] use real path for `checkpoint` by @faaany in https://github.com/huggingface/accelerate/pull/3220
* Fixed multiple typos for Tutorials and Guides docs by @henryhmko in https://github.com/huggingface/accelerate/pull/3274

## Code Diff
Release diff: https://github.com/huggingface/accelerate/compare/v1.1.1...v1.2.0
</Release>

<Release version="v1.1.0" date="November 1, 2024" published="2024-11-01T15:30:17.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.1.0">
## v1.1.0: Python 3.9 minimum, torch dynamo deepspeed support, and bug fixes

## Internals:
* Allow for a `data_seed` argument in https://github.com/huggingface/accelerate/pull/3150
* Trigger `weights_only=True` by default for all compatible objects when checkpointing and saving with `torch.save` in https://github.com/huggingface/accelerate/pull/3036
* Handle negative values for `dim` input in `pad_across_processes` in https://github.com/huggingface/accelerate/pull/3114
* Enable cpu bnb distributed lora finetune in https://github.com/huggingface/accelerate/pull/3159

## DeepSpeed
* Support torch dynamo for deepspeed>=0.14.4 in https://github.com/huggingface/accelerate/pull/3069

## Megatron
* update Megatron-LM plugin code to version 0.8.0 or higher in https://github.com/huggingface/accelerate/pull/3174

## Big Model Inference
* New `has_offloaded_params` utility added in https://github.com/huggingface/accelerate/pull/3188

## Examples
* Florence2 distributed inference example in https://github.com/huggingface/accelerate/pull/3123

## Full Changelog
* Handle negative values for `dim` input in `pad_across_processes` by @mariusarvinte in https://github.com/huggingface/accelerate/pull/3114
* Fixup DS issue with weakref by @muellerzr in https://github.com/huggingface/accelerate/pull/3143
* Refactor scaler to util by @muellerzr in https://github.com/huggingface/accelerate/pull/3142
* DS fix, continued by @muellerzr in https://github.com/huggingface/accelerate/pull/3145
* Florence2 distributed inference example by @hlky in https://github.com/huggingface/accelerate/pull/3123
* POC: Allow for a `data_seed` by @muellerzr in https://github.com/huggingface/accelerate/pull/3150
* Adding multi gpu speech generation by @dame-cell in https://github.com/huggingface/accelerate/pull/3149
* support torch dynamo for deepspeed>=0.14.4 by @oraluben in https://github.com/huggingface/accelerate/pull/3069
* Fixup Zero3 + `save_model` by @muellerzr in https://github.com/huggingface/accelerate/pull/3146
* Trigger `weights_only=True` by default for all compatible objects by @muellerzr in https://github.com/huggingface/accelerate/pull/3036
* Remove broken dynamo test by @oraluben in https://github.com/huggingface/accelerate/pull/3155
* fix version check bug in `get_xpu_available_memory` by @faaany in https://github.com/huggingface/accelerate/pull/3165
* enable cpu bnb distributed lora finetune by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3159
* [Utils] `has_offloaded_params` by @kylesayrs in https://github.com/huggingface/accelerate/pull/3188
* fix bnb by @eljandoubi in https://github.com/huggingface/accelerate/pull/3186
* [docs] update neptune API by @faaany in https://github.com/huggingface/accelerate/pull/3181
* docs: fix a wrong word in comment in src/accelerate/accelerate.py:1255 by @Rebornix-zero in https://github.com/huggingface/accelerate/pull/3183
* [docs] use nn.module instead of tensor as model  by @faaany in https://github.com/huggingface/accelerate/pull/3157
* Fix typo by @kylesayrs in https://github.com/huggingface/accelerate/pull/3191
* MLU devices : Checks if mlu is available via an cndev-based check which won't trigger the drivers and leave mlu by @huismiling in https://github.com/huggingface/accelerate/pull/3187
* update Megatron-LM plugin code to version 0.8.0 or higher. by @eljandoubi in https://github.com/huggingface/accelerate/pull/3174
* 🚨 🚨 🚨 Goodbye Python 3.8! 🚨 🚨 🚨  by @muellerzr in https://github.com/huggingface/accelerate/pull/3194
* Update transformers.deepspeed references from transformers 4.46.0 release by @loadams in https://github.com/huggingface/accelerate/pull/3196
* eliminate dead code by @statelesshz in https://github.com/huggingface/accelerate/pull/3198
* take `torch.nn.Module` model into account when moving to device   by @faaany in https://github.com/huggingface/accelerate/pull/3167
* [docs] add xpu part and fix bug in `torchrun`  by @faaany in https://github.com/huggingface/accelerate/pull/3166
* Models With Tied Weights Need Re-Tieing After FSDP Param Init by @fabianlim in https://github.com/huggingface/accelerate/pull/3154
* add the missing xpu for local sgd by @faaany in https://github.com/huggingface/accelerate/pull/3163
* typo fix in big_modeling.py by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3207
* [Utils] `align_module_device` by @kylesayrs in https://github.com/huggingface/accelerate/pull/3204

## New Contributors
* @mariusarvinte made their first contribution in https://github.com/huggingface/accelerate/pull/3114
* @hlky made their first contribution in https://github.com/huggingface/accelerate/pull/3123
* @dame-cell made their first contribution in https://github.com/huggingface/accelerate/pull/3149
* @kylesayrs made their first contribution in https://github.com/huggingface/accelerate/pull/3188
* @eljandoubi made their first contribution in https://github.com/huggingface/accelerate/pull/3186
* @Rebornix-zero made their first contribution in https://github.com/huggingface/accelerate/pull/3183
* @loadams made their first contribution in https://github.com/huggingface/accelerate/pull/3196

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.0.1...v1.1.0
</Release>

<Release version="v1.0.1" date="October 12, 2024" published="2024-10-12T03:01:13.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.0.1">
## v1.0.1: Bugfix

## Bugfixes

* Fixes an issue where the `auto` values were no longer being parsed when using [deepspeed](https://github.com/huggingface/accelerate/pull/3143)
* Fixes a broken test in the deepspeed tests related to the [auto values](https://github.com/huggingface/accelerate/pull/3145)

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v1.0.0...v1.0.1
</Release>

<Release version="v1.0.0" date="October 7, 2024" published="2024-10-07T15:42:18.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v1.0.0">
## Accelerate 1.0.0 is here!

## 🚀 Accelerate 1.0 🚀 

With `accelerate` 1.0, we are officially declaring that the core parts of the API are now "stable" and ready for the future of distributed training and PyTorch. In these release notes, we first cover the major breaking changes so you can get your code fixed, followed by what is new between v0.34.0 and v1.0.0.

To read more, check out our official blog [here](https://huggingface.co/blog/accelerate-v1)

## Migration assistance

* Passing in `dispatch_batches`, `split_batches`, `even_batches`, and `use_seedable_sampler` to the `Accelerator()` should now be handled by creating an `accelerate.utils.DataLoaderConfiguration()` and passing this to the `Accelerator()` instead (`Accelerator(dataloader_config=DataLoaderConfiguration(...))`)
* `Accelerator().use_fp16` and `AcceleratorState().use_fp16` have been removed; this should be replaced by checking `accelerator.mixed_precision == "fp16"`
* `Accelerator().autocast()` no longer accepts a `cache_enabled` argument. Instead, an `AutocastKwargs()` instance should be used which handles this flag (among others) passing it to the `Accelerator` (`Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)])`)
* `accelerate.utils.is_tpu_available` should be replaced with `accelerate.utils.is_torch_xla_available`
* `accelerate.utils.modeling.shard_checkpoint` should be replaced with `split_torch_state_dict_into_shards` from the `huggingface_hub` library
* `accelerate.tqdm.tqdm()` no longer accepts `True`/`False` as the first argument, and instead, `main_process_only` should be passed in as a named argument

## Multiple Model DeepSpeed Support

After many requests, we finally have multiple-model DeepSpeed support in Accelerate (though it is still quite early). Read the full tutorial [here](https://huggingface.co/docs/accelerate/v1.0.0/en/usage_guides/deepspeed_multiple_model#using-multiple-models-with-deepspeed); in essence:

When using multiple models, a DeepSpeed plugin should be created for each model (and, as a result, a separate config for each). A few examples are below:

### Knowledge distillation

(Here only one model, the student under ZeRO-2, is trained, while the other, the teacher under ZeRO-3, is used for inference only.)

```python
from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")

deepspeed_plugins = {"student": zero2_plugin, "teacher": zero3_plugin}


accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)
```

To select which plugin should be used at a given time (i.e., when calling `prepare`), we call `accelerator.state.select_deepspeed_plugin("name")`. The first plugin is active by default:

```python
accelerator.state.select_deepspeed_plugin("student")
student_model, optimizer, scheduler, train_dataloader = ...
student_model, optimizer, scheduler, train_dataloader = accelerator.prepare(student_model, optimizer, scheduler, train_dataloader)

accelerator.state.select_deepspeed_plugin("teacher") # This will automatically enable zero init
teacher_model = AutoModel.from_pretrained(...)
teacher_model = accelerator.prepare(teacher_model)
```

### Multiple disjoint models

For disjoint models, a separate accelerator should be used for each model, and each accelerator's `.backward()` should be called on its own loss:

```python
for batch in dl:
    outputs1 = first_model(**batch)
    first_accelerator.backward(outputs1.loss)
    first_optimizer.step()
    first_scheduler.step()
    first_optimizer.zero_grad()
    
    outputs2 = second_model(**batch)
    second_accelerator.backward(outputs2.loss)
    second_optimizer.step()
    second_scheduler.step()
    second_optimizer.zero_grad()
```

## FP8

We've enabled MS-AMP support for everything except FSDP. At this time we are not moving forward with FSDP support for MS-AMP, due to design issues between the two libraries that prevent them from interoperating easily.

## FSDP
* Fixed FSDP auto_wrap using characters instead of full str for layers
* Re-enable setting state dict type manually

## Big Modeling
* Removed cpu restriction for bnb training

## What's Changed
* Fix FSDP auto_wrap using characters instead of full str for layers by @muellerzr in https://github.com/huggingface/accelerate/pull/3075
* Allow DataLoaderAdapter subclasses to be pickled by implementing `__reduce__` by @byi8220 in https://github.com/huggingface/accelerate/pull/3074
* Fix three typos in src/accelerate/data_loader.py by @xiabingquan in https://github.com/huggingface/accelerate/pull/3082
* Re-enable setting state dict type by @muellerzr in https://github.com/huggingface/accelerate/pull/3084
* Support sequential cpu offloading with torchao quantized tensors by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3085
* fix bug in `_get_named_modules` by @faaany in https://github.com/huggingface/accelerate/pull/3052
* use the correct available memory API for XPU by @faaany in https://github.com/huggingface/accelerate/pull/3076
* fix `skip_keys` usage in forward hooks by @152334H in https://github.com/huggingface/accelerate/pull/3088
* Update README.md to include distributed image generation gist by @sayakpaul in https://github.com/huggingface/accelerate/pull/3077
* MAINT: Upgrade ruff to v0.6.4 by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3095
* Revert "Enable Unwrapping for Model State Dicts (FSDP)" by @SunMarc in https://github.com/huggingface/accelerate/pull/3096
* MS-AMP support (w/o FSDP) by @muellerzr in https://github.com/huggingface/accelerate/pull/3093
* [docs] DataLoaderConfiguration docstring by @stevhliu in https://github.com/huggingface/accelerate/pull/3103
* MAINT: Permission for GH token in stale.yml by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3102
* [docs] Doc sprint by @stevhliu in https://github.com/huggingface/accelerate/pull/3099
* Update image ref for docs by @muellerzr in https://github.com/huggingface/accelerate/pull/3105
* No more t5 by @muellerzr in https://github.com/huggingface/accelerate/pull/3107
* [docs] More docstrings by @stevhliu in https://github.com/huggingface/accelerate/pull/3108
* 🚨🚨🚨 The Great Deprecation 🚨🚨🚨 by @muellerzr in https://github.com/huggingface/accelerate/pull/3098
* POC: multiple model/configuration DeepSpeed support by @muellerzr in https://github.com/huggingface/accelerate/pull/3097
* Fixup test_sync w/ deprecated stuff by @muellerzr in https://github.com/huggingface/accelerate/pull/3109
* Switch to XLA instead of TPU by @SunMarc in https://github.com/huggingface/accelerate/pull/3118
* [tests] skip pippy tests for XPU  by @faaany in https://github.com/huggingface/accelerate/pull/3119
* Fixup multiple model DS tests by @muellerzr in https://github.com/huggingface/accelerate/pull/3131
* remove cpu restriction for bnb training by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3062
* fix deprecated `torch.cuda.amp.GradScaler` FutureWarning for pytorch 2.4+ by @Mon-ius in https://github.com/huggingface/accelerate/pull/3132
* 🐛 [HotFix] Handle Profiler Activities Based on PyTorch Version by @yhna940 in https://github.com/huggingface/accelerate/pull/3136
* only move model to device when model is in cpu and target device is xpu by @faaany in https://github.com/huggingface/accelerate/pull/3133
* fix tip brackets typo by @davanstrien in https://github.com/huggingface/accelerate/pull/3129
* typo of "scalar" instead of "scaler" by @tonyzhaozh in https://github.com/huggingface/accelerate/pull/3116
* MNT Permission for PRs for GH token in stale.yml by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3112

## New Contributors
* @xiabingquan made their first contribution in https://github.com/huggingface/accelerate/pull/3082
* @a-r-r-o-w made their first contribution in https://github.com/huggingface/accelerate/pull/3085
* @152334H made their first contribution in https://github.com/huggingface/accelerate/pull/3088
* @sayakpaul made their first contribution in https://github.com/huggingface/accelerate/pull/3077
* @Mon-ius made their first contribution in https://github.com/huggingface/accelerate/pull/3132
* @davanstrien made their first contribution in https://github.com/huggingface/accelerate/pull/3129
* @tonyzhaozh made their first contribution in https://github.com/huggingface/accelerate/pull/3116

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v0.34.2...v1.0.0
</Release>

<Release version="v0.34.1" date="September 5, 2024" published="2024-09-05T15:36:16.000Z" url="https://github.com/huggingface/accelerate/releases/tag/v0.34.1">
## v0.34.1 Patchfix

## Bug fixes
* Fixes an issue where processed `DataLoaders` could no longer be pickled in #3074 thanks to @byi8220 
* Fixes an issue when using FSDP where `default_transformers_cls_names_to_wrap` would separate `_no_split_modules` by characters instead of keeping it as a list of layer names in #3075 

**Full Changelog**: https://github.com/huggingface/accelerate/compare/v0.34.0...v0.34.1
</Release>

<Pagination page="1" total-pages="4" total-items="71" next="https://releases.sh/hugging-face/accelerate.md?page=2" />
