Hugging Face/Accelerate

Accelerate

Sun

Mon

Tue

Wed

Thu

Fri

Sat

JunJulAugSepOctNovDecJanFebMarAprMay

Less

Releases1Avg0/wkVersionsv1.13.0

Highlights All Releases

Mar 4, 2026

v1.13.0: Neuron support, IPEX removal, and distributed training fixes

↗

v1.13.0

AWS Neuron support

We now have support for AWS Neuron (Trainium/Inferentia) devices. Thanks @michaelbenayoun for adding this.

Neuron integration by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3935

XPU Improvements

We've removed IPEX dependency and improved device-agnostic code for XPU.

using spawn instead of fork for XPU device by @kaixuanliu in https://github.com/huggingface/accelerate/pull/3884
Remove ipex by @yao-matrix in https://github.com/huggingface/accelerate/pull/3883
enhance new codes to XPU, and make them be device agnostic by @yao-matrix in https://github.com/huggingface/accelerate/pull/3890
Fix KMP_AFFINITY incorrectly set for non-CPU training by @hexfaker in https://github.com/huggingface/accelerate/pull/3912

FSDP2 Improvements

We've added a bunch of important fixes for FSDP2 users: upcasting only grad-requiring params, better tied embedding errors, DCP optimizer loading, bf16 optimizer step crash fix, and torch < 2.7.0 compatibility.

Upcast FSDP2 parameters only if requires_grad by @ojh31 in https://github.com/huggingface/accelerate/pull/3848
Fix FSDP2 tied embedding errors with targeted ValueError guidance by @amanzoni1 in https://github.com/huggingface/accelerate/pull/3878
bug: fsdp cannot load optimizer state using dcp by @flymin in https://github.com/huggingface/accelerate/pull/3904
fix crash in optimizer.step when fsdp2 is enabled and model is bfloat16 by @sywangyi in https://github.com/huggingface/accelerate/pull/3905
Fix FSDP2 crash with ignored_params on torch < 2.7.0 by @Mr-Neutr0n in https://github.com/huggingface/accelerate/pull/3924

DeepSpeed Sequence Parallelism

We've added several fixes to the DeepSpeed + Sequence Parallelism integration introduced in v1.12.0, including evaluation support during SP training and proper process group handling.

[SP] fix loss computation example by @kashif in https://github.com/huggingface/accelerate/pull/3858
[SP and CP] error out if both CP and SP enabled by @kashif in https://github.com/huggingface/accelerate/pull/3862
DeepSpeed has its own process group by @kashif in https://github.com/huggingface/accelerate/pull/3916
[Deepspeed] skip device mesh creation when deepspeed and sp_size >1 by @kashif in https://github.com/huggingface/accelerate/pull/3914
Enable evaluation during deepspeed Sequence Parallel by @jp1924 in https://github.com/huggingface/accelerate/pull/3917

FP8

We've enhanced FP8 training. Thanks @shimizust for fixing torchao support.

Fix FP8 torchao default config with padding and FSDP2 all-gather support by @shimizust in https://github.com/huggingface/accelerate/pull/3831
Fix execution with Transformer Engine by @ksivaman in https://github.com/huggingface/accelerate/pull/3852
add MS-AMP deprecation warnings by @neha222222 in https://github.com/huggingface/accelerate/pull/3857

Performance

Accelerate now imports faster by deferring heavy dependencies, and torch.compile hooks are disabled lazily.

Faster import by @SunMarc in https://github.com/huggingface/accelerate/pull/3953
lazy compile disable by @SunMarc in https://github.com/huggingface/accelerate/pull/3947
Disable hook compile by @SunMarc in https://github.com/huggingface/accelerate/pull/3888

Minor fixes

Allow non-Tensor values in a batch with dispatch_batches=True by @tomaarsen in https://github.com/huggingface/accelerate/pull/3850
fix module and optimizer parameter mismatch before prepare_tp_ by @naomili0924 in https://github.com/huggingface/accelerate/pull/3845
Fix KeyError in extract_model_from_parallel for partial torch.compile by @amanzoni1 in https://github.com/huggingface/accelerate/pull/3881
Fix hf_device_map device index comparison in prepare_model by @rezaqorbani in https://github.com/huggingface/accelerate/pull/3895
Fix StatefulDataLoader KeyError with num_workers > 0 by @veeceey in https://github.com/huggingface/accelerate/pull/3931
Fix stateful dataloader DDP by @SunMarc in https://github.com/huggingface/accelerate/pull/3952
Fix: Remove duplicate W&B initialization in offline mode by @shantanugupta2004 in https://github.com/huggingface/accelerate/pull/3886
Avoid using nvidia-smi on a CPU-only Colab instance by @FlorianVal in https://github.com/huggingface/accelerate/pull/3872
Fix logging logic when in_order is set to True by @yuxinyuan in https://github.com/huggingface/accelerate/pull/3280
Fix cpu offload check by @SunMarc in https://github.com/huggingface/accelerate/pull/3946
fix bug when both cpu_ram_efficient_loading and cpu_offload are enabled by @kaixuanliu in https://github.com/huggingface/accelerate/pull/3910
Fix async compatibility across python versions by @SunMarc in https://github.com/huggingface/accelerate/pull/3901
fix tp only bug by @sywangyi in https://github.com/huggingface/accelerate/pull/3908
fix parallelism_config None error by @jp1924 in https://github.com/huggingface/accelerate/pull/3927
Np parall fix by @sywangyi in https://github.com/huggingface/accelerate/pull/3900
change the default value of fsdp_min_num_params to int by @CodeMan62 in https://github.com/huggingface/accelerate/pull/3902
Fix mutable default in Megatron init and IndexError on empty ModuleList by @jashshah999 in https://github.com/huggingface/accelerate/pull/3944
Prepare TP fix by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3945
feat: added fine tuning example focused on TPUs by @tengomucho in https://github.com/huggingface/accelerate/pull/3847
Remove 8bit force hook for bnb by @SunMarc in https://github.com/huggingface/accelerate/pull/3907
docs: flag MS-AMP as deprecated in low-precision training guides by @ManasVardhan in https://github.com/huggingface/accelerate/pull/3929
fix: correct typo 'guarentee' to 'guarantee' by @thecaptain789 in https://github.com/huggingface/accelerate/pull/3922
Updating support of Megatron-LM by @pengdurice in https://github.com/huggingface/accelerate/pull/3842
Update support of Megatron-LM PR 2 by @pengdurice in https://github.com/huggingface/accelerate/pull/3887
Fix RNG state setting for HPU by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3936
fix: load the HPU RNG state by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3937

Nov 21, 2025

v1.12.0: Deepspeed Ulysses/ALST

↗

v1.12.0

Deepspeed Ulysses/ALST integration

Deepspeed Ulysses/ALST is an efficient way of training on long sequences by employing sequence parallelism and attention head parallelism. You can learn more about this technology in this paper https://arxiv.org/abs/2506.13996 or this deepspeed tutorial https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/.

To enable Deepspeed Ulysses, you first need to create ParallelismConfig and setting sp related args:

parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)

Then, you need to make sure to compute the correct loss as described on our docs

        ...
        losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
        good_tokens = (shift_labels != -100).view(-1).sum()
        good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
        total_loss = sum(
            losses_per_rank[rank] * good_tokens_per_rank[rank]
            for rank in range(sp_world_size)
            if good_tokens_per_rank[rank] > 0
        )
        total_good_tokens = sum(good_tokens_per_rank)
        loss = total_loss / max(total_good_tokens, 1)

Thanks @S1ro1 for starting this work and for @stas00 for finishing this work. Also thanks @kashif for adding docs and reviewing/testing this PR !

This feature will also be available in HF Trainer thanks for this PR from @stas00: https://github.com/huggingface/transformers/pull/41832

Minor changes

Remove warning for cpu_ram_efficient_loading by @SunMarc in https://github.com/huggingface/accelerate/pull/3816
update typo in bnb quantisation 4bit flag docstring by @hbraith in https://github.com/huggingface/accelerate/pull/3828
ArXiv -> HF Papers by @qgallouedec in https://github.com/huggingface/accelerate/pull/3834
Fix typo in broadcast_object_list docstring by @wsntxxn in https://github.com/huggingface/accelerate/pull/3823
[Bug] Update torch.optim.Optimizer parameter states after tensor parallelism by @naomili0924 in https://github.com/huggingface/accelerate/pull/3835
use self hosted runner by @SunMarc in https://github.com/huggingface/accelerate/pull/3841
device type helper by @kashif in https://github.com/huggingface/accelerate/pull/3843

New Contributors

@hbraith made their first contribution in https://github.com/huggingface/accelerate/pull/3828
@wsntxxn made their first contribution in https://github.com/huggingface/accelerate/pull/3823
@naomili0924 made their first contribution in https://github.com/huggingface/accelerate/pull/3835

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.11.0...v1.12.0

Oct 20, 2025

v1.11.0: TE MXFP8, FP16/BF16 with MPS, Python 3.10

↗

v1.11.0

TE MXFP8 support

We've added support for MXFP8 in our TransformerEngine integration. To use that, you need to set use_mxfp8_block_scaling in fp8_config. See nvidia docs [here]. (https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling)

Add support for TE MXFP8 recipe in accelerate by @pstjohn in https://github.com/huggingface/accelerate/pull/3688

FP16/BF16 Training for MPS devices

BF16 and FP16 support for MPS devices is finally here. You can now pass mixed_precision = "fp16" or "bf16" when training on a mac (fp16 requires torch 2.8 and bf16 requires torch 2.6)

Add bf16/fp16 support for amp with mps device by @SunMarc in https://github.com/huggingface/accelerate/pull/3373

FSDP updates

The following PRs add respectively support to ignored_params and no_sync() for FSDPv2:

feat: add ignored_params support for fsdp2 by @kmehant in https://github.com/huggingface/accelerate/pull/3731
fix: model.set_requires_gradient_sync(False) should be called to turn off gradient synchronization in FSDP2 by @EquationWalker in https://github.com/huggingface/accelerate/pull/3762

Mixed precision can now be passed as a dtype string from accelerate cli flag or fsdp_config in accelerate config file:

feat: allow mixed precision policy as dtype by @kmehant in https://github.com/huggingface/accelerate/pull/3751

Nd-parallel updates

Some minor updates concerning nd-parallelism.

Context Parallelism docs typos fixed by @sergiopaniego in https://github.com/huggingface/accelerate/pull/3761
Feat: add to_json by @S1ro1 in https://github.com/huggingface/accelerate/pull/3743
make torch_native_parallelism examples device agnostic by @yao-matrix in https://github.com/huggingface/accelerate/pull/3759
[ND Parallel] Update examples, cleanup by @S1ro1 in https://github.com/huggingface/accelerate/pull/3737

Bump to Python 3.10

We've dropped support for python 3.9 as it reached EOL in October.

Bump to python3.10 + update linter by @SunMarc in https://github.com/huggingface/accelerate/pull/3809

Lots of minor fixes:

fix: CPU RAM efficient loading for nd or HSDP parallelisms by @kmehant in https://github.com/huggingface/accelerate/pull/3740
xpu INT64 all_gather issue fixed in 2.9 by @yao-matrix in https://github.com/huggingface/accelerate/pull/3756
Specify device_ids in torch.distributed.barrier for PartialState by @qgallouedec in https://github.com/huggingface/accelerate/pull/3744
fix: specify device for process_tensor in example usage by @qgallouedec in https://github.com/huggingface/accelerate/pull/3755
Lower complexity of get_balanced_memory by adding a set by @SamuelBarryCS in https://github.com/huggingface/accelerate/pull/3776
Fix (skip) cuda cache flush when origin device is cpu and offloaded to meta by @Qubitium in https://github.com/huggingface/accelerate/pull/3796
Fix convert LayerNorm without bias to fp8 by @mjun0812 in https://github.com/huggingface/accelerate/pull/3725
Add optional typing by @cyyever in https://github.com/huggingface/accelerate/pull/3769
refactor: Use with in Accelerator.autocast()instead of __enter__() and __exit__() for more elegant style. by @EquationWalker in https://github.com/huggingface/accelerate/pull/3767
switch XPU ccl backend to torch-builtin xccl in test_zero3_integration by @yao-matrix in https://github.com/huggingface/accelerate/pull/3773
fix FSDP2 test case failure on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3771
Fix tests by @SunMarc in https://github.com/huggingface/accelerate/pull/3722
Protect import for device_mesh by @SunMarc in https://github.com/huggingface/accelerate/pull/3742
Fix SWANLAB_MODE by @SunMarc in https://github.com/huggingface/accelerate/pull/3808
Fix tracking swanlab by @SunMarc in https://github.com/huggingface/accelerate/pull/3810
refactor: nit change for get_parameters_from_modules (code debt) by @kmehant in https://github.com/huggingface/accelerate/pull/3815
Remove deprecated FindTiedParametersResult by @cyyever in https://github.com/huggingface/accelerate/pull/3786
Add optional typing by @cyyever in https://github.com/huggingface/accelerate/pull/3769
remove mlflow from testing by @SunMarc in https://github.com/huggingface/accelerate/pull/3783
enable 2 model hook ut cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3774
Added Tip for better rendering by @sergiopaniego in https://github.com/huggingface/accelerate/pull/3781
Fix typos by @cyyever in https://github.com/huggingface/accelerate/pull/3753
fix: torch_npu import error in some envs by @yanyongyu in https://github.com/huggingface/accelerate/pull/3764
Fix: typo makes tests fail by @S1ro1 in https://github.com/huggingface/accelerate/pull/3765
fix Muti node CUDA error: invalid device ordinal #3775 by @RicardoDominguez in https://github.com/huggingface/accelerate/pull/3779
use reset_peak_memory_stats on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3772

New Contributors

@mjun0812 made their first contribution in https://github.com/huggingface/accelerate/pull/3725
@sergiopaniego made their first contribution in https://github.com/huggingface/accelerate/pull/3761
@EquationWalker made their first contribution in https://github.com/huggingface/accelerate/pull/3762
@yanyongyu made their first contribution in https://github.com/huggingface/accelerate/pull/3764
@RicardoDominguez made their first contribution in https://github.com/huggingface/accelerate/pull/3779
@SamuelBarryCS made their first contribution in https://github.com/huggingface/accelerate/pull/3776
@Qubitium made their first contribution in https://github.com/huggingface/accelerate/pull/3796

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.10.1...v1.11.0

Aug 25, 2025

v1.10.1: Patchfix

↗

v1.10.1

Feat: add to_json by @S1ro1 in https://github.com/huggingface/accelerate/pull/3743
Protect import for device_mesh by @SunMarc in https://github.com/huggingface/accelerate/pull/3742.

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.10.0...v1.10.1

Aug 7, 2025

v1.10.0: N-D Parallelism

↗

v1.10.0

N-D Parallelism

Training large models across multiple GPUs can be complex, especially when combining different parallelism strategies (e.g TP, CP, DP). To simplify this process, we've collaborated with Axolotl to introduce an easy-to-use integration that allows you to apply any combination of parallelism strategies directly in your training script. Just pass a ParallelismConfig specifying the size of each parallelism type—it's that simple. Learn more about how it works in our latest blogpost.

parallelism_config = ParallelismConfig(
    dp_shard_size=2,
    dp_replicate_size=2,
    cp_size=2,
    tp_size=2,
)
accelerator = Accelerator(
    parallelism_config=parallelism_config,
   ...
)
model = AutoModelForCausalLM.from_pretrained("your-model-name", device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)

Parallelism config + TP + HSDP + BYODM (Bring Your Own Device Mesh) by @SalmanMohammadi in https://github.com/huggingface/accelerate/pull/3682
Feat: context parallel v2.0 by @S1ro1 in https://github.com/huggingface/accelerate/pull/3700
set default submesh_tp_size to prevent unset local variable error by @winglian in https://github.com/huggingface/accelerate/pull/3687
Add Parallelism getter property to Accelerator class by @WoosungMyung in https://github.com/huggingface/accelerate/pull/3703
Fix: prepare works even if nothing except tp specified (rare) by @S1ro1 in https://github.com/huggingface/accelerate/pull/3707
Set parallelism_config in constructor due to Trainer reset of State by @winglian in https://github.com/huggingface/accelerate/pull/3713
Fix: tp size wouldn't read from env by @S1ro1 in https://github.com/huggingface/accelerate/pull/3716
Remove ParallelismConfig from PartialState by @SunMarc in https://github.com/huggingface/accelerate/pull/3720

FSDP improvements

We've fixed ignored modules attribute. With this, it is now possible to train PEFT model that moe layers that contrains q_proj and v_proj parameters. This is especially important for fine-tuning gpt-oss model.

ENH: Allow FSDP ignored modules to be regex by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3698
TST Add test for FSDP ignored_modules as str by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3719

Minor improvements

feature: CpuOffload pre_forward don't attempt to move if already on device by @JoeGaffney in https://github.com/huggingface/accelerate/pull/3695
Fix: Ensure environment variable values are case-insensitive in Accelerate by @jp1924 in https://github.com/huggingface/accelerate/pull/3712
remove use_ipex by @SunMarc in https://github.com/huggingface/accelerate/pull/3721

New Contributors

@SalmanMohammadi made their first contribution in https://github.com/huggingface/accelerate/pull/3682
@WoosungMyung made their first contribution in https://github.com/huggingface/accelerate/pull/3703
@jp1924 made their first contribution in https://github.com/huggingface/accelerate/pull/3712
@JoeGaffney made their first contribution in https://github.com/huggingface/accelerate/pull/3695

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.9.0...v1.10.0

Jul 16, 2025

v1.9.0: Trackio support, Model loading speedup, Minor distributed improvements

↗

v1.9.0

Trackio tracker support

We've added support for a trackio, lightweight, 💯 free experiment tracking Python library built on top of 🤗 Datasets and Spaces.

Main features are:

Local-first design: dashboard runs locally by default. You can also host it on Spaces by specifying a space_id.
Persists logs locally (or in a private Hugging Face Dataset)
Visualize experiments with a Gradio dashboard locally (or on Hugging Face Spaces)
Everything here, including hosting on Hugging Faces, is free!

To use it with accelerate, you need to set log_with and initialize the trackers

accelerator = Accelerator(log_with="trackio")
config={"learning_rate": 0.001, "batch_size": 32}
# init_kwargs in order to host the dashboard on spaces
init_kwargs = {"trackio": {"space_id": "hf_username/space_name"}
accelerator.init_trackers("example_project", config=config, init_kwargs=init_kwargs})

Thanks @pcuenca for the integration !

trackio by @pcuenca in https://github.com/huggingface/accelerate/pull/3669

Model loading speedup when relying `set_module_tensor_to_device`

Setting tensor while clearing cache is very slow, so we added clear_device option to disable it. Another small optimization is using non_blocking everywhere and syncing just before returning control to the user. This makes the loading slightly faster.

Speedup model loading by 4-5x in Diffusers ⚡ by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3674

FDSP, Deepspeed, FP8 minor improvements

Add support for e5e2 and default to hybrid when launcher is used by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3640
Fix FP8 tests, enable FP8 to be used without direct Accelerator() configuring by @pstjohn in https://github.com/huggingface/accelerate/pull/3677
Bunch of FSDP improvements by @S1ro1 in https://github.com/huggingface/accelerate/pull/3671
Fix: properly error when DDP + Dtensor model by @S1ro1 in https://github.com/huggingface/accelerate/pull/3629
Fix fsdp2 example typo by @shimizust in https://github.com/huggingface/accelerate/pull/3657
Added a check in no_sync() to avoid errors when using deepspeed zero2/3 by @xliu0105 in https://github.com/huggingface/accelerate/pull/3656

🚨🚨🚨 Breaking changes 🚨🚨🚨

find_executable_batch_size() will no longer halves the batch after every OOM. Instead, we will multiply the batch size by 0.9. This should help user not waste gpu capacity.

“Stop Halving My Batch!” · Default back-off 0.5 → 0.9 by @SunMarc in https://github.com/huggingface/accelerate/pull/3684

What's Changed

[typo] shards instead of shard by @SunMarc in https://github.com/huggingface/accelerate/pull/3645
Docs: Fix typos in gradient accumulation guide by @kilavvy in https://github.com/huggingface/accelerate/pull/3649
xpu enablement on left cases by @yao-matrix in https://github.com/huggingface/accelerate/pull/3654
unpin datasets in examples requirements by @SunMarc in https://github.com/huggingface/accelerate/pull/3681
fix: wandb config not saved in offline mode by @ved1beta in https://github.com/huggingface/accelerate/pull/3648
accelerate/data_loader.py: do not yield if the base_dataloader is empty by @0xnightwind in https://github.com/huggingface/accelerate/pull/3659
warn for invalid keys by @ved1beta in https://github.com/huggingface/accelerate/pull/3613
Update Gaudi runner image to latest SynapseAI and enable previously disabled tests by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3653

New Contributors

@kilavvy made their first contribution in https://github.com/huggingface/accelerate/pull/3649
@shimizust made their first contribution in https://github.com/huggingface/accelerate/pull/3657
@xliu0105 made their first contribution in https://github.com/huggingface/accelerate/pull/3656
@0xnightwind made their first contribution in https://github.com/huggingface/accelerate/pull/3659

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.8.1...v1.9.0

Jun 20, 2025

v1.8.1: Patchfix

↗

v1.8.1

Add support for e5e2 and default to hybrid when launcher is used by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3640
shards by @SunMarc in https://github.com/huggingface/accelerate/pull/3645

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.8.0...v1.8.1

Jun 19, 2025

v1.8.0: FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation

↗

v1.8.0

FSDPv2 refactor + FP8 support

We've simplified how to prepare FSDPv2 models, as there were too many ways to compose FSDP2 with other features (e.g., FP8, torch.compile, activation checkpointing, etc.). Although the setup is now more restrictive, it leads to fewer errors and a more performant user experience. We’ve also added support for FP8. You can read about the results here. Thanks to @S1ro1 for this contribution!

[FSDP2] Refactor + FP8 by @S1ro1 in https://github.com/huggingface/accelerate/pull/3585

Faster Distributed Training on Intel CPUs

We updated the CCL_WORKER_COUNT variable and added KMP parameters for Intel CPU users. This significantly improves distributed training performance (e.g., Tensor Parallelism), with up to a 40% speed-up on Intel 4th Gen Xeon when training transformer TP models.

Set ccl and KMP param in simple launch by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3575

Regional Compilation for DeepSpeed

We added support for regional compilation with the DeepSpeed engine. DeepSpeed’s .compile() modifies models in-place using torch.nn.Module.compile(...), rather than the out-of-place torch.compile(...), so we had to account for that. Thanks @IlyasMoutawwakil for this feature!

Fix deepspeed regional compilation by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3609

ipex.optimize deprecation

ipex.optimize is being deprecated. Most optimizations have been upstreamed to PyTorch, and future improvements will land there directly. For users without PyTorch 2.8, we’ll continue to rely on IPEX for now.

remove ipex.optimize in accelerate by @yao-matrix in https://github.com/huggingface/accelerate/pull/3608

Better XPU Support

We've greatly expanded and stabilized support for Intel XPUs:

enable fsdp2 benchmark on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3590
enable big_model_inference on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3595
enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU by @yao-matrix in
enable test_cli & test_example cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3578
enable torchao and pippy test cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3599
enable regional_compilation benchmark on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3592
fix xpu 8bit value loading by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3623
add device-agnostic GradScaler by @yao-matrix in https://github.com/huggingface/accelerate/pull/3588
add xpu support in TorchTensorParallelPlugin by @yao-matrix in https://github.com/huggingface/accelerate/pull/3627

Trackers

We've added support for SwanLab as an experiment tracking backend. Huge thanks to @ShaohonChen for this contribution ! We also deferred all tracker initializations to prevent premature setup of distributed environments.

Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in https://github.com/huggingface/accelerate/pull/3605
Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup by @yuanjua in https://github.com/huggingface/accelerate/pull/3581

What's Changed

Fix bf16 training with TP by @SunMarc in https://github.com/huggingface/accelerate/pull/3610
better handle FP8 with and without deepspeed by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3611
Update Gaudi Runners by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3593
goodbye torch_ccl by @yao-matrix in https://github.com/huggingface/accelerate/pull/3580
Add support for standalone mode when default port is occupied on single node by @laitifranz in https://github.com/huggingface/accelerate/pull/3576
Resolve logger warnings by @emmanuel-ferdman in https://github.com/huggingface/accelerate/pull/3582
Add kwargs to optimizer, scheduler and dataloader using function accelerator().load_state() by @luiz0992 in https://github.com/huggingface/accelerate/pull/3540
[docs] no hard-coded cuda in the ddp documentation by @faaany in https://github.com/huggingface/accelerate/pull/3589
change to use torch.device by @yao-matrix in https://github.com/huggingface/accelerate/pull/3594
Fix: list object has no attribute keys by @S1ro1 in https://github.com/huggingface/accelerate/pull/3603
Update Gaudi Runners by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3593
Fix bf16 training with TP by @SunMarc in https://github.com/huggingface/accelerate/pull/3610
better handle FP8 with and without deepspeed by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3611
Remove device_count for TPU launcher to avoid initializing runtime by @sorgfresser in https://github.com/huggingface/accelerate/pull/3587
Fix missing te.LayerNorm in intel_transformer_engine by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3619
Add fp8_e5m2 support in dtype_byte_size by @SunMarc in https://github.com/huggingface/accelerate/pull/3625
[Deepspeed] deepspeed auto grad accum by @kashif in https://github.com/huggingface/accelerate/pull/3630
Remove hardcoded cuda from fsdpv2 by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3631
Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in https://github.com/huggingface/accelerate/pull/3605
Fix Typos in Documentation and Comments by @leopardracer in https://github.com/huggingface/accelerate/pull/3621
feat: use datasets.IterableDataset shard if possible by @SunMarc in https://github.com/huggingface/accelerate/pull/3635
[DeepSpeed] sync gradient accum steps from deepspeed plugin by @kashif in https://github.com/huggingface/accelerate/pull/3632
Feat: add cpu offload by @S1ro1 in https://github.com/huggingface/accelerate/pull/3636
Fix: correct labels for fsdp2 examples by @S1ro1 in https://github.com/huggingface/accelerate/pull/3637
fix grad acc deepspeed by @SunMarc in https://github.com/huggingface/accelerate/pull/3638

New Contributors

@laitifranz made their first contribution in https://github.com/huggingface/accelerate/pull/3576
@emmanuel-ferdman made their first contribution in https://github.com/huggingface/accelerate/pull/3582
@yuanjua made their first contribution in https://github.com/huggingface/accelerate/pull/3581
@sorgfresser made their first contribution in https://github.com/huggingface/accelerate/pull/3587
@ShaohonChen made their first contribution in https://github.com/huggingface/accelerate/pull/3605
@leopardracer made their first contribution in https://github.com/huggingface/accelerate/pull/3621

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.7.0...v1.8.0

May 15, 2025

v1.7.0 : Regional compilation, Layerwise casting hook, FSDPv2 + QLoRA

↗

v1.7.0

Regional compilation

Instead of compiling the entire model at once, regional compilation targets repeated blocks (such as decoder layers) first. This allows the compiler to cache and reuse optimized code for subsequent blocks, significantly reducing the cold start compilation time typically seen during the first inference. Thanks @IlyasMoutawwakil for the feature ! You can view the full benchmark here, and check out our updated compilation guide for more details!

To enable this feature, set use_regional_compilation=True in the TorchDynamoPlugin configuration.

# Configure the compilation backend
dynamo_plugin = TorchDynamoPlugin(
    use_regional_compilation=True,
    ... # other parameters
)
# Initialize accelerator with the plugin
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)
# This will apply compile_regions to your model
model = accelerator.prepare(model)

Layerwise casting hook

We've introduced a new hook that enables per-layer upcasting and downcasting (e.g., for Linear layers) during inference. This allows users to run models with separate storage and compute dtypes, resulting in memory savings. The concept was first implemented in diffusers, where downcasting models to FP8 proved effective without major quality degradation. Contributed by @sayakpaul in https://github.com/huggingface/accelerate/pull/3427

model = ....
storage_dtype = torch.float8_e4m3fn
compute_dtype = torch.bfloat16
attach_layerwise_casting_hooks(
            model,
            storage_dtype=storage_dtype,
            compute_dtype=compute_dtype,
        )

Better FSDP2 support

This release includes numerous new features and bug fixes. Notably, we’ve added support for FULL_STATE_DICT, a widely used option in FSDP, now enabling .save_pretrained() in transformers to work with FSDP2 wrapped models. QLoRA training is now supported as well but more testing is needed. We have also resolved a backend issue related to parameter offloading to CPU. Additionally, a significant memory spike that occurred when cpu_ram_efficient_loading=True was enabled has been fixed. Several other minor improvements and fixes are also included—see the What’s Changed section for full details.

FULL_STATE_DICT have been enabled by @S1ro1 in https://github.com/huggingface/accelerate/pull/3527
QLoRA support by @winglian in https://github.com/huggingface/accelerate/pull/3546
set backend correctly for CUDA+FSDP2+cpu-offload in https://github.com/huggingface/accelerate/pull/3574
memory spike fixed when using cpu_ram_efficient_loading=True by @S1ro1 in https://github.com/huggingface/accelerate/pull/3482

Better HPU support:

We have added a documentation for Intel Gaudi hardware ! The support is already available since v1.5.0 through this PR.

Add the HPU into accelerate config by @yuanwu2017 in https://github.com/huggingface/accelerate/pull/3495
Add Gaudi doc by @regisss in https://github.com/huggingface/accelerate/pull/3537

Torch.compile breaking change for `dynamic` argument

We've updated the logic for setting self.dynamic to explicitly preserve None rather than defaulting to False when the USE_DYNAMIC environment variable is unset. This change aligns the behavior with the PyTorch documentation for torch.compile. Thanks to @yafshar for contributing this improvement in #3567.

What's Changed

use device agnostic torch.OutOfMemoryError from pytorch 2.5.0 by @yao-matrix in https://github.com/huggingface/accelerate/pull/3475
Adds style bot by @zach-huggingface in https://github.com/huggingface/accelerate/pull/3478
Fix a tiny typo in low_precision_training guide by @sadra-barikbin in https://github.com/huggingface/accelerate/pull/3488
Fix check_tied_parameters_in_config for multimodal models by @SunMarc in https://github.com/huggingface/accelerate/pull/3479
Don't create new param for TorchAO sequential offloading due to weak BC guarantees by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3444
add support for custom function for reducing the batch size by @winglian in https://github.com/huggingface/accelerate/pull/3071
Fix fp8 deepspeed config by @SunMarc in https://github.com/huggingface/accelerate/pull/3492
fix warning error by @faaany in https://github.com/huggingface/accelerate/pull/3491
[bug] unsafe_serialization option in "merge-weights" doesn't work by @cyr0930 in https://github.com/huggingface/accelerate/pull/3496
Add the HPU into accelerate config by @yuanwu2017 in https://github.com/huggingface/accelerate/pull/3495
Use torch.distributed.checkpoint.state_dict.set_model_state_dict in load_checkpoint_in_model by @ringohoffman in https://github.com/huggingface/accelerate/pull/3432
nit: needed sanity checks for fsdp2 by @kmehant in https://github.com/huggingface/accelerate/pull/3499
(Part 1) fix: make TP training compatible with new transformers by @kmehant in https://github.com/huggingface/accelerate/pull/3457
Fix deepspeed tests by @S1ro1 in https://github.com/huggingface/accelerate/pull/3503
Add FP8 runners + tweak building FP8 image by @zach-huggingface in https://github.com/huggingface/accelerate/pull/3493
fix: apply torchfix to set weights_only=True by @bzhong-solink in https://github.com/huggingface/accelerate/pull/3497
Fix: require transformers version for tp tests by @S1ro1 in https://github.com/huggingface/accelerate/pull/3504
Remove deprecated PyTorch/XLA APIs by @zpcore in https://github.com/huggingface/accelerate/pull/3484
Fix cache issue by upgrading github actions version by @SunMarc in https://github.com/huggingface/accelerate/pull/3513
[Feat] Layerwise casting hook by @sayakpaul in https://github.com/huggingface/accelerate/pull/3427
Add torchao to FP8 error message by @jphme in https://github.com/huggingface/accelerate/pull/3514
Fix unwanted cuda init due to torchao by @SunMarc in https://github.com/huggingface/accelerate/pull/3530
Solve link error in internal_mechanism documentation (#3506) by @alvaro-mazcu in https://github.com/huggingface/accelerate/pull/3507
[FSDP2] Enable FULL_STATE_DICT by @S1ro1 in https://github.com/huggingface/accelerate/pull/3527
[FSDP2] Fix memory spike with cpu_ram_efficient_loading=True by @S1ro1 in https://github.com/huggingface/accelerate/pull/3482
[FSDP2] Issues in Wrap Policy and Mixed Precision by @jhliu17 in https://github.com/huggingface/accelerate/pull/3528
Fix logic in accelerator.prepare + IPEX for 2+ nn.Models and/or optim.Optimizers by @mariusarvinte in https://github.com/huggingface/accelerate/pull/3517
Update Docker builds to align with CI requirements by @matthewdouglas in https://github.com/huggingface/accelerate/pull/3532
Fix CI due to missing package by @SunMarc in https://github.com/huggingface/accelerate/pull/3535
Update big_modeling.md for layerwise casting by @sayakpaul in https://github.com/huggingface/accelerate/pull/3548
[FSDP2] Fix: "..." is not a buffer or a paremeter by @S1ro1 in
fix notebook_launcher for Colab TPU compatibility. by @BogdanDidenko in https://github.com/huggingface/accelerate/pull/3541
Fix typos by @omahs in https://github.com/huggingface/accelerate/pull/3549
Dynamo regional compilation by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3529
add support for port 0 auto-selection in multi-GPU environments by @hellobiondi in https://github.com/huggingface/accelerate/pull/3501
Fix the issue where set_epoch does not take effect. by @hongjx175 in https://github.com/huggingface/accelerate/pull/3556
[FSDP2] Fix casting in _cast_and_contiguous by @dlvp in https://github.com/huggingface/accelerate/pull/3559
[FSDP] Make env var and dataclass flag consistent for cpu_ram_efficient_loading by @SumanthRH in https://github.com/huggingface/accelerate/pull/3307
canonicalize optimized names before fixing optimizer in fdsp2 by @pstjohn in https://github.com/huggingface/accelerate/pull/3560
[docs] update deepspeed config path by @faaany in https://github.com/huggingface/accelerate/pull/3561
preserve parameter keys when removing prefix by @mjkvaak-amd in https://github.com/huggingface/accelerate/pull/3564
Add Gaudi doc by @regisss in https://github.com/huggingface/accelerate/pull/3537
Update dynamic env handling to preserve None when USE_DYNAMIC is unset by @yafshar in https://github.com/huggingface/accelerate/pull/3567
add a synchronize call for xpu in _gpu_gather by @faaany in https://github.com/huggingface/accelerate/pull/3563
simplify model.to logic by @yao-matrix in https://github.com/huggingface/accelerate/pull/3562
tune env command output by @yao-matrix in https://github.com/huggingface/accelerate/pull/3570
Add regional compilation to cli tools and env vars by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3572
reenable FSDP2+qlora support by @winglian in https://github.com/huggingface/accelerate/pull/3546
Fix prevent duplicate GPU usage in distributed processing by @ved1beta in https://github.com/huggingface/accelerate/pull/3526
set backend correctly for CUDA+FSDP2+cpu-offload by @SunMarc in https://github.com/huggingface/accelerate/pull/3574
enable test_dispatch_model_tied_weights_memory_with_nested_offload_cpu on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3569

New Contributors

@zach-huggingface made their first contribution in https://github.com/huggingface/accelerate/pull/3478
@sadra-barikbin made their first contribution in https://github.com/huggingface/accelerate/pull/3488
@ringohoffman made their first contribution in https://github.com/huggingface/accelerate/pull/3432
@bzhong-solink made their first contribution in https://github.com/huggingface/accelerate/pull/3497
@zpcore made their first contribution in https://github.com/huggingface/accelerate/pull/3484
@jphme made their first contribution in https://github.com/huggingface/accelerate/pull/3514
@alvaro-mazcu made their first contribution in https://github.com/huggingface/accelerate/pull/3507
@jhliu17 made their first contribution in https://github.com/huggingface/accelerate/pull/3528
@BogdanDidenko made their first contribution in https://github.com/huggingface/accelerate/pull/3541
@hellobiondi made their first contribution in https://github.com/huggingface/accelerate/pull/3501
@hongjx175 made their first contribution in https://github.com/huggingface/accelerate/pull/3556
@dlvp made their first contribution in https://github.com/huggingface/accelerate/pull/3559
@pstjohn made their first contribution in https://github.com/huggingface/accelerate/pull/3560
@mjkvaak-amd made their first contribution in https://github.com/huggingface/accelerate/pull/3564
@yafshar made their first contribution in https://github.com/huggingface/accelerate/pull/3567
@ved1beta made their first contribution in https://github.com/huggingface/accelerate/pull/3526

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.6.0...v1.7.0

Apr 1, 2025

v1.6.0: FSDPv2, DeepSpeed TP and XCCL backend support

↗

v1.6.0

FSDPv2 support

This release introduces the support for FSDPv2 thanks to @S1ro1.

If you are using python code, you need to set fsdp_version=2 in FullyShardedDataParallelPlugin:

from accelerate import FullyShardedDataParallelPlugin, Accelerator

fsdp_plugin = FullyShardedDataParallelPlugin(
    fsdp_version=2
    # other options...
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)

If want to convert a YAML config that contains the FSDPv1 config to FSDPv2 one , use our conversion tool:

accelerate to-fsdp2 --config_file config.yaml --output_file new_config.yaml`

To learn more about the difference between FSDPv1 and FSDPv2, read the following documentation.

DeepSpeed TP support

We have added initial support for DeepSpeed + TP. Not many changes were required as the DeepSpeed APIs was already compatible. We only needed to make sure that the dataloader was compatible with TP and that we were able to save the TP weights. Thanks @inkcherry for the work ! https://github.com/huggingface/accelerate/pull/3390.

To use TP with deepspeed, you need to update the setting in the deepspeed config file by including tensor_parallel key:

    ....
    "tensor_parallel":{
      "autotp_size": ${autotp_size}
    },
   ...

More details in this deepspeed PR.

Support for XCCL distributed backend

We've added support for XCCL which is an Intel distributed backend which can be used with XPU devices. More details in this torch PR. Thanks @dvrogozh for the integration !

What's Changed

Add log_artifact, log_artifacts and log_figure capabilities to the MLflowTracker. by @luiz0992 in https://github.com/huggingface/accelerate/pull/3419
tensor parallel dataloder for deepspeed accelerator by @inkcherry in https://github.com/huggingface/accelerate/pull/3390
Fix prod issues by @muellerzr in https://github.com/huggingface/accelerate/pull/3441
Fix attribute issue with deepspeed tp by @SunMarc in https://github.com/huggingface/accelerate/pull/3443
Fixed typo in the multi node FSDP slurm example script by @JacobB33 in https://github.com/huggingface/accelerate/pull/3447
feat: Add no_ssh and slurm multinode launcher options for deepspeed by @hsmallbone in https://github.com/huggingface/accelerate/pull/3329
Fixup ao module filter func by @muellerzr in https://github.com/huggingface/accelerate/pull/3450
remove device index workaround on xpu since xpu supports integer device index as cuda now by @yao-matrix in https://github.com/huggingface/accelerate/pull/3448
enable 2 UT cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3445
Fix AMD GPU support with should_reduce_batch_size() by @cameronshinn in https://github.com/huggingface/accelerate/pull/3405
Fix device KeyError in tied_params_map by @dvrogozh in https://github.com/huggingface/accelerate/pull/3403
Initial FSDP2 support by @S1ro1 in https://github.com/huggingface/accelerate/pull/3394
Fix: clip grad norm in fsdp2 by @S1ro1 in https://github.com/huggingface/accelerate/pull/3465
Update @ by @muellerzr in https://github.com/huggingface/accelerate/pull/3466
Fix seeding of new generator for multi GPU by @albertcthomas in https://github.com/huggingface/accelerate/pull/3459
Fix get_balanced_memory for MPS by @booxter in https://github.com/huggingface/accelerate/pull/3464
Update CometMLTracker to allow re-using experiment by @Lothiraldan in https://github.com/huggingface/accelerate/pull/3328
Apply ruff py39 fixes by @cyyever in https://github.com/huggingface/accelerate/pull/3461
xpu: enable xccl distributed backend by @dvrogozh in https://github.com/huggingface/accelerate/pull/3401
Update ruff target-version to py39 and apply more fixes by @cyyever in https://github.com/huggingface/accelerate/pull/3470
[MLU] fix deepspeed dependency by @huismiling in https://github.com/huggingface/accelerate/pull/3472
remove use_xpu to fix ut issues, we don't need this since XPU is OOB … by @yao-matrix in https://github.com/huggingface/accelerate/pull/3460
Bump ruff to 0.11.2 by @cyyever in https://github.com/huggingface/accelerate/pull/3471

New Contributors

@luiz0992 made their first contribution in https://github.com/huggingface/accelerate/pull/3419
@inkcherry made their first contribution in https://github.com/huggingface/accelerate/pull/3390
@JacobB33 made their first contribution in https://github.com/huggingface/accelerate/pull/3447
@hsmallbone made their first contribution in https://github.com/huggingface/accelerate/pull/3329
@yao-matrix made their first contribution in https://github.com/huggingface/accelerate/pull/3448
@cameronshinn made their first contribution in https://github.com/huggingface/accelerate/pull/3405
@S1ro1 made their first contribution in https://github.com/huggingface/accelerate/pull/3394
@albertcthomas made their first contribution in https://github.com/huggingface/accelerate/pull/3459
@booxter made their first contribution in https://github.com/huggingface/accelerate/pull/3464
@Lothiraldan made their first contribution in https://github.com/huggingface/accelerate/pull/3328
@cyyever made their first contribution in https://github.com/huggingface/accelerate/pull/3461

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.5.2...v1.6.0

Mar 14, 2025

Patch: v1.5.2

↗

v1.5.2

Bug Fixes:

Fixed an issue with torch.get_default_device() requiring a higher version than what we support
Fixed a broken pytest import in prod

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.5.0...v1.5.2

Mar 12, 2025

v1.5.0: HPU support

↗

v1.5.0

HPU Support

Adds in HPU accelerator support for 🤗 Accelerate

What's Changed

[bug] fix device index bug for model training loaded with bitsandbytes by @faaany in https://github.com/huggingface/accelerate/pull/3408
[docs] add the missing import torch by @faaany in https://github.com/huggingface/accelerate/pull/3396
minor doc fixes by @nbroad1881 in https://github.com/huggingface/accelerate/pull/3365
fix: ensure CLI args take precedence over config file. by @cyr0930 in https://github.com/huggingface/accelerate/pull/3409
fix: Add device=torch.get_default_device() in torch.Generators by @saforem2 in https://github.com/huggingface/accelerate/pull/3420
Add Tecorigin SDAA accelerator support by @siqi654321 in https://github.com/huggingface/accelerate/pull/3330
fix typo : thier -> their by @hackty in https://github.com/huggingface/accelerate/pull/3423
Fix quality by @muellerzr in https://github.com/huggingface/accelerate/pull/3424
Distributed inference example for llava_next by @VladOS95-cyber in https://github.com/huggingface/accelerate/pull/3417
HPU support by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3378

New Contributors

@cyr0930 made their first contribution in https://github.com/huggingface/accelerate/pull/3409
@saforem2 made their first contribution in https://github.com/huggingface/accelerate/pull/3420
@siqi654321 made their first contribution in https://github.com/huggingface/accelerate/pull/3330
@hackty made their first contribution in https://github.com/huggingface/accelerate/pull/3423
@VladOS95-cyber made their first contribution in https://github.com/huggingface/accelerate/pull/3417
@IlyasMoutawwakil made their first contribution in https://github.com/huggingface/accelerate/pull/3378

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.4.0...v1.5.0

Feb 17, 2025

v1.4.0: `torchao` FP8, TP & dataLoader support, fix memory leak

↗

v1.4.0

`torchao` FP8, initial Tensor Parallel support, and memory leak fixes

`torchao` FP8

This release introduces a new FP8 API and brings in a new backend: torchao. To use, pass in AORecipeKwargs to the Accelerator while setting mixed_precision="fp8". This is initial support, as it matures we will incorporate more into it (such as accelerate config/yaml) in future releases. See our benchmark examples here

TensorParallel

We have intial support for an in-house solution to TP when working with accelerate dataloaders. check out the PR here

Jan 17, 2025

v1.3.0 Bug fixes + Require torch 2.0

↗

v1.3.0

Torch 2.0

As it's been ~2 years since torch 2.0 was first released, we are now requiring this as the minimum version for Accelerate, which similarly was done in transformers as of its last release.

Core

[docs] no hard-coding cuda by @faaany in https://github.com/huggingface/accelerate/pull/3270
fix load_state_dict for npu by @ji-huazhong in https://github.com/huggingface/accelerate/pull/3211
Add keep_torch_compile param to unwrap_model and extract_model_from_parallel for distributed compiled model. by @ggoggam in https://github.com/huggingface/accelerate/pull/3282
[tests] make cuda-only test case device-agnostic by @faaany in https://github.com/huggingface/accelerate/pull/3340
latest bnb no longer has optim_args attribute on optimizer by @winglian in https://github.com/huggingface/accelerate/pull/3311
add torchdata version check to avoid "in_order" error by @faaany in https://github.com/huggingface/accelerate/pull/3344
[docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in https://github.com/huggingface/accelerate/pull/3296
dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in https://github.com/huggingface/accelerate/pull/3346
feat(tpu): remove nprocs from xla.spawn by @tengomucho in https://github.com/huggingface/accelerate/pull/3324

Big Modeling

Fix test_nested_hook by @SunMarc in https://github.com/huggingface/accelerate/pull/3289
correct the return statement of _init_infer_auto_device_map by @Nech-C in https://github.com/huggingface/accelerate/pull/3279
Use torch.xpu.mem_get_info for XPU by @dvrogozh in https://github.com/huggingface/accelerate/pull/3275
Ensure that tied parameter is children of module by @pablomlago in https://github.com/huggingface/accelerate/pull/3327
Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3332
Fix offload generate tests by @SunMarc in https://github.com/huggingface/accelerate/pull/3334

Examples

Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in https://github.com/huggingface/accelerate/pull/3193

Full Changelog

What's Changed

[docs] no hard-coding cuda by @faaany in https://github.com/huggingface/accelerate/pull/3270
fix load_state_dict for npu by @ji-huazhong in https://github.com/huggingface/accelerate/pull/3211
Fix test_nested_hook by @SunMarc in https://github.com/huggingface/accelerate/pull/3289
correct the return statement of _init_infer_auto_device_map by @Nech-C in https://github.com/huggingface/accelerate/pull/3279
Give example on how to handle gradient accumulation with cross-entropy by @ylacombe in https://github.com/huggingface/accelerate/pull/3193
Use torch.xpu.mem_get_info for XPU by @dvrogozh in https://github.com/huggingface/accelerate/pull/3275
Add keep_torch_compile param to unwrap_model and extract_model_from_parallel for distributed compiled model. by @ggoggam in https://github.com/huggingface/accelerate/pull/3282
Ensure that tied parameter is children of module by @pablomlago in https://github.com/huggingface/accelerate/pull/3327
Bye bye torch <2 by @muellerzr in https://github.com/huggingface/accelerate/pull/3331
Fixup docker build err by @muellerzr in https://github.com/huggingface/accelerate/pull/3333
feat(tpu): remove nprocs from xla.spawn by @tengomucho in https://github.com/huggingface/accelerate/pull/3324
Fix offload generate tests by @SunMarc in https://github.com/huggingface/accelerate/pull/3334
[tests] make cuda-only test case device-agnostic by @faaany in https://github.com/huggingface/accelerate/pull/3340
latest bnb no longer has optim_args attribute on optimizer by @winglian in https://github.com/huggingface/accelerate/pull/3311
Fix for offloading when using TorchAO >= 0.7.0 by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3332
add torchdata version check to avoid "in_order" error by @faaany in https://github.com/huggingface/accelerate/pull/3344
[docs] fix typo, change "backoff_filter" to "backoff_factor" by @suchot in https://github.com/huggingface/accelerate/pull/3296
dataloader: check that in_order is in kwargs before trying to drop it by @dvrogozh in https://github.com/huggingface/accelerate/pull/3346

New Contributors

@ylacombe made their first contribution in https://github.com/huggingface/accelerate/pull/3193
@ggoggam made their first contribution in https://github.com/huggingface/accelerate/pull/3282
@pablomlago made their first contribution in https://github.com/huggingface/accelerate/pull/3327
@tengomucho made their first contribution in https://github.com/huggingface/accelerate/pull/3324
@suchot made their first contribution in https://github.com/huggingface/accelerate/pull/3296

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.2.1...v1.3.0

Dec 13, 2024

v1.2.1: Patchfix

↗

v1.2.1

fix: add max_memory to _init_infer_auto_device_map's return statement in https://github.com/huggingface/accelerate/pull/3279 by @Nech-C
fix load_state_dict for npu in https://github.com/huggingface/accelerate/pull/3211 by @statelesshz

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.2.0...v1.2.1

v1.2.0: Bug Squashing & Fixes across the board

↗

v1.2.0

Core

enable find_executable_batch_size on XPU by @faaany in https://github.com/huggingface/accelerate/pull/3236
Use numpy._core instead of numpy.core by @qgallouedec in https://github.com/huggingface/accelerate/pull/3247
Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in https://github.com/huggingface/accelerate/pull/3066
Allow for full dynamo config passed to Accelerator by @muellerzr in https://github.com/huggingface/accelerate/pull/3251
[WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3252
[data_loader] Optionally also propagate set_epoch to batch sampler by @tomaarsen in https://github.com/huggingface/accelerate/pull/3246
use XPU instead of GPU in the accelerate config prompt text by @faaany in https://github.com/huggingface/accelerate/pull/3268

Big Modeling

Fix align_module_device, ensure only cpu tensors for get_state_dict_offloaded_model by @kylesayrs in https://github.com/huggingface/accelerate/pull/3217
Remove hook for bnb 4-bit by @SunMarc in https://github.com/huggingface/accelerate/pull/3223
[docs] add instruction to install bnb on non-cuda devices by @faaany in https://github.com/huggingface/accelerate/pull/3227
Take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in https://github.com/huggingface/accelerate/pull/3226
Update deferring_execution.md by @max-yue in https://github.com/huggingface/accelerate/pull/3262
Revert default behavior of get_state_dict_from_offload by @kylesayrs in https://github.com/huggingface/accelerate/pull/3253
Fix: Resolve #3060, preload_module_classes is lost for nested modules by @wejoncy in https://github.com/huggingface/accelerate/pull/3248

DeepSpeed

Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in https://github.com/huggingface/accelerate/pull/3255
support for wrapped schedulefree optimizer when using deepspeed by @winglian in https://github.com/huggingface/accelerate/pull/3266

Documentation

Update code in tracking documentation by @faaany in https://github.com/huggingface/accelerate/pull/3235
Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in https://github.com/huggingface/accelerate/pull/3259
Update set-seed by @faaany in https://github.com/huggingface/accelerate/pull/3228
Fix typo by @faaany in https://github.com/huggingface/accelerate/pull/3221
Use real path for checkpoint by @faaany in https://github.com/huggingface/accelerate/pull/3220
Fixed multiple typos for Tutorials and Guides docs by @henryhmko in https://github.com/huggingface/accelerate/pull/3274

New Contributors

@winglian made their first contribution in https://github.com/huggingface/accelerate/pull/3266
@max-yue made their first contribution in https://github.com/huggingface/accelerate/pull/3262
@as12138 made their first contribution in https://github.com/huggingface/accelerate/pull/3261
@relh made their first contribution in https://github.com/huggingface/accelerate/pull/3259
@wejoncy made their first contribution in https://github.com/huggingface/accelerate/pull/3248
@henryhmko made their first contribution in https://github.com/huggingface/accelerate/pull/3274

Full Changelog

Fix align_module_device, ensure only cpu tensors for get_state_dict_offloaded_model by @kylesayrs in https://github.com/huggingface/accelerate/pull/3217
remove hook for bnb 4-bit by @SunMarc in https://github.com/huggingface/accelerate/pull/3223
enable find_executable_batch_size on XPU by @faaany in https://github.com/huggingface/accelerate/pull/3236
take care of case when "_tied_weights_keys" is not an attribute by @fabianlim in https://github.com/huggingface/accelerate/pull/3226
[docs] update code in tracking documentation by @faaany in https://github.com/huggingface/accelerate/pull/3235
Add warnings and fallback for unassigned devices in infer_auto_device_map by @Nech-C in https://github.com/huggingface/accelerate/pull/3066
[data_loader] Optionally also propagate set_epoch to batch sampler by @tomaarsen in https://github.com/huggingface/accelerate/pull/3246
[docs] add instruction to install bnb on non-cuda devices by @faaany in https://github.com/huggingface/accelerate/pull/3227
Use numpy._core instead of numpy.core by @qgallouedec in https://github.com/huggingface/accelerate/pull/3247
Allow for full dynamo config passed to Accelerator by @muellerzr in https://github.com/huggingface/accelerate/pull/3251
[WIP] FEAT Decorator to purge accelerate env vars by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3252
use XPU instead of GPU in the accelerate config prompt text by @faaany in https://github.com/huggingface/accelerate/pull/3268
support for wrapped schedulefree optimizer when using deepspeed by @winglian in https://github.com/huggingface/accelerate/pull/3266
Update deferring_execution.md by @max-yue in https://github.com/huggingface/accelerate/pull/3262
Fix: Resolve #3257 by @as12138 in https://github.com/huggingface/accelerate/pull/3261
Replaced set/check breakpoint with set/check trigger in the troubleshooting documentation by @relh in https://github.com/huggingface/accelerate/pull/3259
Select the DeepSpeedCPUOptimizer based on the original optimizer class. by @eljandoubi in https://github.com/huggingface/accelerate/pull/3255
Revert default behavior of get_state_dict_from_offload by @kylesayrs in https://github.com/huggingface/accelerate/pull/3253
Fix: Resolve #3060, preload_module_classes is lost for nested modules by @wejoncy in https://github.com/huggingface/accelerate/pull/3248
[docs] update set-seed by @faaany in https://github.com/huggingface/accelerate/pull/3228
[docs] fix typo by @faaany in https://github.com/huggingface/accelerate/pull/3221
[docs] use real path for checkpoint by @faaany in https://github.com/huggingface/accelerate/pull/3220
Fixed multiple typos for Tutorials and Guides docs by @henryhmko in https://github.com/huggingface/accelerate/pull/3274

Code Diff

Release diff: https://github.com/huggingface/accelerate/compare/v1.1.1...v1.2.0

Nov 1, 2024

v1.1.0: Python 3.9 minimum, torch dynamo deepspeed support, and bug fixes

↗

v1.1.0

Internals:

Allow for a data_seed argument in https://github.com/huggingface/accelerate/pull/3150
Trigger weights_only=True by default for all compatible objects when checkpointing and saving with torch.save in https://github.com/huggingface/accelerate/pull/3036
Handle negative values for dim input in pad_across_processes in https://github.com/huggingface/accelerate/pull/3114
Enable cpu bnb distributed lora finetune in https://github.com/huggingface/accelerate/pull/3159

DeepSpeed

Support torch dynamo for deepspeed>=0.14.4 in https://github.com/huggingface/accelerate/pull/3069

Megatron

update Megatron-LM plugin code to version 0.8.0 or higher in https://github.com/huggingface/accelerate/pull/3174

Big Model Inference

New has_offloaded_params utility added in https://github.com/huggingface/accelerate/pull/3188

Examples

Florence2 distributed inference example in https://github.com/huggingface/accelerate/pull/3123

Full Changelog

Handle negative values for dim input in pad_across_processes by @mariusarvinte in https://github.com/huggingface/accelerate/pull/3114
Fixup DS issue with weakref by @muellerzr in https://github.com/huggingface/accelerate/pull/3143
Refactor scaler to util by @muellerzr in https://github.com/huggingface/accelerate/pull/3142
DS fix, continued by @muellerzr in https://github.com/huggingface/accelerate/pull/3145
Florence2 distributed inference example by @hlky in https://github.com/huggingface/accelerate/pull/3123
POC: Allow for a data_seed by @muellerzr in https://github.com/huggingface/accelerate/pull/3150
Adding multi gpu speech generation by @dame-cell in https://github.com/huggingface/accelerate/pull/3149
support torch dynamo for deepspeed>=0.14.4 by @oraluben in https://github.com/huggingface/accelerate/pull/3069
Fixup Zero3 + save_model by @muellerzr in https://github.com/huggingface/accelerate/pull/3146
Trigger weights_only=True by default for all compatible objects by @muellerzr in https://github.com/huggingface/accelerate/pull/3036
Remove broken dynamo test by @oraluben in https://github.com/huggingface/accelerate/pull/3155
fix version check bug in get_xpu_available_memory by @faaany in https://github.com/huggingface/accelerate/pull/3165
enable cpu bnb distributed lora finetune by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3159
[Utils] has_offloaded_params by @kylesayrs in https://github.com/huggingface/accelerate/pull/3188
fix bnb by @eljandoubi in https://github.com/huggingface/accelerate/pull/3186
[docs] update neptune API by @faaany in https://github.com/huggingface/accelerate/pull/3181
docs: fix a wrong word in comment in src/accelerate/accelerate.py:1255 by @Rebornix-zero in https://github.com/huggingface/accelerate/pull/3183
[docs] use nn.module instead of tensor as model by @faaany in https://github.com/huggingface/accelerate/pull/3157
Fix typo by @kylesayrs in https://github.com/huggingface/accelerate/pull/3191
MLU devices : Checks if mlu is available via an cndev-based check which won't trigger the drivers and leave mlu by @huismiling in https://github.com/huggingface/accelerate/pull/3187
update Megatron-LM plugin code to version 0.8.0 or higher. by @eljandoubi in https://github.com/huggingface/accelerate/pull/3174
🚨 🚨 🚨 Goodbye Python 3.8! 🚨 🚨 🚨 by @muellerzr in https://github.com/huggingface/accelerate/pull/3194
Update transformers.deepspeed references from transformers 4.46.0 release by @loadams in https://github.com/huggingface/accelerate/pull/3196
eliminate dead code by @statelesshz in https://github.com/huggingface/accelerate/pull/3198
take torch.nn.Module model into account when moving to device by @faaany in https://github.com/huggingface/accelerate/pull/3167
[docs] add xpu part and fix bug in torchrun by @faaany in https://github.com/huggingface/accelerate/pull/3166
Models With Tied Weights Need Re-Tieing After FSDP Param Init by @fabianlim in https://github.com/huggingface/accelerate/pull/3154
add the missing xpu for local sgd by @faaany in https://github.com/huggingface/accelerate/pull/3163
typo fix in big_modeling.py by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3207
[Utils] align_module_device by @kylesayrs in https://github.com/huggingface/accelerate/pull/3204

New Contributors

@mariusarvinte made their first contribution in https://github.com/huggingface/accelerate/pull/3114
@hlky made their first contribution in https://github.com/huggingface/accelerate/pull/3123
@dame-cell made their first contribution in https://github.com/huggingface/accelerate/pull/3149
@kylesayrs made their first contribution in https://github.com/huggingface/accelerate/pull/3188
@eljandoubi made their first contribution in https://github.com/huggingface/accelerate/pull/3186
@Rebornix-zero made their first contribution in https://github.com/huggingface/accelerate/pull/3183
@loadams made their first contribution in https://github.com/huggingface/accelerate/pull/3196

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.0.1...v1.1.0

Oct 12, 2024

v1.0.1: Bugfix

↗

v1.0.1

Bugfixes

Fixes an issue where the auto values were no longer being parsed when using deepspeed
Fixes a broken test in the deepspeed tests related to the auto values

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.0.0...v1.0.1

Oct 7, 2024

Accelerate 1.0.0 is here!

↗

v1.0.0

🚀 Accelerate 1.0 🚀

With accelerate 1.0, we are officially stating that the core parts of the API are now "stable" and ready for the future of what the world of distributed training and PyTorch has to handle. With these release notes, we will focus first on the major breaking changes to get your code fixed, followed by what is new specifically between 0.34.0 and 1.0.

To read more, check out our official blog here

Migration assistance

Passing in dispatch_batches, split_batches, even_batches, and use_seedable_sampler to the Accelerator() should now be handled by creating an accelerate.utils.DataLoaderConfiguration() and passing this to the Accelerator() instead (Accelerator(dataloader_config=DataLoaderConfiguration(...)))
Accelerator().use_fp16 and AcceleratorState().use_fp16 have been removed; this should be replaced by checking accelerator.mixed_precision == "fp16"
Accelerator().autocast() no longer accepts a cache_enabled argument. Instead, an AutocastKwargs() instance should be used which handles this flag (among others) passing it to the Accelerator (Accelerator(kwargs_handlers=[AutocastKwargs(cache_enabled=True)]))
accelerate.utils.is_tpu_available should be replaced with accelerate.utils.is_torch_xla_available
accelerate.utils.modeling.shard_checkpoint should be replaced with split_torch_state_dict_into_shards from the huggingface_hub library
accelerate.tqdm.tqdm() no longer accepts True/False as the first argument, and instead, main_process_only should be passed in as a named argument

Multiple Model DeepSpeed Support

After long request, we finally have multiple model DeepSpeed support in Accelerate! (though it is quite early still). Read the full tutorial here, however essentially:

When using multiple models, a DeepSpeed plugin should be created for each model (and as a result, a separate config). a few examples are below:

Knowledge distillation

(Where we train only one model, zero3, and another is used for inference, zero2)

from accelerate import Accelerator
from accelerate.utils import DeepSpeedPlugin

zero2_plugin = DeepSpeedPlugin(hf_ds_config="zero2_config.json")
zero3_plugin = DeepSpeedPlugin(hf_ds_config="zero3_config.json")

deepspeed_plugins = {"student": zero2_plugin, "teacher": zero3_plugin}


accelerator = Accelerator(deepspeed_plugins=deepspeed_plugins)

To then select which plugin to be used at a certain time (aka when calling prepare), we call `accelerator.state.select_deepspeed_plugin("name"), where the first plugin is active by default:

accelerator.state.select_deepspeed_plugin("student")
student_model, optimizer, scheduler = ...
student_model, optimizer, scheduler, train_dataloader = accelerator.prepare(student_model, optimizer, scheduler, train_dataloader)

accelerator.state.select_deepspeed_plugin("teacher") # This will automatically enable zero init
teacher_model = AutoModel.from_pretrained(...)
teacher_model = accelerator.prepare(teacher_model)

Multiple disjoint models

For disjoint models, separate accelerators should be used for each model, and their own .backward() should be called later:

for batch in dl:
    outputs1 = first_model(**batch)
    first_accelerator.backward(outputs1.loss)
    first_optimizer.step()
    first_scheduler.step()
    first_optimizer.zero_grad()
    
    outputs2 = model2(**batch)
    second_accelerator.backward(outputs2.loss)
    second_optimizer.step()
    second_scheduler.step()
    second_optimizer.zero_grad()

FP8

We've enabled MS-AMP support up to FSDP. At this time we are not going forward with implementing FSDP support with MS-AMP, due to design issues between both libraries that don't make them inter-op easily.

FSDP

Fixed FSDP auto_wrap using characters instead of full str for layers
Re-enable setting state dict type manually

Big Modeling

Removed cpu restriction for bnb training

What's Changed

Fix FSDP auto_wrap using characters instead of full str for layers by @muellerzr in https://github.com/huggingface/accelerate/pull/3075
Allow DataLoaderAdapter subclasses to be pickled by implementing __reduce__ by @byi8220 in https://github.com/huggingface/accelerate/pull/3074
Fix three typos in src/accelerate/data_loader.py by @xiabingquan in https://github.com/huggingface/accelerate/pull/3082
Re-enable setting state dict type by @muellerzr in https://github.com/huggingface/accelerate/pull/3084
Support sequential cpu offloading with torchao quantized tensors by @a-r-r-o-w in https://github.com/huggingface/accelerate/pull/3085
fix bug in _get_named_modules by @faaany in https://github.com/huggingface/accelerate/pull/3052
use the correct available memory API for XPU by @faaany in https://github.com/huggingface/accelerate/pull/3076
fix skip_keys usage in forward hooks by @152334H in https://github.com/huggingface/accelerate/pull/3088
Update README.md to include distributed image generation gist by @sayakpaul in https://github.com/huggingface/accelerate/pull/3077
MAINT: Upgrade ruff to v0.6.4 by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3095
Revert "Enable Unwrapping for Model State Dicts (FSDP)" by @SunMarc in https://github.com/huggingface/accelerate/pull/3096
MS-AMP support (w/o FSDP) by @muellerzr in https://github.com/huggingface/accelerate/pull/3093
[docs] DataLoaderConfiguration docstring by @stevhliu in https://github.com/huggingface/accelerate/pull/3103
MAINT: Permission for GH token in stale.yml by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3102
[docs] Doc sprint by @stevhliu in https://github.com/huggingface/accelerate/pull/3099
Update image ref for docs by @muellerzr in https://github.com/huggingface/accelerate/pull/3105
No more t5 by @muellerzr in https://github.com/huggingface/accelerate/pull/3107
[docs] More docstrings by @stevhliu in https://github.com/huggingface/accelerate/pull/3108
🚨🚨🚨 The Great Deprecation 🚨🚨🚨 by @muellerzr in https://github.com/huggingface/accelerate/pull/3098
POC: multiple model/configuration DeepSpeed support by @muellerzr in https://github.com/huggingface/accelerate/pull/3097
Fixup test_sync w/ deprecated stuff by @muellerzr in https://github.com/huggingface/accelerate/pull/3109
Switch to XLA instead of TPU by @SunMarc in https://github.com/huggingface/accelerate/pull/3118
[tests] skip pippy tests for XPU by @faaany in https://github.com/huggingface/accelerate/pull/3119
Fixup multiple model DS tests by @muellerzr in https://github.com/huggingface/accelerate/pull/3131
remove cpu restriction for bnb training by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3062
fix deprecated torch.cuda.amp.GradScaler FutureWarning for pytorch 2.4+ by @Mon-ius in https://github.com/huggingface/accelerate/pull/3132
🐛 [HotFix] Handle Profiler Activities Based on PyTorch Version by @yhna940 in https://github.com/huggingface/accelerate/pull/3136
only move model to device when model is in cpu and target device is xpu by @faaany in https://github.com/huggingface/accelerate/pull/3133
fix tip brackets typo by @davanstrien in https://github.com/huggingface/accelerate/pull/3129
typo of "scalar" instead of "scaler" by @tonyzhaozh in https://github.com/huggingface/accelerate/pull/3116
MNT Permission for PRs for GH token in stale.yml by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/3112

New Contributors

@xiabingquan made their first contribution in https://github.com/huggingface/accelerate/pull/3082
@a-r-r-o-w made their first contribution in https://github.com/huggingface/accelerate/pull/3085
@152334H made their first contribution in https://github.com/huggingface/accelerate/pull/3088
@sayakpaul made their first contribution in https://github.com/huggingface/accelerate/pull/3077
@Mon-ius made their first contribution in https://github.com/huggingface/accelerate/pull/3132
@davanstrien made their first contribution in https://github.com/huggingface/accelerate/pull/3129
@tonyzhaozh made their first contribution in https://github.com/huggingface/accelerate/pull/3116

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.34.2...v1.0.0

Sep 5, 2024

v0.34.1 Patchfix

↗

v0.34.1

Bug fixes

Fixes an issue where processed DataLoaders could no longer be pickled in #3074 thanks to @byi8220
Fixes an issue when using FSDP where default_transformers_cls_names_to_wrap would separate _no_split_modules by characters instead of keeping it as a list of layer names in #3075

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.34.0...v0.34.1

More from Hugging Face

Last Checked

14h ago

Latest

v1.13.0

Source

@huggingface/accelerate9.7k

Tracking since Mar 5, 2021

.json·.md·.atom

Accelerate

AWS Neuron support

XPU Improvements

FSDP2 Improvements

DeepSpeed Sequence Parallelism

FP8

Performance

Minor fixes

Deepspeed Ulysses/ALST integration

Minor changes

New Contributors

TE MXFP8 support

FP16/BF16 Training for MPS devices

FSDP updates

Nd-parallel updates

Bump to Python 3.10

Lots of minor fixes:

New Contributors

N-D Parallelism

FSDP improvements

Minor improvements

New Contributors

Trackio tracker support

Model loading speedup when relying set_module_tensor_to_device

FDSP, Deepspeed, FP8 minor improvements

🚨🚨🚨 Breaking changes 🚨🚨🚨

What's Changed

New Contributors

FSDPv2 refactor + FP8 support

Faster Distributed Training on Intel CPUs

Regional Compilation for DeepSpeed

ipex.optimize deprecation

Better XPU Support

Trackers

What's Changed

New Contributors

Regional compilation

Layerwise casting hook

Better FSDP2 support

Better HPU support:

Torch.compile breaking change for dynamic argument

What's Changed

New Contributors

FSDPv2 support

DeepSpeed TP support

Support for XCCL distributed backend

What's Changed

New Contributors

HPU Support

What's Changed

New Contributors

torchao FP8, initial Tensor Parallel support, and memory leak fixes

torchao FP8

TensorParallel

Bug fixes

What's Changed

New Contributors

Torch 2.0

Core

Big Modeling

Examples

Full Changelog

What's Changed

New Contributors

Core

Big Modeling

DeepSpeed

Documentation

New Contributors

Full Changelog

Code Diff

Internals:

DeepSpeed

Megatron

Big Model Inference

Examples

Full Changelog

New Contributors

Bugfixes

🚀 Accelerate 1.0 🚀

Model loading speedup when relying `set_module_tensor_to_device`

Torch.compile breaking change for `dynamic` argument

`torchao` FP8, initial Tensor Parallel support, and memory leak fixes

`torchao` FP8