AWS Neuron support
Accelerate now supports AWS Neuron (Trainium/Inferentia) devices. Thanks @michaelbenayoun for adding this.
- Neuron integration by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3935
XPU Improvements
We've removed the IPEX dependency and made more of the XPU code paths device-agnostic.
- using spawn instead of fork for XPU device by @kaixuanliu in https://github.com/huggingface/accelerate/pull/3884
- Remove ipex by @yao-matrix in https://github.com/huggingface/accelerate/pull/3883
- enhance new codes to XPU, and make them be device agnostic by @yao-matrix in https://github.com/huggingface/accelerate/pull/3890
- Fix KMP_AFFINITY incorrectly set for non-CPU training by @hexfaker in https://github.com/huggingface/accelerate/pull/3912
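To illustrate the "spawn instead of fork" entry above, here is a minimal standard-library sketch of choosing a start method per device type. It is not Accelerate's launcher code: `pick_start_method` and `make_context` are hypothetical helpers, and falling back to "fork" for every non-XPU device is an assumption made only for illustration.

```python
import multiprocessing as mp


def pick_start_method(device_type):
    # "spawn" for XPU follows the PR title above; using "fork" for
    # everything else is an assumption of this sketch.
    return "spawn" if device_type == "xpu" else "fork"


def make_context(device_type):
    # An explicit context avoids depending on the platform default:
    # child processes are started the same way everywhere.
    return mp.get_context(pick_start_method(device_type))
```

With a spawn context, each worker starts in a fresh interpreter instead of inheriting the parent's memory, so any device runtime state the parent has already initialized is not copied into the children.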
FSDP2 Improvements
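This release includes several fixes for FSDP2 users: parameters are upcast only when they require gradients, tied-embedding failures now raise a ValueError with targeted guidance, optimizer state can be loaded with DCP, a crash in the optimizer step with bf16 models is fixed, and ignored_params no longer crashes on torch < 2.7.0.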
- Upcast FSDP2 parameters only if requires_grad by @ojh31 in https://github.com/huggingface/accelerate/pull/3848
- Fix FSDP2 tied embedding errors with targeted ValueError guidance by @amanzoni1 in https://github.com/huggingface/accelerate/pull/3878
- bug: fsdp cannot load optimizer state using dcp by @flymin in https://github.com/huggingface/accelerate/pull/3904
- fix crash in optimizer.step when fsdp2 is enabled and model is bfloat16 by @sywangyi in https://github.com/huggingface/accelerate/pull/3905
- Fix FSDP2 crash with ignored_params on torch < 2.7.0 by @Mr-Neutr0n in https://github.com/huggingface/accelerate/pull/3924
DeepSpeed Sequence Parallelism
We've added several fixes to the DeepSpeed + Sequence Parallelism integration introduced in v1.12.0, including evaluation support during SP training and proper process group handling.
- [SP] fix loss computation example by @kashif in https://github.com/huggingface/accelerate/pull/3858
- [SP and CP] error out if both CP and SP enabled by @kashif in https://github.com/huggingface/accelerate/pull/3862
- DeepSpeed has its own process group by @kashif in https://github.com/huggingface/accelerate/pull/3916
- [Deepspeed] skip device mesh creation when deepspeed and sp_size >1 by @kashif in https://github.com/huggingface/accelerate/pull/3914
- Enable evaluation during deepspeed Sequence Parallel by @jp1924 in https://github.com/huggingface/accelerate/pull/3917
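The "error out if both CP and SP enabled" entry above describes a configuration check. The sketch below shows one way such a check could look. `ToyParallelismConfig` is a hypothetical stand-in rather than Accelerate's own config class, and the field names and error message are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ToyParallelismConfig:
    # Hypothetical fields: a size of 1 means that form of parallelism is off.
    cp_size: int = 1
    sp_size: int = 1

    def __post_init__(self):
        # Fail at construction time so the conflict is reported before
        # any process groups or device meshes are created.
        if self.cp_size > 1 and self.sp_size > 1:
            raise ValueError(
                "Context parallelism and sequence parallelism cannot both be "
                f"enabled (cp_size={self.cp_size}, sp_size={self.sp_size})."
            )
```

Validating in `__post_init__` keeps the failure close to the user's configuration code instead of surfacing later as an obscure error in the middle of distributed setup.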
FP8
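FP8 training gets fixes for the torchao default config (padding and FSDP2 all-gather support) and for execution with Transformer Engine, and MS-AMP now emits deprecation warnings. Thanks @shimizust for fixing torchao support.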
- Fix FP8 torchao default config with padding and FSDP2 all-gather support by @shimizust in https://github.com/huggingface/accelerate/pull/3831
- Fix execution with Transformer Engine by @ksivaman in https://github.com/huggingface/accelerate/pull/3852
- add MS-AMP deprecation warnings by @neha222222 in https://github.com/huggingface/accelerate/pull/3857
Performance
Accelerate now imports faster by deferring heavy dependencies, and torch.compile hooks are disabled lazily.
- Faster import by @SunMarc in https://github.com/huggingface/accelerate/pull/3953
- lazy compile disable by @SunMarc in https://github.com/huggingface/accelerate/pull/3947
- Disable hook compile by @SunMarc in https://github.com/huggingface/accelerate/pull/3888
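As a generic illustration of deferring heavy dependencies, the sketch below wraps a module in a proxy that imports it only on first use. This is not how Accelerate implements its faster import. `LazyModule` is a hypothetical helper, and `json` stands in for a genuinely heavy dependency.

```python
import importlib


class LazyModule:
    """Proxy that imports the wrapped module on first attribute access."""

    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        # __getattr__ runs only when normal lookup fails, so _name and
        # _module (set in __init__) never reach this method.
        if self._module is None:
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)


# Creating the proxy is cheap; the real import happens on first use.
lazy_json = LazyModule("json")
```

Calling `lazy_json.dumps(...)` triggers the real import the first time and reuses the cached module afterwards, so code that never touches the dependency never pays for importing it.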
Minor fixes
- Allow non-Tensor values in a batch with dispatch_batches=True by @tomaarsen in https://github.com/huggingface/accelerate/pull/3850
- fix module and optimizer parameter mismatch before prepare_tp_ by @naomili0924 in https://github.com/huggingface/accelerate/pull/3845
- Fix KeyError in extract_model_from_parallel for partial torch.compile by @amanzoni1 in https://github.com/huggingface/accelerate/pull/3881
- Fix hf_device_map device index comparison in prepare_model by @rezaqorbani in https://github.com/huggingface/accelerate/pull/3895
- Fix StatefulDataLoader KeyError with num_workers > 0 by @veeceey in https://github.com/huggingface/accelerate/pull/3931
- Fix stateful dataloader DDP by @SunMarc in https://github.com/huggingface/accelerate/pull/3952
- Fix: Remove duplicate W&B initialization in offline mode by @shantanugupta2004 in https://github.com/huggingface/accelerate/pull/3886
- Avoid using nvidia-smi on a CPU-only Colab instance by @FlorianVal in https://github.com/huggingface/accelerate/pull/3872
- Fix logging logic when in_order is set to True by @yuxinyuan in https://github.com/huggingface/accelerate/pull/3280
- Fix cpu offload check by @SunMarc in https://github.com/huggingface/accelerate/pull/3946
- fix bug when both cpu_ram_efficient_loading and cpu_offload are enabled by @kaixuanliu in https://github.com/huggingface/accelerate/pull/3910
- Fix async compatibility across python versions by @SunMarc in https://github.com/huggingface/accelerate/pull/3901
- fix tp only bug by @sywangyi in https://github.com/huggingface/accelerate/pull/3908
- fix parallelism_config None error by @jp1924 in https://github.com/huggingface/accelerate/pull/3927
- Np parall fix by @sywangyi in https://github.com/huggingface/accelerate/pull/3900
- change the default value of fsdp_min_num_params to int by @CodeMan62 in https://github.com/huggingface/accelerate/pull/3902
- Fix mutable default in Megatron init and IndexError on empty ModuleList by @jashshah999 in https://github.com/huggingface/accelerate/pull/3944
- Prepare TP fix by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3945
- feat: added fine tuning example focused on TPUs by @tengomucho in https://github.com/huggingface/accelerate/pull/3847
- Remove 8bit force hook for bnb by @SunMarc in https://github.com/huggingface/accelerate/pull/3907
- docs: flag MS-AMP as deprecated in low-precision training guides by @ManasVardhan in https://github.com/huggingface/accelerate/pull/3929
- fix: correct typo 'guarentee' to 'guarantee' by @thecaptain789 in https://github.com/huggingface/accelerate/pull/3922
- Updating support of Megatron-LM by @pengdurice in https://github.com/huggingface/accelerate/pull/3842
- Update support of Megatron-LM PR 2 by @pengdurice in https://github.com/huggingface/accelerate/pull/3887
- Fix RNG state setting for HPU by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3936
- fix: load the HPU RNG state by @michaelbenayoun in https://github.com/huggingface/accelerate/pull/3937
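For readers unfamiliar with the pitfall named in the "Fix mutable default in Megatron init" entry above, here is a minimal sketch of the general Python issue. The function names and the `"--flag"` value are hypothetical and do not come from the Megatron integration.

```python
def init_with_shared_default(extra_args=[]):
    # The default list is created once, at function definition time,
    # so every call without an argument mutates the same object.
    extra_args.append("--flag")
    return extra_args


def init_with_fresh_default(extra_args=None):
    # Use None as a sentinel and build a new list on each call;
    # copying also protects a list passed in by the caller.
    extra_args = [] if extra_args is None else list(extra_args)
    extra_args.append("--flag")
    return extra_args
```

Calling the first function twice without arguments returns the same list, now holding two entries, while the second function returns a fresh one-element list each time.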
