v1.13.0: Neuron support, IPEX removal, and distributed training fixes
We now have support for AWS Neuron (Trainium/Inferentia) devices. Thanks @michaelbenayoun for adding this.
We've removed the IPEX dependency and improved device-agnostic code for Intel XPU.
We've added several important fixes for FSDP2 users: upcasting only parameters that require gradients, clearer errors for tied embeddings, optimizer state loading via DCP, a fix for a crash in the bf16 optimizer step, and compatibility with torch < 2.7.0.
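The upcasting fix can be illustrated with a small sketch (the helper name is hypothetical and this is not Accelerate's implementation): when mixed-precision training needs fp32 master weights, only parameters that require gradients are upcast, while frozen parameters stay in their low-precision dtype.

```python
import torch
import torch.nn as nn

def upcast_trainable_params(model: nn.Module) -> None:
    # Sketch of the idea: upcast only grad-requiring params to fp32;
    # frozen parameters keep their low-precision storage.
    for param in model.parameters():
        if param.requires_grad:
            param.data = param.data.to(torch.float32)

model = nn.Linear(4, 4).to(torch.bfloat16)
model.bias.requires_grad_(False)   # freeze the bias
upcast_trainable_params(model)
print(model.weight.dtype, model.bias.dtype)  # weight upcast, frozen bias stays bf16
```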
We've added several fixes to the DeepSpeed + Sequence Parallelism integration introduced in v1.12.0, including evaluation support during SP training and proper process group handling.
We've enhanced FP8 training. Thanks @shimizust for fixing torchao support.
Accelerate now imports faster by deferring heavy dependencies, and torch.compile hooks are disabled lazily.
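The deferred-import approach can be sketched with PEP 562's module-level `__getattr__` (an illustrative pattern only, not Accelerate's actual code; the module and attribute names here are hypothetical):

```python
import importlib
import types

def make_lazy_module(name: str, heavy_dep: str) -> types.ModuleType:
    """Build a module that imports `heavy_dep` only on first attribute access."""
    module = types.ModuleType(name)

    def __getattr__(attr):
        if attr == "heavy":
            # Deferred import: the heavy dependency loads on first access,
            # so a plain `import name` stays fast.
            return importlib.import_module(heavy_dep)
        raise AttributeError(f"module {name!r} has no attribute {attr!r}")

    # PEP 562: modules consult a __getattr__ in their namespace as a fallback.
    module.__getattr__ = __getattr__
    return module

pkg = make_lazy_module("fastpkg", "json")  # "json" stands in for a heavy dep
print(pkg.heavy.dumps({"a": 1}))           # the dependency is imported here
```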