v1.11.0: TE MXFP8, FP16/BF16 with MPS, Python 3.10…

TE MXFP8 support

We've added support for MXFP8 in our TransformerEngine integration. To use it, set use_mxfp8_block_scaling in fp8_config. See the NVIDIA docs [here](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling).

Add support for TE MXFP8 recipe in accelerate by @pstjohn in https://github.com/huggingface/accelerate/pull/3688
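
As a rough sketch of what enabling this could look like: the helper below builds the fp8_config mapping described above, and make_accelerator() shows a possible programmatic equivalent. The helper names are hypothetical. The "backend": "TE" key and the use_mxfp8_block_scaling field on TERecipeKwargs are assumptions based on how the existing TE options are exposed, so check them against the PR before relying on them.

```python
import json


def build_fp8_config(use_mxfp8_block_scaling: bool = True) -> dict:
    """Return an fp8_config mapping like the one in an accelerate config file."""
    return {
        # Assumption: TransformerEngine is selected with backend "TE".
        "backend": "TE",
        # Option named in these release notes.
        "use_mxfp8_block_scaling": use_mxfp8_block_scaling,
    }


def make_accelerator():
    """Programmatic variant; needs accelerate, TransformerEngine and supported hardware."""
    # Imported lazily so the config helper above works without accelerate installed.
    from accelerate import Accelerator
    from accelerate.utils import TERecipeKwargs

    # Assumption: the option is exposed under the same name on TERecipeKwargs.
    recipe = TERecipeKwargs(use_mxfp8_block_scaling=True)
    return Accelerator(mixed_precision="fp8", kwargs_handlers=[recipe])


if __name__ == "__main__":
    config = {"mixed_precision": "fp8", "fp8_config": build_fp8_config()}
    print(json.dumps(config, indent=2))
```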

FP16/BF16 Training for MPS devices

BF16 and FP16 support for MPS devices is finally here. You can now pass mixed_precision="fp16" or mixed_precision="bf16" when training on a Mac (fp16 requires torch 2.8 and bf16 requires torch 2.6).

Add bf16/fp16 support for amp with mps device by @SunMarc in https://github.com/huggingface/accelerate/pull/3373
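
A minimal sketch of picking a mixed precision mode for MPS from the installed torch version. The helper names are hypothetical, and treating the versions above as minimums (2.6 or later for bf16, 2.8 or later for fp16) is an assumption.

```python
def supported_mps_precisions(torch_version: str) -> list[str]:
    """Return the mixed_precision values usable on MPS for a torch version string."""
    # Only major.minor matter; suffixes such as "+cpu" sit after the second dot.
    major, minor = (int(part) for part in torch_version.split(".")[:2])
    supported = []
    if (major, minor) >= (2, 6):  # assumption: 2.6 is a minimum for bf16
        supported.append("bf16")
    if (major, minor) >= (2, 8):  # assumption: 2.8 is a minimum for fp16
        supported.append("fp16")
    return supported


def make_accelerator(precision: str):
    """Create an Accelerator with the chosen precision; needs accelerate and torch."""
    # Imported lazily so the version helper works without accelerate installed.
    from accelerate import Accelerator

    return Accelerator(mixed_precision=precision)
```

On a Mac, one would call supported_mps_precisions(torch.__version__) and pass one of the returned values to make_accelerator().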

FSDP updates

The following PRs add support for ignored_params and no_sync(), respectively, in FSDPv2:

feat: add ignored_params support for fsdp2 by @kmehant in https://github.com/huggingface/accelerate/pull/3731
fix: model.set_requires_gradient_sync(False) should be called to turn off gradient synchronization in FSDP2 by @EquationWalker in https://github.com/huggingface/accelerate/pull/3762
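
To illustrate where no_sync() fits, here is a hedged sketch of manual gradient accumulation: gradient synchronization is skipped for every micro-batch except the last one. The function and its arguments are hypothetical; only accelerator.no_sync(model) and accelerator.backward(loss) are Accelerate calls.

```python
from contextlib import nullcontext


def accumulate_and_step(accelerator, model, optimizer, micro_batches, loss_fn):
    """Run one optimizer step over several micro-batches, syncing only on the last."""
    last = len(micro_batches) - 1
    for i, batch in enumerate(micro_batches):
        # Skip gradient synchronization on every micro-batch except the last.
        ctx = accelerator.no_sync(model) if i < last else nullcontext()
        with ctx:
            # Scale the loss so the accumulated gradient matches a full batch.
            loss = loss_fn(model, batch) / len(micro_batches)
            accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
```

In most training loops, accelerator.accumulate(model) handles this bookkeeping for you; the explicit form is shown only to make the role of no_sync() visible.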

Mixed precision can now be passed as a dtype string, either through the accelerate CLI flag or through fsdp_config in the accelerate config file:

feat: allow mixed precision policy as dtype by @kmehant in https://github.com/huggingface/accelerate/pull/3751

Nd-parallel updates

Some minor updates concerning nd-parallelism.

Context Parallelism docs typos fixed by @sergiopaniego in https://github.com/huggingface/accelerate/pull/3761
Feat: add to_json by @S1ro1 in https://github.com/huggingface/accelerate/pull/3743
make torch_native_parallelism examples device agnostic by @yao-matrix in https://github.com/huggingface/accelerate/pull/3759
[ND Parallel] Update examples, cleanup by @S1ro1 in https://github.com/huggingface/accelerate/pull/3737

Bump to Python 3.10

We've dropped support for Python 3.9, as it reached end of life in October 2025.

Bump to python3.10 + update linter by @SunMarc in https://github.com/huggingface/accelerate/pull/3809
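
If your own scripts need to fail early on older interpreters, a guard along these lines is one option. This is a sketch and not part of Accelerate; MINIMUM_PYTHON and python_is_supported are hypothetical names.

```python
import sys

# Minimum interpreter version after this release.
MINIMUM_PYTHON = (3, 10)


def python_is_supported(version_info=sys.version_info) -> bool:
    """Return True if the given (major, minor, ...) version meets the minimum."""
    return tuple(version_info[:2]) >= MINIMUM_PYTHON


if __name__ == "__main__":
    if not python_is_supported():
        raise SystemExit("Python 3.10 or newer is required")
```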

Lots of minor fixes:

fix: CPU RAM efficient loading for nd or HSDP parallelisms by @kmehant in https://github.com/huggingface/accelerate/pull/3740
xpu INT64 all_gather issue fixed in 2.9 by @yao-matrix in https://github.com/huggingface/accelerate/pull/3756
Specify device_ids in torch.distributed.barrier for PartialState by @qgallouedec in https://github.com/huggingface/accelerate/pull/3744
fix: specify device for process_tensor in example usage by @qgallouedec in https://github.com/huggingface/accelerate/pull/3755
Lower complexity of get_balanced_memory by adding a set by @SamuelBarryCS in https://github.com/huggingface/accelerate/pull/3776
Fix (skip) cuda cache flush when origin device is cpu and offloaded to meta by @Qubitium in https://github.com/huggingface/accelerate/pull/3796
Fix convert LayerNorm without bias to fp8 by @mjun0812 in https://github.com/huggingface/accelerate/pull/3725
Add optional typing by @cyyever in https://github.com/huggingface/accelerate/pull/3769
refactor: Use with in Accelerator.autocast() instead of __enter__() and __exit__() for more elegant style. by @EquationWalker in https://github.com/huggingface/accelerate/pull/3767
switch XPU ccl backend to torch-builtin xccl in test_zero3_integration by @yao-matrix in https://github.com/huggingface/accelerate/pull/3773
fix FSDP2 test case failure on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3771
Fix tests by @SunMarc in https://github.com/huggingface/accelerate/pull/3722
Protect import for device_mesh by @SunMarc in https://github.com/huggingface/accelerate/pull/3742
Fix SWANLAB_MODE by @SunMarc in https://github.com/huggingface/accelerate/pull/3808
Fix tracking swanlab by @SunMarc in https://github.com/huggingface/accelerate/pull/3810
refactor: nit change for get_parameters_from_modules (code debt) by @kmehant in https://github.com/huggingface/accelerate/pull/3815
Remove deprecated FindTiedParametersResult by @cyyever in https://github.com/huggingface/accelerate/pull/3786
remove mlflow from testing by @SunMarc in https://github.com/huggingface/accelerate/pull/3783
enable 2 model hook ut cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3774
Added Tip for better rendering by @sergiopaniego in https://github.com/huggingface/accelerate/pull/3781
Fix typos by @cyyever in https://github.com/huggingface/accelerate/pull/3753
fix: torch_npu import error in some envs by @yanyongyu in https://github.com/huggingface/accelerate/pull/3764
Fix: typo makes tests fail by @S1ro1 in https://github.com/huggingface/accelerate/pull/3765
fix Multi node CUDA error: invalid device ordinal #3775 by @RicardoDominguez in https://github.com/huggingface/accelerate/pull/3779
use reset_peak_memory_stats on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3772

New Contributors

@mjun0812 made their first contribution in https://github.com/huggingface/accelerate/pull/3725
@sergiopaniego made their first contribution in https://github.com/huggingface/accelerate/pull/3761
@EquationWalker made their first contribution in https://github.com/huggingface/accelerate/pull/3762
@yanyongyu made their first contribution in https://github.com/huggingface/accelerate/pull/3764
@RicardoDominguez made their first contribution in https://github.com/huggingface/accelerate/pull/3779
@SamuelBarryCS made their first contribution in https://github.com/huggingface/accelerate/pull/3776
@Qubitium made their first contribution in https://github.com/huggingface/accelerate/pull/3796

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.10.1...v1.11.0
