v1.8.0: FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation
FSDPv2 refactor + FP8 support
We've simplified how FSDPv2 models are prepared, because there were too many ways to compose FSDPv2 with other features (e.g., FP8, torch.compile, activation checkpointing). The setup is now more restrictive, but it leads to fewer errors and better performance. We've also added support for FP8. You can read about the results here. Thanks to @S1ro1 for this contribution!
- [FSDP2] Refactor + FP8 by @S1ro1 in https://github.com/huggingface/accelerate/pull/3585
Faster Distributed Training on Intel CPUs
We updated the CCL_WORKER_COUNT variable and added KMP parameters for Intel CPU users. This significantly improves distributed training performance (e.g., Tensor Parallelism), with up to a 40% speed-up on Intel 4th Gen Xeon when training transformer TP models.
- Set ccl and KMP param in simple launch by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3575
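As a rough illustration of the kind of environment setup this change performs, the sketch below builds a launch environment that fills in CCL_WORKER_COUNT and two KMP variables only when the user has not already set them. The helper name build_intel_cpu_env and the default values shown are assumptions for illustration, not necessarily the values Accelerate ships; see the linked PR for the actual defaults.

```python
import os
from typing import Mapping, Optional

# Illustrative defaults only (assumptions, not necessarily Accelerate's values).
ASSUMED_DEFAULTS = {
    "CCL_WORKER_COUNT": "1",
    "KMP_AFFINITY": "granularity=fine,compact,1,0",
    "KMP_BLOCKTIME": "1",
}


def build_intel_cpu_env(base_env: Optional[Mapping[str, str]] = None) -> dict:
    """Return a copy of the environment with CCL/KMP defaults filled in.

    Values the user already exported are left untouched.
    """
    env = dict(os.environ if base_env is None else base_env)
    for name, value in ASSUMED_DEFAULTS.items():
        env.setdefault(name, value)
    return env


if __name__ == "__main__":
    # The user-provided KMP_BLOCKTIME wins over the assumed default.
    env = build_intel_cpu_env({"KMP_BLOCKTIME": "0"})
    for name in ASSUMED_DEFAULTS:
        print(f"{name}={env[name]}")
```

Using setdefault keeps any value the user exported before launching, so a sketch like this only supplies defaults and never overrides explicit tuning.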
Regional Compilation for DeepSpeed
We added support for regional compilation with the DeepSpeed engine. DeepSpeed’s .compile() modifies models in-place using torch.nn.Module.compile(...), rather than the out-of-place torch.compile(...), so we had to account for that. Thanks @IlyasMoutawwakil for this feature!
- Fix deepspeed regional compilation by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3609
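To make the in-place versus out-of-place distinction concrete, here is a minimal sketch that uses plain Python stand-ins rather than PyTorch or DeepSpeed. Block, CompiledWrapper, compile_out_of_place, and both regional_compile_* helpers are hypothetical names; only the calling pattern mirrors torch.compile(module), which returns a new wrapper that must be re-attached, and module.compile(), which returns None and changes the module itself.

```python
class Block:
    """Stand-in for a repeated layer such as a transformer block."""

    def __init__(self):
        self.compiled = False

    def compile(self):
        # In-place style, like torch.nn.Module.compile: mutates self, returns None.
        self.compiled = True


class CompiledWrapper:
    """Stand-in for the wrapper object an out-of-place compile returns."""

    def __init__(self, inner):
        self.inner = inner
        self.compiled = True


def compile_out_of_place(block):
    # Out-of-place style, like torch.compile: the original block is untouched.
    return CompiledWrapper(block)


def regional_compile_out_of_place(blocks):
    # The caller must swap each block for the wrapper it gets back.
    return [compile_out_of_place(b) for b in blocks]


def regional_compile_in_place(blocks):
    # Nothing to swap: each block is modified directly and keeps its identity.
    for b in blocks:
        b.compile()
    return blocks
```

In the in-place variant, any reference already held to a block still points at the compiled object, while the out-of-place variant requires replacing those references. That difference is one plausible reason the two styles need separate handling.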
ipex.optimize deprecation
ipex.optimize is being deprecated. Most of its optimizations have been upstreamed to PyTorch, and future improvements will land there directly. For users on PyTorch versions older than 2.8, we'll continue to rely on IPEX for now.
- remove ipex.optimize in accelerate by @yao-matrix in https://github.com/huggingface/accelerate/pull/3608
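The version cutoff described above can be sketched as a small check on the PyTorch version string. This is only an illustration: needs_ipex is a hypothetical helper, not Accelerate's actual check, and it assumes the version string starts with "major.minor".

```python
import re


def needs_ipex(torch_version: str, threshold=(2, 8)) -> bool:
    """Return True when the given PyTorch version is older than the threshold."""
    # Only the leading "major.minor" is compared, so suffixes such as
    # "+cpu" or ".dev2025" do not affect the result.
    match = re.match(r"(\d+)\.(\d+)", torch_version)
    if match is None:
        raise ValueError(f"Unrecognized version string: {torch_version!r}")
    major_minor = (int(match.group(1)), int(match.group(2)))
    return major_minor < threshold
```

Comparing integer tuples rather than strings avoids ordering mistakes such as treating "2.10" as older than "2.8".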
Better XPU Support
We've greatly expanded and stabilized support for Intel XPUs:
- enable fsdp2 benchmark on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3590
- enable big_model_inference on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3595
- enable test_load_checkpoint_and_dispatch_with_broadcast cases on XPU by @yao-matrix in
- enable test_cli & test_example cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3578
- enable torchao and pippy test cases on XPU by @yao-matrix in https://github.com/huggingface/accelerate/pull/3599
- enable regional_compilation benchmark on xpu by @yao-matrix in https://github.com/huggingface/accelerate/pull/3592
- fix xpu 8bit value loading by @jiqing-feng in https://github.com/huggingface/accelerate/pull/3623
- add device-agnostic GradScaler by @yao-matrix in https://github.com/huggingface/accelerate/pull/3588
- add xpu support in TorchTensorParallelPlugin by @yao-matrix in https://github.com/huggingface/accelerate/pull/3627
Trackers
We've added support for SwanLab as an experiment tracking backend. Huge thanks to @ShaohonChen for this contribution! We also deferred all tracker initializations to prevent premature setup of distributed environments.
- Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in https://github.com/huggingface/accelerate/pull/3605
- Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup by @yuanjua in https://github.com/huggingface/accelerate/pull/3581
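The idea behind deferring tracker initialization can be sketched with a toy tracker whose constructor only records its arguments and whose setup runs later, on request. LazyTracker and its methods are illustrative names and should not be read as Accelerate's tracker API; the "run" here is a plain dictionary standing in for a real backend session.

```python
class LazyTracker:
    """Toy tracker: the constructor only stores arguments; start() does the setup."""

    def __init__(self, project_name, **init_kwargs):
        self.project_name = project_name
        self.init_kwargs = init_kwargs
        self.run = None  # no backend session exists yet

    def start(self):
        # Stand-in for a backend's real init call (hypothetical here).
        if self.run is None:
            self.run = {"project": self.project_name, **self.init_kwargs}
        return self.run

    def log(self, values, step=None):
        if self.run is None:
            raise RuntimeError("Call start() before logging.")
        self.run.setdefault("history", []).append((step, dict(values)))
```

Because construction has no side effects in this pattern, a tracker object can be created early, and the expensive or environment-sensitive setup can wait until the rest of the program is ready.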
What's Changed
- Fix bf16 training with TP by @SunMarc in https://github.com/huggingface/accelerate/pull/3610
- better handle FP8 with and without deepspeed by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3611
- Update Gaudi Runners by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3593
- goodbye torch_ccl by @yao-matrix in https://github.com/huggingface/accelerate/pull/3580
- Add support for standalone mode when default port is occupied on single node by @laitifranz in https://github.com/huggingface/accelerate/pull/3576
- Resolve logger warnings by @emmanuel-ferdman in https://github.com/huggingface/accelerate/pull/3582
- Add kwargs to optimizer, scheduler and dataloader using function accelerator().load_state() by @luiz0992 in https://github.com/huggingface/accelerate/pull/3540
- [docs] no hard-coded cuda in the ddp documentation by @faaany in https://github.com/huggingface/accelerate/pull/3589
- change to use torch.device by @yao-matrix in https://github.com/huggingface/accelerate/pull/3594
- Fix: list object has no attribute keys by @S1ro1 in https://github.com/huggingface/accelerate/pull/3603
- Remove device_count for TPU launcher to avoid initializing runtime by @sorgfresser in https://github.com/huggingface/accelerate/pull/3587
- Fix missing te.LayerNorm in intel_transformer_engine by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3619
- Add fp8_e5m2 support in dtype_byte_size by @SunMarc in https://github.com/huggingface/accelerate/pull/3625
- [Deepspeed] deepspeed auto grad accum by @kashif in https://github.com/huggingface/accelerate/pull/3630
- Remove hardcoded cuda from fsdpv2 by @IlyasMoutawwakil in https://github.com/huggingface/accelerate/pull/3631
- Integrate SwanLab for offline/online experiment tracking for Accelerate by @ShaohonChen in https://github.com/huggingface/accelerate/pull/3605
- Fix Typos in Documentation and Comments by @leopardracer in https://github.com/huggingface/accelerate/pull/3621
- feat: use datasets.IterableDataset shard if possible by @SunMarc in https://github.com/huggingface/accelerate/pull/3635
- [DeepSpeed] sync gradient accum steps from deepspeed plugin by @kashif in https://github.com/huggingface/accelerate/pull/3632
- Feat: add cpu offload by @S1ro1 in https://github.com/huggingface/accelerate/pull/3636
- Fix: correct labels for fsdp2 examples by @S1ro1 in https://github.com/huggingface/accelerate/pull/3637
- fix grad acc deepspeed by @SunMarc in https://github.com/huggingface/accelerate/pull/3638
New Contributors
- @laitifranz made their first contribution in https://github.com/huggingface/accelerate/pull/3576
- @emmanuel-ferdman made their first contribution in https://github.com/huggingface/accelerate/pull/3582
- @yuanjua made their first contribution in https://github.com/huggingface/accelerate/pull/3581
- @sorgfresser made their first contribution in https://github.com/huggingface/accelerate/pull/3587
- @ShaohonChen made their first contribution in https://github.com/huggingface/accelerate/pull/3605
- @leopardracer made their first contribution in https://github.com/huggingface/accelerate/pull/3621
Full Changelog: https://github.com/huggingface/accelerate/compare/v1.7.0...v1.8.0
