safetensors version 0.4.3; numpy 2.0.0.
The accelerate library will handle process-group cleanup automatically with accelerator.end_training(), or you can do it manually using PartialState().destroy_process_group().
transfer_to_npu, ensuring better performance and compatibility.
StatefulDataLoader from torchdata, allowing better handling of data loading states. Enable it by passing use_stateful_dataloader=True to the DataLoaderConfiguration; when calling load_state(), the DataLoader will automatically be resumed from its last step, with no more having to iterate through passed batches.
The prepare_data_loader() function is now independent of the Accelerator, giving you more flexibility in which API levels you would like to use.
DataLoader states, ensuring smoother training sessions.
set_epoch function for MpDeviceLoaderWrapper.
TransformerEngine FP8 training, including better defaults for the quantized FP8 weights.
Benchmark scripts verify the TransformerEngine integration works exactly as intended. These scripts run one half using 🤗 Accelerate's integration, the other with raw TransformerEngine, providing users with a nice example of what we do under the hood with accelerate, and a good sanity check to make sure nothing breaks down over time. Find them here.
A Docker image with TransformerEngine and accelerate as well. Use docker pull huggingface/accelerate@gpu-fp8-transformerengine to quickly get an environment going.
torchpippy no more, long live torch.distributed.pipelining
torchpippy is now fully integrated into torch core, and as a result we are exclusively supporting the PyTorch implementation from now on.
Inputs are now [1, n, n] rather than [2, n, n] as before.
pipelining no longer supports encoder/decoder models, so the t5 example has been removed.
Pin torchpippy potentially if needed.
You can now build the FullyShardedDataParallelPlugin yourself manually with no need for environment patching:
from accelerate import FullyShardedDataParallelPlugin
fsdp_plugin = FullyShardedDataParallelPlugin(...)
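The stateful-dataloader resumption described above can be sketched in plain Python. This is a toy illustration under assumed semantics, not torchdata's actual `StatefulDataLoader`: the loader records how many batches it has yielded, and `load_state_dict` fast-forwards past them so a restored run does not replay old batches.

```python
class ResumableLoader:
    """Toy stand-in for torchdata's StatefulDataLoader: tracks the number
    of batches already yielded so iteration can resume mid-epoch."""

    def __init__(self, batches):
        self.batches = list(batches)
        self.step = 0

    def state_dict(self):
        return {"step": self.step}

    def load_state_dict(self, state):
        self.step = state["step"]

    def __iter__(self):
        # Resume from the recorded step instead of replaying earlier batches
        for batch in self.batches[self.step:]:
            self.step += 1
            yield batch


loader = ResumableLoader([[0, 1], [2, 3], [4, 5]])
it = iter(loader)
next(it)                     # consume one batch
saved = loader.state_dict()  # checkpointed, e.g. as part of save_state()

fresh = ResumableLoader([[0, 1], [2, 3], [4, 5]])
fresh.load_state_dict(saved)  # restored, e.g. as part of load_state()
print(list(fresh))            # only the remaining batches are yielded
```

With `use_stateful_dataloader=True`, Accelerate's `load_state()` performs the equivalent restore for you.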
If you are not using accelerate launch and need to ensure the env variables are set up properly for model loading:
from accelerate.utils import enable_fsdp_ram_efficient_loading, disable_fsdp_ram_efficient_loading
enable_fsdp_ram_efficient_loading()
axolotl library, so very big kudos to their wonderful work
step when loading the state by @muellerzr in https://github.com/huggingface/accelerate/pull/2992
find_tied_params for models with shared layers by @qubvel in https://github.com/huggingface/accelerate/pull/2986
transformer_engine on import by @oraluben in https://github.com/huggingface/accelerate/pull/3056
skip_first_batches support for StatefulDataloader and fix all the tests by @muellerzr in https://github.com/huggingface/accelerate/pull/3068
step when loading the state by @muellerzr in https://github.com/huggingface/accelerate/pull/2992
find_tied_params for models with shared layers by @qubvel in https://github.com/huggingface/accelerate/pull/2986
end_training by @SunMarc in https://github.com/huggingface/accelerate/pull/3012
torchdata.stateful_dataloader.StatefulDataLoader within the Accelerator by @byi8220 in https://github.com/huggingface/accelerate/pull/2895
prepare_data_loader() from Accelerator by @siddk in https://github.com/huggingface/accelerate/pull/3047
transformer_engine on import by @oraluben in https://github.com/huggingface/accelerate/pull/3056
skip_first_batches support for StatefulDataloader and fix all the tests by @muellerzr in https://github.com/huggingface/accelerate/pull/3068
Small release this month, with key focuses on some added support for backends and bugs:
torch.float8_e4m3fn format dtype_byte_size by @SunMarc in https://github.com/huggingface/accelerate/pull/2945
device_map="auto" by @muellerzr in https://github.com/huggingface/accelerate/pull/2914
multi_gpu was being set and warning being printed even with num_processes=1 by @HarikrishnanBalagopal in https://github.com/huggingface/accelerate/pull/2921
pip caching in CI by @SauravMaheshkar in https://github.com/huggingface/accelerate/pull/2952
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.32.1...v0.33.0
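The `skip_first_batches` support referenced in the changelogs above amounts to dropping the batches a resumed epoch has already consumed. A minimal stdlib sketch, illustrative only and not Accelerate's implementation:

```python
from itertools import islice


def skip_first_batches(dataloader, num_batches):
    """Toy sketch of batch skipping: yield the remainder of an epoch
    after dropping the first `num_batches` batches already seen."""
    return islice(iter(dataloader), num_batches, None)


batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
resumed = list(skip_first_batches(batches, 2))
print(resumed)  # [[4, 5], [6, 7]]
```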
huggingface_hub rather than our own implementation (https://github.com/huggingface/accelerate/pull/2795)
dispatch_model (https://github.com/huggingface/accelerate/pull/2855)
Accelerator.step number is now restored when using save_state and load_state (https://github.com/huggingface/accelerate/pull/2765)
import accelerate and any other major core import by 68%, now should be only slightly longer than doing import torch (https://github.com/huggingface/accelerate/pull/2845)
get_backend and added a clear_device_cache utility (https://github.com/huggingface/accelerate/pull/2857)
allreduce. (https://github.com/huggingface/accelerate/pull/2841)
log_line_prefix_template optional the notebook_launcher (https://github.com/huggingface/accelerate/pull/2888)
accelerate merge-weights, one will be automatically created (https://github.com/huggingface/accelerate/pull/2854)
.safetensors (https://github.com/huggingface/accelerate/pull/2853)
torch>=2.4 (https://github.com/huggingface/accelerate/pull/2825)
@require_triton test decorator and enable test_dynamo work on xpu (https://github.com/huggingface/accelerate/pull/2878)
load_state_dict not working on xpu and refine xpu safetensors version check (https://github.com/huggingface/accelerate/pull/2879)
accelerate launch (https://github.com/huggingface/accelerate/pull/2902)
dispatch_model by @panjd123 in https://github.com/huggingface/accelerate/pull/2855
test_tracking.ClearMLTest by @faaany in https://github.com/huggingface/accelerate/pull/2863
torch_device instead of 0 for device check by @faaany in https://github.com/huggingface/accelerate/pull/2861
test_zero3_integration by @faaany in https://github.com/huggingface/accelerate/pull/2864
log_line_prefix_template Optional in Elastic Launcher for Backward Compatibility by @yhna940 in https://github.com/huggingface/accelerate/pull/2888
require_triton and enable test_dynamo work on xpu by @faaany in https://github.com/huggingface/accelerate/pull/2878
load_state_dict for xpu and refine xpu safetensor version check by @faaany in https://github.com/huggingface/accelerate/pull/2879
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.31.0...v0.32.0
timeout default to PyTorch defaults based on backend by @muellerzr in https://github.com/huggingface/accelerate/pull/2758
notebook_launcher by @yhna940 in https://github.com/huggingface/accelerate/pull/2788
logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions by @luowyang in https://github.com/huggingface/accelerate/pull/2730
notebook_launcher by @yhna940 in https://github.com/huggingface/accelerate/pull/2788
get_balanced_memory by @faaany in https://github.com/huggingface/accelerate/pull/2826
stage3_prefetch_bucket_size value to an integer by @adk9 in https://github.com/huggingface/accelerate/pull/2814
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.30.1...v0.31.0
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.30.0...v0.30.1
The tqdm wrapper is now fully passthrough: no need for tqdm(main_process_only, *args), it is now just tqdm(*args) and you can pass in is_main_process as a kwarg.
cann version info to command accelerate env for NPU by @statelesshz in https://github.com/huggingface/accelerate/pull/2689
deepspeed-specific Docker image by @muellerzr in https://github.com/huggingface/accelerate/pull/2707. To use, pull the gpu-deepspeed tag: docker pull huggingface/accelerate:cuda-deepspeed-nightly
is_train_batch_min type in DeepSpeedPlugin by @yhna940 in https://github.com/huggingface/accelerate/pull/2646
free_memory to deal with garbage collection by @muellerzr in https://github.com/huggingface/accelerate/pull/2716
execution_device by @faaany in https://github.com/huggingface/accelerate/pull/2612
is_train_batch_min type in DeepSpeedPlugin by @yhna940 in https://github.com/huggingface/accelerate/pull/2646
Repository anymore by @Wauplin in https://github.com/huggingface/accelerate/pull/2658
tqdm: *args should come ahead of main_process_only by @rb-synth in https://github.com/huggingface/accelerate/pull/2654
free_memory to deal with garbage collection by @muellerzr in https://github.com/huggingface/accelerate/pull/2716
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.29.3...v0.30.0
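The passthrough tqdm signature described above can be illustrated with a toy wrapper. The wrapper below is hypothetical (it does not draw a real progress bar); only the argument shape, positional args first and `is_main_process` as a keyword, mirrors the release note:

```python
def tqdm(*args, is_main_process=True, **kwargs):
    """Toy sketch of the passthrough signature: positional args go
    straight through, and `is_main_process` is an ordinary kwarg
    rather than a mandatory first positional argument."""
    iterable = args[0] if args else ()
    if not is_main_process:
        # Non-main processes iterate silently, with no progress bar
        return iter(iterable)
    # A real wrapper would hand *args/**kwargs to tqdm.tqdm here
    return iter(iterable)


print(list(tqdm(range(3))))                         # [0, 1, 2]
print(list(tqdm(range(3), is_main_process=False)))  # [0, 1, 2]
```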
load_checkpoint_and_dispatch needs a strict argument
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.29.2...v0.29.3
Fixed an import which would cause running accelerate CLI to fail if pytest wasn't installed
Enable it in accelerate config, set the ACCELERATE_CPU_AFFINITY=1 env variable, or set it manually using the following:
from accelerate.utils import set_numa_affinity

# For GPU 0
set_numa_affinity(0)
Big thanks to @stas00 for the recommendation, request, and feedback during development
set_seed by @muellerzr in https://github.com/huggingface/accelerate/pull/2569
BatchSamplerShard by @universuen in https://github.com/huggingface/accelerate/pull/2584
notebook_launcher can use multiple GPUs in Google Colab if using a custom instance that supports multiple GPUs by @StefanTodoran in https://github.com/huggingface/accelerate/pull/2561
load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in https://github.com/huggingface/accelerate/pull/2588
main_process_ip and master_addr when not using standard as deepspeed launcher by @asdfry in https://github.com/huggingface/accelerate/pull/2495
deepspeed, set it with the DS_ENV_FILE environmental variable by @muellerzr in https://github.com/huggingface/accelerate/pull/2566
main_process_ip and master_addr when not using standard as deepspeed launcher by @asdfry in https://github.com/huggingface/accelerate/pull/2495
load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in https://github.com/huggingface/accelerate/pull/2588
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.28.0...v0.29.0
DataLoaderConfiguration and begin deprecation of arguments in the Accelerator:
+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)
from accelerate import GradientAccumulationPlugin

plugin = GradientAccumulationPlugin(
    num_steps=2,
    sync_each_batch=True,
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)
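Numerically, gradient accumulation with num_steps=2 means the optimizer only steps every second batch, using the gradients of both batches; sync_each_batch controls whether gradients are synchronized across processes on every batch or only at the boundary step. A stdlib sketch of the accumulation bookkeeping (illustrative only, not Accelerate's implementation):

```python
def accumulate_updates(grads, num_steps):
    """Toy sketch of gradient accumulation: parameters are only updated
    every `num_steps` batches, using the accumulated (summed) gradient."""
    updates = []
    running = 0.0
    for i, g in enumerate(grads, start=1):
        running += g
        if i % num_steps == 0:  # boundary step: apply the update and reset
            updates.append(running)
            running = 0.0
    return updates


# Four per-batch gradients, accumulated two at a time -> two optimizer steps
print(accumulate_updates([1.0, 2.0, 3.0, 4.0], num_steps=2))  # [3.0, 7.0]
```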
launch changes
mpirun for multi-cpu training by @dmsuehir in https://github.com/huggingface/accelerate/pull/2493
is_torch_tensor over hasattr for torch.compile. by @PhilJd in https://github.com/huggingface/accelerate/pull/2387
DataLoaderConfig by @muellerzr in https://github.com/huggingface/accelerate/pull/2441
is_namedtuple implementation by @fxmarty in https://github.com/huggingface/accelerate/pull/2475
os.path.sep.join path manipulations with a helper by @akx in https://github.com/huggingface/accelerate/pull/2446
XLA device type by @will-cromar in https://github.com/huggingface/accelerate/pull/2467
Accelerator to detect distributed type from the "LOCAL_RANK" env variable for XPU by @faaany in https://github.com/huggingface/accelerate/pull/2473
accelerate launch by @muellerzr in https://github.com/huggingface/accelerate/pull/2498
----main_process_port to --main_process_port) by @DerrickWang005 in https://github.com/huggingface/accelerate/pull/2516
PYTORCH_NVML_BASED_CUDA_CHECK when calling accelerate.utils.imports.is_cuda_available() by @luiscape in https://github.com/huggingface/accelerate/pull/2524
env=os.environ.copy()s by @akx in https://github.com/huggingface/accelerate/pull/2449
zero_grad(set_to_none=None) to align with PyTorch by @yongchanghao in https://github.com/huggingface/accelerate/pull/2472
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.27.2...v0.28.0
With the latest release of PyTorch 2.2.0, we've ensured that Accelerate has no breaking changes with it
With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so no need to use Megatron or DeepSpeed)! This supports automatic model-weight splitting to each device using a similar API to device_map="auto". This is still under heavy development, however the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.
Requires pippy of version 0.2.0 or later (pip install torchpippy -U)
Example usage (combined with accelerate launch or torchrun):
import torch
from transformers import AutoModelForSequenceClassification

from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
# Build an example input first so it can be traced through the model
input = torch.randint(0, 50257, (1, 16))
model = prepare_pippy(model, split_points="auto", example_args=(input,))
input = input.to("cuda:0")
with torch.no_grad():
    output = model(input)
# The outputs are only on the final process by default
# You can pass in `gather_outputs=True` to prepare_pippy to
# make them available on all processes
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)
This release provides support for utilizing DeepSpeed on XPU devices thanks to @faaany
dispatch_model, and in forward with offloading by @fxmarty in https://github.com/huggingface/accelerate/pull/2330
accelerate config by @faaany in https://github.com/huggingface/accelerate/pull/2346
block_size picking in megatron_lm_gpt_pretraining example. by @nilq in https://github.com/huggingface/accelerate/pull/2342
FP8RecipeKwargs by @sudhakarsingh27 in https://github.com/huggingface/accelerate/pull/2355
add_hook_to_module and remove_hook_from_module compatibility with fx.GraphModule by @fxmarty in https://github.com/huggingface/accelerate/pull/2369
requires_grad to kwargs when registering empty parameters. by @BlackSamorez in https://github.com/huggingface/accelerate/pull/2376
adapter_only option to save_fsdp_model and load_fsdp_model to only save/load PEFT weights by @AjayP13 in https://github.com/huggingface/accelerate/pull/2321
split_batches by @izhx in https://github.com/huggingface/accelerate/pull/2344
nproc_per_node in the multi gpu test by @faaany in https://github.com/huggingface/accelerate/pull/2422
Accelerator to prepare models in eval mode for XPU&CPU by @faaany in https://github.com/huggingface/accelerate/pull/2426
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.26.1...v0.27.0
dispatch_batches=True by @SunMarc in https://github.com/huggingface/accelerate/pull/2325
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.26.0...v0.26.1
This release adds support for the MS-AMP (Microsoft Automatic Mixed Precision Library) into Accelerate as an alternative backend for doing FP8 training on appropriate hardware. It is the default backend of choice. Read more in the docs here. Introduced in https://github.com/huggingface/accelerate/pull/2232 by @muellerzr
In the prior release a new sampler for the DataLoader was introduced that while across seeds does not show statistical differences in the results, repeating the same seed would result in a different end-accuracy that was scary to some users. We have now disabled this behavior by default as it required some additional setup, and brought back the original implementation. To have the new sampling technique (which can provide more accurate repeated results) pass use_seedable_sampler=True to the Accelerator. We will be propagating this up to the Trainer soon.
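The seedable-sampler idea above, shuffling as a pure function of (seed, epoch) so that repeated runs with the same seed reproduce exactly while each epoch still reshuffles, can be sketched with the standard library. This is illustrative only, not the actual SeedableRandomSampler:

```python
import random


def seedable_permutation(n, seed, epoch):
    """Toy sketch of a seedable sampler: the shuffle order is a pure
    function of (seed, epoch), so re-running with the same seed
    reproduces the same ordering, while each epoch reshuffles."""
    order = list(range(n))
    random.Random(seed + epoch).shuffle(order)
    return order


# Same seed + epoch -> identical order across runs
print(seedable_permutation(8, seed=42, epoch=0))
print(seedable_permutation(8, seed=42, epoch=1))
```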
For device_map, we've made it possible to not return grouped key results if desired in https://github.com/huggingface/accelerate/pull/2233
device_map="cuda" etc thanks to @younesbelkada in https://github.com/huggingface/accelerate/pull/2254
Many improvements to the docs have been made thanks to @stas00. Along with this we've made it easier to adjust the config for the sharding strategy and other config values thanks to @pacman100 in https://github.com/huggingface/accelerate/pull/2288
A regression in Accelerate 0.23.0 occurred that showed learning is much slower on multi-GPU setups compared to a single GPU. https://github.com/huggingface/accelerate/pull/2304 has now fixed this thanks to @pacman100
The DeepSpeed integration now also handles auto values better when making a configuration in https://github.com/huggingface/accelerate/pull/2313
Params4bit added to bnb classes in set_module_tensor_to_device() by @poedator in https://github.com/huggingface/accelerate/pull/2315
For developers, we've made it much easier to run the tests on different devices with no change to the code thanks to @statelesshz in https://github.com/huggingface/accelerate/pull/2123 and https://github.com/huggingface/accelerate/pull/2235
offload_state_dict=True and dtype is specified by @fxmarty in https://github.com/huggingface/accelerate/pull/2116
auto values for comm buffers by @stas00 in https://github.com/huggingface/accelerate/pull/2295
offload_state_dict=True and dtype is specified by @fxmarty in https://github.com/huggingface/accelerate/pull/2116
[Big-Modeling] Harmonize device check to handle corner cases by @younesbelkada in https://github.com/huggingface/accelerate/pull/2254
log_images for aim tracker by @Justin900429 in https://github.com/huggingface/accelerate/pull/2257
check_tied_parameters_on_same_device by @SunMarc in https://github.com/huggingface/accelerate/pull/2218
auto values for comm buffers by @stas00 in https://github.com/huggingface/accelerate/pull/2295
prepare_data_loader by @izhx in https://github.com/huggingface/accelerate/pull/2310
Params4bit added to bnb classes in set_module_tensor_to_device() by @poedator in https://github.com/huggingface/accelerate/pull/2315
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.25.0...v0.26.0
As of this release, safetensors will be the default format saved when applicable! To read more about safetensors and why it's best to use it for safety (and not pickle/torch.save), check it out here
This release has two new experiment trackers, ClearML and DVCLive!
To use them, just pass clear_ml or dvclive to log_with in the Accelerator init. h/t to @eugen-ajechiloae-clearml and @dberenbaum
FSDP had a huge refactoring so that the interface when using FSDP is the exact same as every other scenario when using accelerate. No more needing to call accelerator.prepare() twice!
We now try to disable P2P communications on consumer GPUs for the 3090 series and beyond. Without this, users were seeing timeout issues and the like as NVIDIA dropped P2P support. If using accelerate launch we will automatically disable it, and if we sense that P2P is still enabled on distributed setups using 3090s or newer, we will raise an error.
When doing .gather(), if tensors are on different devices we explicitly will raise an error (for now only valid on CUDA)
shuffle=True when using multiple GPUs and the new SeedableRandomSampler.
save as False by @muellerzr in https://github.com/huggingface/accelerate/pull/2138
launch, and pick up in state if a user will face issues. by @muellerzr in https://github.com/huggingface/accelerate/pull/2195
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.24.1...v0.25.0
One critical issue with Accelerate was that training runs differed when using an iterable dataset, no matter what seeds were set. v0.24.0 introduces the dataloader.set_epoch() function to all Accelerate DataLoaders, where if the underlying dataset (or sampler) has the ability to set the epoch for reproducibility it will do so. This is similar to the implementation already existing in transformers. To use:
dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)
For more information see this PR, we will update the docs on a subsequent release with more information on this API.
save and save_state via the ProjectConfiguration dataclass. See #1953 for more info.
bfloat16 mixed precision via torch.autocast
all_gather_into_tensor is now used as the main gather operation, reducing memory in the cases of big tensors
drop_last=True will now properly have the desired effect when performing Accelerator().gather_for_metrics()
dispatch_model by @austinapatel in https://github.com/huggingface/accelerate/pull/1971
save and save_state via ProjectConfiguration by @muellerzr in https://github.com/huggingface/accelerate/pull/1953
torch.autocast for bfloat16 mixed precision by @brcps12 in https://github.com/huggingface/accelerate/pull/2033
all_gather_into_tensor by @muellerzr in https://github.com/huggingface/accelerate/pull/1968
gather_for_metrics by @muellerzr in https://github.com/huggingface/accelerate/pull/2048
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.23.0...v0.24.0
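The gather_for_metrics behavior above relates to how distributed gathering pads the final incomplete batch: processes duplicate samples so shapes match, and after the all-gather the result is truncated back to the true dataset length so the duplicates never skew metrics. A toy sketch of that truncation (not Accelerate's code):

```python
def trim_padding(gathered, dataset_length):
    """Toy sketch of gather_for_metrics' final step: after all processes'
    results are gathered, truncate back to the true dataset length so
    duplicated padding samples from the last batch are dropped."""
    return gathered[:dataset_length]


# 10 real samples; the last two entries are padding duplicates of 8 and 9
gathered = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 9]
print(trim_padding(gathered, dataset_length=10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```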
A new model estimation tool to help calculate how much memory is needed for inference has been added. This does not download the pretrained weights, and utilizes init_empty_weights to stay memory efficient during the calculation.
Usage directions:
accelerate estimate-memory {model_name} --library {library_name} --dtypes fp16 int8
Or:
from accelerate.commands.estimate import estimate_command_parser, estimate_command, gather_data
parser = estimate_command_parser()
args = parser.parse_args(["bert-base-cased", "--dtypes", "float32"])
output = gather_data(args)
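Under the hood, the estimator's headline number is simple arithmetic: parameter count times bytes per dtype element. A back-of-the-envelope sketch — the per-dtype byte sizes are standard, but the parameter count below is approximate and used only for illustration:

```python
def estimate_memory_bytes(num_params, dtype):
    """Back-of-the-envelope arithmetic behind memory estimation:
    parameters x bytes per element for the given dtype."""
    bytes_per_param = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}
    return num_params * bytes_per_param[dtype]


# ~110M parameters (roughly bert-base-cased) in float32:
gib = estimate_memory_bytes(110_000_000, "float32") / 2**30
print(f"{gib:.2f} GiB")  # about 0.41 GiB for the weights alone
```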
We've made the huggingface_hub library a first-class citizen of the framework! While this is mainly for the model estimation tool, this opens the doors for further integrations should they be wanted
Accelerator Enhancements:
gather_for_metrics will now also de-dupe for non-tensor objects. See #1937
mixed_precision="bf16" support on NPU devices. See #1949
breakpoint API to help when dealing with trying to break from a condition on a single process. See #1940
torch.compile support was fixed. See #1919
gradient_accumulation_steps to "auto" in your deepspeed config, and Accelerate will use the one passed to Accelerator instead (#1901)
accelerate config on npu by @statelesshz in https://github.com/huggingface/accelerate/pull/1895
[Tests] Finish all todos by @younesbelkada in https://github.com/huggingface/accelerate/pull/1957
force_hooks to dispatch_model by @austinapatel in https://github.com/huggingface/accelerate/pull/1969
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.22.0...v0.23.0
A new framework has been introduced which can help catch timeout errors caused by distributed operations failing before they occur. As this adds a tiny bit of overhead, it is an opt-in scenario. Simply run your code with ACCELERATE_DEBUG_MODE="1" to enable this. Read more in the docs, introduced via https://github.com/huggingface/accelerate/pull/1756
Accelerator.load_state can now load the most recent checkpoint automatically. If a ProjectConfiguration has been made, using accelerator.load_state() (without any arguments passed) can now automatically find and load the latest checkpoint used, introduced via https://github.com/huggingface/accelerate/pull/1741
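The automatic-latest-checkpoint behavior can be sketched as scanning the project directory for the highest-numbered checkpoint folder. The `checkpoint_<n>` directory naming below is an assumption for illustration, not necessarily Accelerate's exact layout:

```python
import os
import re
import tempfile


def latest_checkpoint(project_dir):
    """Toy sketch of automatic checkpoint discovery: return the
    checkpoint_<n> subdirectory with the highest step number."""
    best, best_step = None, -1
    for name in os.listdir(project_dir):
        m = re.fullmatch(r"checkpoint_(\d+)", name)
        if m and int(m.group(1)) > best_step:
            best, best_step = name, int(m.group(1))
    return best


with tempfile.TemporaryDirectory() as d:
    for step in (0, 5, 12):
        os.mkdir(os.path.join(d, f"checkpoint_{step}"))
    print(latest_checkpoint(d))  # checkpoint_12
```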
In this release multiple new enhancements to distributed gradient accumulation have been added.
accelerator.accumulate() now supports passing in multiple models, introduced via https://github.com/huggingface/accelerate/pull/1708
.backward() via https://github.com/huggingface/accelerate/pull/1726
DataLoaderDispatcher added via https://github.com/huggingface/accelerate/pull/1846
no_sync by @NouamaneTazi in https://github.com/huggingface/accelerate/pull/1726
get_scale() by patching the step method of optimizer. by @yuxinyuan in https://github.com/huggingface/accelerate/pull/1720
set_module_tensor_to_device. by @Narsil in https://github.com/huggingface/accelerate/pull/1731
__repr__ of AlignDevicesHook by @KacperWyrwal in https://github.com/huggingface/accelerate/pull/1735
KwargsHandler.to_kwargs not working with os.environ initialization in __post_init__ by @CyCle1024 in https://github.com/huggingface/accelerate/pull/1738
autocast kwargs and simplify autocast wrapper by @muellerzr in https://github.com/huggingface/accelerate/pull/1740
Accelerator.save_state using multi-gpu by @CyCle1024 in https://github.com/huggingface/accelerate/pull/1760
max_memory argument is in unexpected order by @ranchlai in https://github.com/huggingface/accelerate/pull/1759
is_aim_available() function to not match aim >= 4.0.0 by @alberttorosyan in https://github.com/huggingface/accelerate/pull/1769
load_fsdp_optimizer by @awgu in https://github.com/huggingface/accelerate/pull/1755
torch.distributed is disabled by @natsukium in https://github.com/huggingface/accelerate/pull/1800
get_balanced_memory to avoid OOM by @ranchlai in https://github.com/huggingface/accelerate/pull/1798
convert_file_size_to_int by @ranchlai in https://github.com/huggingface/accelerate/pull/1799
allow_val_change by @SumanthRH in https://github.com/huggingface/accelerate/pull/1796
gather_for_metrics by @dleve123 in https://github.com/huggingface/accelerate/pull/1784
load_and_quantize_model arg by @JonathanRayner in https://github.com/huggingface/accelerate/pull/1822
init_on_device by @shingjan in https://github.com/huggingface/accelerate/pull/1826
unwrap_model and keep_fp32_wrapper=False by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/1838
verify_device_map by @Rexhaif in https://github.com/huggingface/accelerate/pull/1842
gpu_ids (Rel. Issue #1848) by @devymex in https://github.com/huggingface/accelerate/pull/1850
fsdp_with_peak_mem_tracking.py by @pacman100 in https://github.com/huggingface/accelerate/pull/1856
init_on_device by @shingjan in https://github.com/huggingface/accelerate/pull/1852
DataLoaderDispatcher by @thevasudevgupta in https://github.com/huggingface/accelerate/pull/1846
The following contributors have made significant changes to the library over the last release:
Accelerator.accumulate() (#1708)
no_sync (#1726)
DataLoaderDispatcher (#1846)
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.21.0...v0.22.0
You can now quantize any model (not just Transformer models) using Accelerate. This is mainly for models having a lot of linear layers. See the documentation for more information!
Accelerate now supports Ascend NPUs.
Accelerate now requires Python 3.8+ and PyTorch 1.10+ :
🚨🚨🚨 Spring cleaning: Python 3.8 🚨🚨🚨 by @muellerzr in #1661
🚨🚨🚨 Spring cleaning: PyTorch 1.10 🚨🚨🚨 by @muellerzr in #1662
[doc build] Use secrets by @mishig25 in #1551
Update launch.mdx by @LiamSwayne in #1553
Avoid double wrapping of all accelerate.prepare objects by @muellerzr in #1555
Update README.md by @LiamSwayne in #1556
Fix load_state_dict when there is one device and disk by @sgugger in #1557
Fix tests not being ran on multi-GPU nightly by @muellerzr in #1558
fix the typo when setting the "_accelerator_prepared" attribute by @Yura52 in #1560
[core] Fix possibility to pass NoneType objects in prepare by @younesbelkada in #1561
Reset dataloader end_of_datalaoder at each iter by @sgugger in #1562
Update big_modeling.mdx by @LiamSwayne in #1564
[bnb] Fix failing int8 tests by @younesbelkada in #1567
Update gradient sync docs to reflect importance of optimizer.step() by @dleve123 in #1565
Update mixed precision integrations in README by @sgugger in #1569
Raise error instead of warn by @muellerzr in #1568
Introduce listify, fix tensorboard silently failing by @muellerzr in #1570
Check for bak and expand docs on directory structure by @muellerzr in #1571
Perminant solution by @muellerzr in #1577
fix the bug in xpu by @mingxiaoh in #1508
Make sure that we only set is_accelerator_prepared on items accelerate actually prepares by @muellerzr in #1578
Expand prepare() doc by @muellerzr in #1580
Get Torch version using importlib instead of pkg_resources by @catwell in #1585
improve oob performance when use mpirun to start DDP finetune without accelerate launch by @sywangyi in #1575
Update training_tpu.mdx by @LiamSwayne in #1582
Return false if CUDA available by @muellerzr in #1581
fix logger level by @caopulan in #1579
Fix test by @muellerzr in #1586
Update checkpoint.mdx by @LiamSwayne in #1587
FSDP updates by @pacman100 in #1576
Update modeling.py by @ain-soph in #1595
Integration tests by @muellerzr in #1593
Add triggers for CI workflow by @muellerzr in #1597
Remove asking xpu plugin for non xpu devices by @abhilash1910 in #1594
Remove GPU safetensors env variable by @sgugger in #1603
reset end_of_dataloader for dataloader_dispatcher by @megavaz in #1609
fix for arc gpus by @abhilash1910 in #1615
Ignore low_zero option when only device is available by @sgugger in #1617
Fix failing multinode tests by @muellerzr in #1616
Doc to md by @sgugger in #1618
Fix tb issue by @muellerzr in #1623
Fix workflow by @muellerzr in #1625
Fix transformers sync bug with accumulate by @muellerzr in #1624
fixes offload dtype by @SunMarc in #1631
fix: Megatron is not installed. please build it from source. by @yuanwu2017 in #1636
deepspeed z2/z1 state_dict bloating fix by @pacman100 in #1638
Swap disable rich by @muellerzr in #1640
fix autocasting bug by @pacman100 in #1637
fix modeling low zero by @abhilash1910 in #1634
Add skorch to runners by @muellerzr in #1646
add save model by @SunMarc in #1641
Change dispatch_model when we have only one device by @SunMarc in #1648
Doc save model by @SunMarc in #1650
Fix device_map by @SunMarc in #1651
Check for port usage before launch by @muellerzr in #1656
[BigModeling] Add missing check for quantized models by @younesbelkada in #1652
Bump integration by @muellerzr in #1658
TIL by @muellerzr in #1657
docker cpu py version by @muellerzr in #1659
[BigModeling] Final fix for dispatch int8 and fp4 models by @younesbelkada in #1660
remove safetensor dep on shard_checkpoint by @SunMarc in #1664
change the import place to avoid import error by @pacman100 in #1653
Update broken Runhouse link in examples/README.md by @dongreenberg in #1668
Bnb quantization by @SunMarc in #1626
replace save funct in doc by @SunMarc in #1672
Doc big model inference by @SunMarc in #1670
Add docs for saving Transformers models by @deppen8 in #1671
fix bnb tests by @SunMarc in #1679
Fix workflow CI by @muellerzr in #1690
remove duplicate class by @SunMarc in #1691
update readme in examples by @statelesshz in #1678
Fix nightly tests by @muellerzr in #1696
Fixup docs by @muellerzr in #1697
Improve quality errors by @muellerzr in #1698
Move mixed precision wrapping ahead of DDP/FSDP wrapping by @ChenWu98 in #1682
Add offload for 8-bit model by @SunMarc in #1699
Deepcopy on Accelerator to return self by @muellerzr in #1694
Update tracking.md by @stevhliu in #1702
Skip tests when bnb isn't available by @muellerzr in #1706
Fix launcher validation by @abhilash1910 in #1705
Fixes for issue #1683: failed to run accelerate config in colab by @Erickrus in #1692
Fix the bug where DataLoaderDispatcher gets stuck in an infinite wait when the dataset is an IterDataPipe during multi-process training. by @yuxinyuan in #1709
add multi_gpu decorator by @SunMarc in #1712
Modify loading checkpoint behavior by @SunMarc in #1715
fix version by @SunMarc in #1701
Keep old behavior by @muellerzr in #1716
Optimize get_scale to reduce async calls by @muellerzr in #1718
Remove duplicate code by @muellerzr in #1717
New tactic by @muellerzr in #1719
add Comfy-UI by @pacman100 in #1723
add compatibility with peft by @SunMarc in #1725
The following contributors have made significant changes to the library over the last release: