Hugging Face Accelerate
Releases
Sep 3, 2024
v0.34.0: StatefulDataLoader Support, FP8 Improvements, and PyTorch Updates!

Dependency Changes

  • Updated Safetensors Requirement: The library now requires safetensors version 0.4.3.
  • Added support for Numpy 2.0: The library now fully supports numpy 2.0.0

Core

New Script Behavior Changes

  • Process Group Management: PyTorch now requires users to destroy process groups after training. The accelerate library will handle this automatically with accelerator.end_training(), or you can do it manually using PartialState().destroy_process_group().
  • MLU Device Support: Added support for saving and loading RNG states on MLU devices by @huismiling
  • NPU Support: Corrected backend and distributed settings when using transfer_to_npu, ensuring better performance and compatibility.

DataLoader Enhancements

  • Stateful DataLoader: We are excited to announce early support for the StatefulDataLoader from torchdata, allowing better handling of data loading states. Enable it by passing use_stateful_dataloader=True to the DataLoaderConfiguration; when calling load_state(), the DataLoader will automatically resume from its last step, with no need to iterate through already-seen batches.
  • Decoupled Data Loader Preparation: The prepare_data_loader() function is now independent of the Accelerator, giving you more flexibility towards which API levels you would like to use.
  • XLA Compatibility: Added support for skipping initial batches when using XLA.
  • Improved State Management: Bug fixes and enhancements for saving/loading DataLoader states, ensuring smoother training sessions.
  • Epoch Setting: Introduced the set_epoch function for MpDeviceLoaderWrapper.

FP8 Training Improvements

  • Enhanced FP8 Training: Fully Sharded Data Parallelism (FSDP) and DeepSpeed support now work seamlessly with TransformerEngine FP8 training, including better defaults for the quantized FP8 weights.
  • Integration baseline: We've added a new suite of examples and benchmarks to ensure that our TransformerEngine integration works exactly as intended. These scripts run one half using 🤗 Accelerate's integration and the other half with raw TransformerEngine, providing users with a clear example of what we do under the hood with accelerate and a good sanity check to make sure nothing breaks over time. Find them here
  • Import Fixes: Resolved issues with import checks for TransformerEngine that caused downstream problems.
  • FP8 Docker Images: We've added new docker images for TransformerEngine and accelerate as well. Use docker pull huggingface/accelerate@gpu-fp8-transformerengine to quickly get an environment going.

torchpippy no more, long live torch.distributed.pipelining

  • With the latest PyTorch release, torchpippy is now fully integrated into torch core, and as a result we are exclusively supporting the PyTorch implementation from now on.
  • This shift comes with breaking changes to the examples and API. Namely:
    • Tracing of inputs is done with the shape each GPU will see, rather than the size of the total batch. So for 2 GPUs, one should pass in an input of [1, n, n] rather than [2, n, n] as before.
    • We no longer support encoder/decoder models. PyTorch tracing for pipelining no longer supports them, so the t5 example has been removed.
    • Computer vision model support currently does not work: there are some tracing issues with ResNets that we are actively looking into.
  • If any of these changes are too breaking, we recommend pinning your accelerate version. If encoder/decoder model support is actively blocking your inference using pippy, please open an issue and let us know. We can potentially look at adding back the old torchpippy support if needed.

Fully Sharded Data Parallelism (FSDP)

  • Environment Flexibility: Environment variables are now fully optional for FSDP, simplifying configuration. You can now fully create a FullyShardedDataParallelPlugin yourself manually with no need for environment patching:
from accelerate import FullyShardedDataParallelPlugin
fsdp_plugin = FullyShardedDataParallelPlugin(...)
  • FSDP RAM efficient loading: Added a utility to enable RAM-efficient model loading (by setting the proper environment variable). This is generally needed if you are not using accelerate launch and need to ensure the env variables are set up properly for model loading:
from accelerate.utils import enable_fsdp_ram_efficient_loading, disable_fsdp_ram_efficient_loading
enable_fsdp_ram_efficient_loading()
  • Model State Dict Management: Enhanced support for unwrapping model state dicts in FSDP, making it easier to manage distributed models.

New Examples

Bug Fixes

New Contributors

Full Changelog:

Detailed Full Changelog:

Aug 8, 2024
v0.33.0: MUSA backend support and bugfixes

MUSA backend support and bugfixes

Small release this month, focused on added backend support and bug fixes:

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.32.1...v0.33.0

Jul 3, 2024
v0.32.0: Profilers, new hooks, speedups, and more!

Core

Distributed Data Parallelism

FSDP

XPU

XLA

Examples

Full Changelog

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.31.0...v0.32.0

Jun 7, 2024
v0.31.0: Better support for sharded state dict with FSDP and Bugfixes

Core

FSDP

Megatron

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.30.1...v0.31.0

May 10, 2024
v0.30.1: Bugfixes

Patchfix

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.30.0...v0.30.1

May 3, 2024
v0.30.0: Advanced optimizer support, MoE DeepSpeed support, add upcasting for FSDP, and more

Core

Documentation

  • Through collaboration between @fabianlim (lead contributor), @stas00, @pacman100, and @muellerzr, we have a new concept guide out for FSDP and DeepSpeed, explicitly detailing how each interoperates and explaining fully and clearly how each of them works. This was a monumental effort by @fabianlim to ensure that everything is as accurate as possible for users. I highly recommend visiting this new documentation, available here
  • New distributed inference examples have been added thanks to @SunMarc in https://github.com/huggingface/accelerate/pull/2672
  • Fixed some docs for using internal trackers by @brentyi in https://github.com/huggingface/accelerate/pull/2650

DeepSpeed

Megatron

Big Modeling

Bug Fixes

Full Changelog

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.29.3...v0.30.0

Apr 17, 2024
v0.29.3: Patchfix

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.29.2...v0.29.3

Apr 9, 2024
v0.29.2: Patchfix
Apr 5, 2024
v0.29.1: Patchfix

Fixed an import that would cause the accelerate CLI to fail if pytest wasn't installed

v0.29.0: NUMA affinity control, MLU Support, and DeepSpeed Improvements

Core

  • Accelerate can now optimize NUMA affinity, which can help increase throughput on NVIDIA multi-GPU systems. To enable it either follow the prompt during accelerate config, set the ACCELERATE_CPU_AFFINITY=1 env variable, or manually using the following:
from accelerate.utils import set_numa_affinity

# For GPU 0
set_numa_affinity(0)

Big thanks to @stas00 for the recommendation, request, and feedback during development

Big Model Inference

DeepSpeed

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.28.0...v0.29.0

Mar 12, 2024
v0.28.0: DataLoaderConfig, XLA improvements, FSDP + QLORA foundations, Gradient Synchronization Tweaks, and Bug Fixes

Core

  • Introduce a DataLoaderConfiguration and begin deprecation of arguments in the Accelerator

-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)

  • Allow gradients to be synced on each data batch while performing gradient accumulation, via the new sync_each_batch argument on the GradientAccumulationPlugin

from accelerate import GradientAccumulationPlugin

plugin = GradientAccumulationPlugin(
    num_steps=2,
    sync_each_batch=True,
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)

Torch XLA

FSDP

launch changes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.27.2...v0.28.0

Feb 9, 2024
v0.27.0: PyTorch 2.2.0 Support, PyTorch-Native Pipeline Parallelism, DeepSpeed XPU support, and Bug Fixes

PyTorch 2.2.0 Support

With the latest release of PyTorch 2.2.0, we've verified that Accelerate has no breaking changes with it

PyTorch-Native Pipeline Parallel Inference

With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so no need to use Megatron or DeepSpeed)! This supports automatic model-weight splitting to each device using a similar API to device_map="auto". This is still under heavy development; however, the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.

Requires torchpippy version 0.2.0 or later (pip install torchpippy -U)

Example usage (combined with accelerate launch or torchrun):

import torch
from transformers import AutoModelForSequenceClassification

from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")

# `input` is an example batch used for tracing; define it and place it on
# the first device *before* preparing the model
input = input.to("cuda:0")
model = prepare_pippy(model, split_points="auto", example_args=(input,))

with torch.no_grad():
    output = model(input)

# The outputs are only on the final process by default.
# You can pass `gather_outputs=True` to prepare_pippy to
# make them available on all processes.
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)

DeepSpeed

This release provides support for utilizing DeepSpeed on XPU devices thanks to @faaany

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.26.1...v0.27.0

Jan 11, 2024
v0.26.1: Patch Release

What's Changed

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.26.0...v0.26.1

v0.26.0 - MS-AMP Support, Critical Regression Fixes, and More

Support for MS-AMP

This release adds support for MS-AMP (the Microsoft Automatic Mixed Precision Library) in Accelerate as an alternative backend for doing FP8 training on appropriate hardware. It is the default backend of choice. Read more in the docs here. Introduced in https://github.com/huggingface/accelerate/pull/2232 by @muellerzr

Core

In the prior release, a new sampler for the DataLoader was introduced. While it showed no statistical differences in results across seeds, repeating the same seed would produce a different end-accuracy, which was alarming to some users. We have now disabled this behavior by default, as it required some additional setup, and brought back the original implementation. To use the new sampling technique (which can provide more accurate repeated results), pass use_seedable_sampler=True to the Accelerator. We will be propagating this up to the Trainer soon.

Big Model Inference

FSDP and DeepSpeed

Bits and Bytes

Device Agnostic Testing

For developers, we've made it much easier to run the tests on different devices with no change to the code thanks to @statelesshz in https://github.com/huggingface/accelerate/pull/2123 and https://github.com/huggingface/accelerate/pull/2235

Bug Fixes

Major Contributors

  • @statelesshz for their work on device-agnostic testing and NPU support
  • @stas00 for many docfixes when it comes to DeepSpeed and FSDP

General Changelog

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.25.0...v0.26.0

Dec 1, 2023
v0.25.0: safetensors by default, new trackers, and plenty of bug fixes

Safetensors default

As of this release, safetensors will be the default format saved when applicable! To read more about safetensors and why it's best to use it for safety (and not pickle/torch.save), check it out here

New Experiment Trackers

This release has two new experiment trackers, ClearML and DVCLive!

To use them, just pass clear_ml or dvclive to log_with in the Accelerator init. h/t to @eugen-ajechiloae-clearml and @dberenbaum

DeepSpeed

  • Accelerate's DeepSpeed integration now supports NPU devices, h/t to @statelesshz
  • DeepSpeed can now be launched via accelerate on single GPU setups

FSDP

FSDP had a huge refactoring so that the interface when using FSDP is the exact same as every other scenario when using accelerate. No more needing to call accelerator.prepare() twice!

Other useful enhancements

  • We now try to disable P2P communications on consumer GPUs of the 3090 series and beyond. Without this, users were seeing timeout issues and the like, as NVIDIA has dropped P2P support there. If using accelerate launch we will disable it automatically, and if we detect that it is still enabled on distributed setups using 3090s or newer, we will raise an error.

  • When doing .gather(), if tensors are on different devices we now explicitly raise an error (for now only valid on CUDA)

Bug fixes

  • Fixed a bug that caused dataloaders to not shuffle despite shuffle=True when using multiple GPUs and the new SeedableRandomSampler.

General Changelog

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.24.1...v0.25.0

Oct 30, 2023
v0.24.1: Patch Release for Samplers
Oct 24, 2023
v0.24.0: Improved Reproducibility, Bug Fixes, and other Small Improvements

Improved Reproducibility

One critical issue with Accelerate was that training runs would differ when using an iterable dataset, no matter what seeds were set. v0.24.0 introduces the dataloader.set_epoch() function on all Accelerate DataLoaders: if the underlying dataset (or sampler) has the ability to set the epoch for reproducibility, it will do so. This is similar to the implementation already existing in transformers. To use:

dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)

For more information see this PR; we will update the docs in a subsequent release with more information on this API.

Documentation

  • The quick tour docs have gotten a complete makeover thanks to @MKhalusova. Take a look here
  • We also now have documentation on how to perform multinode training, see the launch docs

Internal structure

  • Shared file systems are now supported under save and save_state via the ProjectConfiguration dataclass. See #1953 for more info.
  • FSDP can now be used for bfloat16 mixed precision via torch.autocast
  • all_gather_into_tensor is now used as the main gather operation, reducing memory in the cases of big tensors
  • Specifying drop_last=True will now properly have the desired effect when performing Accelerator().gather_for_metrics()

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.23.0...v0.24.0

Sep 14, 2023
v0.23.0: Model Memory Estimation tool, Breakpoint API, Multi-Node Notebook Launcher Support, and more!

Model Memory Estimator

A new model estimation tool to help calculate how much memory is needed for inference has been added. This does not download the pretrained weights, and utilizes init_empty_weights to stay memory efficient during the calculation.

Usage directions:

accelerate estimate-memory {model_name} --library {library_name} --dtypes fp16 int8

Or:

from accelerate.commands.estimate import estimate_command_parser, estimate_command, gather_data

parser = estimate_command_parser()
args = parser.parse_args(["bert-base-cased", "--dtypes", "float32"])
output = gather_data(args)

🤗 Hub is a first-class citizen

We've made the huggingface_hub library a first-class citizen of the framework! While this is mainly for the model estimation tool, it opens the door for further integrations should they be wanted

Accelerator Enhancements:

  • gather_for_metrics will now also de-dupe for non-tensor objects. See #1937
  • mixed_precision="bf16" support on NPU devices. See #1949
  • New breakpoint API to help when dealing with trying to break from a condition on a single process. See #1940

Notebook Launcher Enhancements:

  • The notebook launcher now supports launching across multiple nodes! See #1913

FSDP Enhancements:

DeepSpeed Enhancements:

  • XPU/ccl support (#1827)
  • Easier gradient accumulation support, simply set gradient_accumulation_steps to "auto" in your deepspeed config, and Accelerate will use the one passed to Accelerator instead (#1901)
  • Support for custom schedulers and deepspeed optimizers (#1909)

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.22.0...v0.23.0

Aug 23, 2023
v0.22.0: Distributed operation framework, Gradient Accumulation enhancements, FSDP enhancements, and more!

Experimental distributed operations checking framework

A new framework has been introduced which can help catch timeout errors caused by distributed operations failing before they occur. As this adds a tiny bit of overhead, it is an opt-in scenario. Simply run your code with ACCELERATE_DEBUG_MODE="1" to enable this. Read more in the docs, introduced via https://github.com/huggingface/accelerate/pull/1756
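For example (train.py is a placeholder for your own training script):

```shell
ACCELERATE_DEBUG_MODE="1" accelerate launch train.py
```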

Accelerator.load_state can now load the most recent checkpoint automatically

If a ProjectConfiguration has been made, using accelerator.load_state() (without any arguments passed) can now automatically find and load the latest checkpoint used, introduced via https://github.com/huggingface/accelerate/pull/1741

Multiple enhancements to gradient accumulation

In this release multiple new enhancements to distributed gradient accumulation have been added.

FSDP Changes

DataLoader Changes

What's New?

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @yuxinyuan
    • Support wrapping multiple models in Accelerator.accumulate() (#1708)
    • Fix errors when optimizer is not a Pytorch optimizer. (#1733)
    • Get rid of calling get_scale() by patching the step method of optimizer. (#1720)
  • @NouamaneTazi
    • Better control over DDP's no_sync (#1726)
  • @abhilash1910
    • Add FSDP for XPU (#1803)
    • Ipex bug fix for device properties in modelling (#1834)
  • @statelesshz
    • Add FSDP for NPU (#1806)
    • fix failing test on 8GPU (#1724)
    • fix the bug in npu (#1728)
  • @thevasudevgupta
    • support custom slice function in DataLoaderDispatcher (#1846)

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.21.0...v0.22.0

Jul 13, 2023
v0.21.0: Model quantization and NPUs

Model quantization with bitsandbytes

You can now quantize any model (not just Transformers models) using Accelerate. This is mainly useful for models with a lot of linear layers. See the documentation for more information!

  • Bnb quantization by @SunMarc in #1626

Support for Ascend NPUs

Accelerate now supports Ascend NPUs.

  • Add Ascend NPU accelerator support by @statelesshz in #1676

What's new?

Accelerate now requires Python 3.8+ and PyTorch 1.10+:

  • 🚨🚨🚨 Spring cleaning: Python 3.8 🚨🚨🚨 by @muellerzr in #1661

  • 🚨🚨🚨 Spring cleaning: PyTorch 1.10 🚨🚨🚨 by @muellerzr in #1662

  • [doc build] Use secrets by @mishig25 in #1551

  • Update launch.mdx by @LiamSwayne in #1553

  • Avoid double wrapping of all accelerate.prepare objects by @muellerzr in #1555

  • Update README.md by @LiamSwayne in #1556

  • Fix load_state_dict when there is one device and disk by @sgugger in #1557

  • Fix tests not being ran on multi-GPU nightly by @muellerzr in #1558

  • fix the typo when setting the "_accelerator_prepared" attribute by @Yura52 in #1560

  • [core] Fix possibility to passNoneType objects in prepare by @younesbelkada in #1561

  • Reset dataloader end_of_datalaoder at each iter by @sgugger in #1562

  • Update big_modeling.mdx by @LiamSwayne in #1564

  • [bnb] Fix failing int8 tests by @younesbelkada in #1567

  • Update gradient sync docs to reflect importance of optimizer.step() by @dleve123 in #1565

  • Update mixed precision integrations in README by @sgugger in #1569

  • Raise error instead of warn by @muellerzr in #1568

  • Introduce listify, fix tensorboard silently failing by @muellerzr in #1570

  • Check for bak and expand docs on directory structure by @muellerzr in #1571

  • Perminant solution by @muellerzr in #1577

  • fix the bug in xpu by @mingxiaoh in #1508

  • Make sure that we only set is_accelerator_prepared on items accelerate actually prepares by @muellerzr in #1578

  • Expand prepare() doc by @muellerzr in #1580

  • Get Torch version using importlib instead of pkg_resources by @catwell in #1585

  • improve oob performance when use mpirun to start DDP finetune without accelerate launch by @sywangyi in #1575

  • Update training_tpu.mdx by @LiamSwayne in #1582

  • Return false if CUDA available by @muellerzr in #1581

  • fix logger level by @caopulan in #1579

  • Fix test by @muellerzr in #1586

  • Update checkpoint.mdx by @LiamSwayne in #1587

  • FSDP updates by @pacman100 in #1576

  • Update modeling.py by @ain-soph in #1595

  • Integration tests by @muellerzr in #1593

  • Add triggers for CI workflow by @muellerzr in #1597

  • Remove asking xpu plugin for non xpu devices by @abhilash1910 in #1594

  • Remove GPU safetensors env variable by @sgugger in #1603

  • reset end_of_dataloader for dataloader_dispatcher by @megavaz in #1609

  • fix for arc gpus by @abhilash1910 in #1615

  • Ignore low_zero option when only device is available by @sgugger in #1617

  • Fix failing multinode tests by @muellerzr in #1616

  • Doc to md by @sgugger in #1618

  • Fix tb issue by @muellerzr in #1623

  • Fix workflow by @muellerzr in #1625

  • Fix transformers sync bug with accumulate by @muellerzr in #1624

  • fixes offload dtype by @SunMarc in #1631

  • fix: Megatron is not installed. please build it from source. by @yuanwu2017 in #1636

  • deepspeed z2/z1 state_dict bloating fix by @pacman100 in #1638

  • Swap disable rich by @muellerzr in #1640

  • fix autocasting bug by @pacman100 in #1637

  • fix modeling low zero by @abhilash1910 in #1634

  • Add skorch to runners by @muellerzr in #1646

  • add save model by @SunMarc in #1641

  • Change dispatch_model when we have only one device by @SunMarc in #1648

  • Doc save model by @SunMarc in #1650

  • Fix device_map by @SunMarc in #1651

  • Check for port usage before launch by @muellerzr in #1656

  • [BigModeling] Add missing check for quantized models by @younesbelkada in #1652

  • Bump integration by @muellerzr in #1658

  • TIL by @muellerzr in #1657

  • docker cpu py version by @muellerzr in #1659

  • [BigModeling] Final fix for dispatch int8 and fp4 models by @younesbelkada in #1660

  • remove safetensor dep on shard_checkpoint by @SunMarc in #1664

  • change the import place to avoid import error by @pacman100 in #1653

  • Update broken Runhouse link in examples/README.md by @dongreenberg in #1668

  • Bnb quantization by @SunMarc in #1626

  • replace save funct in doc by @SunMarc in #1672

  • Doc big model inference by @SunMarc in #1670

  • Add docs for saving Transformers models by @deppen8 in #1671

  • fix bnb tests by @SunMarc in #1679

  • Fix workflow CI by @muellerzr in #1690

  • remove duplicate class by @SunMarc in #1691

  • update readme in examples by @statelesshz in #1678

  • Fix nightly tests by @muellerzr in #1696

  • Fixup docs by @muellerzr in #1697

  • Improve quality errors by @muellerzr in #1698

  • Move mixed precision wrapping ahead of DDP/FSDP wrapping by @ChenWu98 in #1682

  • Add offload for 8-bit model by @SunMarc in #1699

  • Deepcopy on Accelerator to return self by @muellerzr in #1694

  • Update tracking.md by @stevhliu in #1702

  • Skip tests when bnb isn't available by @muellerzr in #1706

  • Fix launcher validation by @abhilash1910 in #1705

  • Fixes for issue #1683: failed to run accelerate config in colab by @Erickrus in #1692

  • Fix the bug where DataLoaderDispatcher gets stuck in an infinite wait when the dataset is an IterDataPipe during multi-process training. by @yuxinyuan in #1709

  • add multi_gpu decorator by @SunMarc in #1712

  • Modify loading checkpoint behavior by @SunMarc in #1715

  • fix version by @SunMarc in #1701

  • Keep old behavior by @muellerzr in #1716

  • Optimize get_scale to reduce async calls by @muellerzr in #1718

  • Remove duplicate code by @muellerzr in #1717

  • New tactic by @muellerzr in #1719

  • add Comfy-UI by @pacman100 in #1723

  • add compatibility with peft by @SunMarc in #1725

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @LiamSwayne
    • Update launch.mdx (#1553)
    • Update README.md (#1556)
    • Update big_modeling.mdx (#1564)
    • Update training_tpu.mdx (#1582)
    • Update checkpoint.mdx (#1587)
  • @mingxiaoh
    • fix the bug in xpu (#1508)
  • @statelesshz
    • update readme in examples (#1678)
    • Add Ascend NPU accelerator support (#1676)
  • @ChenWu98
    • Move mixed precision wrapping ahead of DDP/FSDP wrapping (#1682)
Latest: v1.13.0
Tracking since: Mar 5, 2021
Last fetched: Apr 18, 2026