- NoneType objects in `prepare` in #1561 by @younesbelkada

Support has been added to run `device_map="auto"` on the MPS device. Big model inference also works with models loaded in 4-bit in Transformers.
This version introduces a new `Accelerator.split_between_processes` utility to help with performing distributed inference with non-tensorized or non-dataloader workflows. Read more here.
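The splitting behaves roughly like the following sketch (a hypothetical re-implementation for illustration only, not accelerate's actual code):

```python
# Sketch of how a payload is divided across processes: each process
# receives a contiguous, near-equal slice of the inputs.
def split_between_processes(inputs, process_index, num_processes):
    """Return the slice of `inputs` this process should handle."""
    per_proc = len(inputs) // num_processes
    remainder = len(inputs) % num_processes
    # Earlier processes absorb the remainder, one extra item each
    start = process_index * per_proc + min(process_index, remainder)
    end = start + per_proc + (1 if process_index < remainder else 0)
    return inputs[start:end]

prompts = ["A", "B", "C", "D", "E"]
shards = [split_between_processes(prompts, rank, 2) for rank in range(2)]
print(shards)  # [['A', 'B', 'C'], ['D', 'E']]
```

In the real API this happens inside a context manager, so each process transparently sees only its own shard of the prompts.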
LocalSGD

`logging_dir` has been fully deprecated; please use `project_dir` or a `ProjectConfiguration` instead.
- [core] Introducing `CustomDtype` enum for custom dtypes by @younesbelkada in #1434
- `state.rank` -> `process_index` by @pcuenca in #1450
- `in_order` argument that defaults to False, to log in order, by @JulesGM in #1262
- `register_empty_buffer` to match torch args by @NouamaneTazi in #1465
- `split_between_processes` by @muellerzr in #1477
- [bnb] Add fp4 support for dispatch by @younesbelkada in #1505

The following contributors have made significant changes to the library over the last release:
Trainer, keep an eye on the repos to see how our progress is coming along!

The wandb integration now supports logging of images and tables through `tracker.log_images` and `tracker.log_tables` respectively.

- [core] Add Quantization support for `dispatch_model` by @younesbelkada in https://github.com/huggingface/accelerate/pull/1237
- `has_transfomer_engine_layers` by @muellerzr in https://github.com/huggingface/accelerate/pull/1283
- `recursively_apply` by @muellerzr in https://github.com/huggingface/accelerate/pull/1286
- [bnb] fix bnb slow test by @younesbelkada in https://github.com/huggingface/accelerate/pull/1292
- `notebook_launcher` by @muellerzr in https://github.com/huggingface/accelerate/pull/1293
- [bnb] Fix bnb slow test by @younesbelkada in https://github.com/huggingface/accelerate/pull/1355
- | by @kiyoon in https://github.com/huggingface/accelerate/pull/1363
- `accelerate env` reporting by @muellerzr in https://github.com/huggingface/accelerate/pull/1376

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.18.0...v0.19.0
A new `GradientAccumulationPlugin` has been added to handle more configurations with the `GradientState`. Specifically, through it you can optionally disable having Accelerate automatically adjust the length of the scheduler relative to gradient accumulation steps. Otherwise, Accelerate will now automatically ensure that schedulers built for non-gradient accumulation also work during gradient accumulation. The `dynamo_backend` warning has been silenced.

- `drop_last` on linear layers, tied weight loading, and handling of multiple tied parameters
- `find_tied_parameters` now deals with groups of tied parameters (instead of only pairs of them). As a result it now returns a list of lists of strings instead of a dictionary.
- `use_orig_params` to `FullyShardedDataParallelPlugin` by @pacman100 in https://github.com/huggingface/accelerate/pull/1184
- `to` on modules that wrap accelerate-loaded models by @younesbelkada in https://github.com/huggingface/accelerate/pull/1172

Full Changelog: https://github.com/huggingface/accelerate/compare/v0.17.1...v0.18.0
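The new grouping behavior of `find_tied_parameters` can be sketched in plain Python (a hypothetical re-implementation for illustration; tied parameters share the same underlying storage, modeled here by object identity):

```python
def find_tied_parameters_sketch(params):
    """Group parameter names that point at the same object. Returns a
    list of lists (the new return type), one group per shared tensor,
    instead of the old pairwise dictionary."""
    groups = {}
    for name, tensor in params.items():
        groups.setdefault(id(tensor), []).append(name)
    # Only parameters tied to at least one other parameter form a group
    return [sorted(g) for g in groups.values() if len(g) > 1]

shared = [0.0]  # stand-in for a weight tensor shared by two modules
params = {"embed.weight": shared, "lm_head.weight": shared, "bias": [1.0]}
print(find_tied_parameters_sketch(params))  # [['embed.weight', 'lm_head.weight']]
```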
This release fully supports the upcoming PyTorch 2.0 release. You can choose whether to use `torch.compile` and then customize the options via `accelerate config` or a `TorchDynamoPlugin`.
This release adds a new `PartialState`, which contains most of the capabilities of the `AcceleratorState` but is designed to be used by the user to assist in any process control mechanisms around it. With this, users also no longer need an `if accelerator.state.is_main_process` check when utilizing classes such as the Tracking API, as these will now automatically use only the main process for their work by default.
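The main-process gating described above can be sketched with a toy stand-in (hypothetical class and decorator for illustration, not accelerate's API):

```python
import functools

class PartialStateSketch:
    """Toy stand-in for a process state object (illustration only)."""
    def __init__(self, process_index, num_processes):
        self.process_index = process_index
        self.num_processes = num_processes

    @property
    def is_main_process(self):
        return self.process_index == 0

def on_main_process(state):
    """Decorator factory: run the wrapped function only on the main process."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if state.is_main_process:
                return fn(*args, **kwargs)
            return None  # silently no-op everywhere else
        return wrapper
    return decorator

main_log = on_main_process(PartialStateSketch(0, 4))(lambda m: f"logged: {m}")
worker_log = on_main_process(PartialStateSketch(2, 4))(lambda m: f"logged: {m}")
print(main_log("loss=0.1"))    # logged: loss=0.1
print(worker_log("loss=0.1"))  # None
```

This is the pattern that lets trackers write logs exactly once in a multi-process run without any user-side branching.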
Launching from TPU pods is now supported; please see this issue for more information.
- `accelerate launch` by @muellerzr in #1049

This release adds experimental support for FP8 mixed precision training, which requires the transformer-engine library as well as a Hopper GPU (or higher).
- `mps` device by default and removing related config by @pacman100 in #1030
- `cpu_offload_with_hook` by @sgugger in #1045
- `hidden_size` auto value default fixes by @pacman100 in #1060
- `PartialState` first class citizen by @muellerzr in #1071
- `additional_args`, allowing more flexible configuration and env variable support by @dbpprt in #1113
- `launch` for greater extensibility by @Yard1 in #1123
- `torch.distributed` module by @mfuntowicz in #1108
- [Accelerator] Fix issue with 8bit models by @younesbelkada in #1155
- `total_limit` by @muellerzr in #1165

The following contributors have made significant changes to the library over the last release:
- `launch` for greater extensibility (#1123)

A new interactive tool has been introduced to the documentation to help users quickly learn how to utilize features of the framework before providing more details on them as shown below:
Not only does it provide a code diff, but it also includes an explanation and links to more resources the user should check out to learn more:
Try it out today in the docs
When resuming training, you can more efficiently skip batches in your dataloader with the new skip_first_batches function (also available as a method on your Accelerator).
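Conceptually, `skip_first_batches` behaves like this small sketch (a simplified stand-in built on `itertools.islice`; the real helper works on actual `DataLoader` objects and preserves their state):

```python
from itertools import islice

def skip_first_batches(dataloader, num_batches):
    """Yield batches from `dataloader`, skipping the first `num_batches`.
    Useful when resuming mid-epoch so already-seen batches are not replayed."""
    return islice(iter(dataloader), num_batches, None)

# Pretend each inner list is one batch; we resume after 2 completed batches.
batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
print(list(skip_first_batches(batches, 2)))  # [[4, 5], [6, 7]]
```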
A new ZeRO-3 init context manager has been added to provide granular control to users in situations involving nested/multiple models. DeepSpeed config file support has been refactored to remove ambiguity between it and the Accelerate config.
Support has been added for `auto` entries in the DeepSpeed config file, which are filled in via the `accelerate launch` command. Try it out today by referring to the section Things to note when using DeepSpeed Config File.
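For illustration, a DeepSpeed config file using `auto` entries might look like the fragment below (the field selection is an example, not a prescribed template); values marked `auto` are resolved from your training setup at launch time:

```json
{
  "train_micro_batch_size_per_gpu": "auto",
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "zero_optimization": {
    "stage": 2
  }
}
```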
- `deepspeed_config_file` by @pacman100 in #941
- `project_dir` and limit the number of saved checkpoints by @muellerzr in #916
- `init_on_device` by @thomasw21 in #926
- `ACCELERATE_` by @pacman100 in #928
- `load_checkpoint` by @sgugger in #920
- `mixed_precision_type` property to `AcceleratorState` by @pacman100 in #935
- `load_state` by @pacman100 in #989

We are very excited by the newly announced PyTorch 2.0 stack, and you can try it using Accelerate on any model via the `dynamo_backend` argument of the `Accelerator`, or when filling your config with `accelerate config`.
Note that to get the best performance, we recommend:
Two new commands have been added: `accelerate config update` and `accelerate config default`. The first will update a config file to have the latest keys added from later releases of Accelerate, and the second will create a default configuration file automatically, mimicking `write_default_config()` introduced in #851 and #853 by @muellerzr.

- `accelerate launch` will show options relevant to the choices made; for example, `accelerate launch --multi_gpu` will show launch parameters relevant to multi-GPU training.
- `join_uneven_inputs` context manager to `Accelerator` by @Chris-hughes10 in #820
- `default-config` command by @muellerzr in #840
- `batch_size` by @pacman100 in #861

The following contributors have made significant changes to the library over the last release:
- `join_uneven_inputs` context manager to `Accelerator` (#820)

Accelerate now supports Megatron-LM for the three model classes (BERT, GPT-2 and T5). You can learn more in the documentation.
Fixes a bug that returned SIGKILL errors on Windows.
`notebook_launcher`

With Kaggle now offering instances with two T4 GPUs, Accelerate can leverage this to do multi-GPU training from the notebook.
- `non_blocking` kwarg to `send_to_device()` by @NouamaneTazi in #607
- `infer_auto_device_map` by @younesbelkada in #792
- `even_batches` keyword to `Accelerator` by @Chris-hughes10 in #781
- `AcceleratedOptimizer` by @pacman100 in #811
- `recurse` argument in `remove_hook_from_module` by @younesbelkada in #812

The following contributors have made significant changes to the library over the last release:
- `even_batches` keyword to `Accelerator` (#781)

The `accelerate launch` command did not work well for distributed training using several machines. This is fixed in this version.
Instead of prefixing your launch command with CUDA_VISIBLE_DEVICES=xxx you can now specify the GPUs you want to use in your Accelerate config.
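For example (illustrative only; the exact keys can vary by Accelerate version), the config file written by `accelerate config` can pin the GPUs with a `gpu_ids` entry:

```yaml
# Fragment of an Accelerate config file (illustrative)
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
gpu_ids: 0,1        # instead of CUDA_VISIBLE_DEVICES=0,1 on the command line
num_processes: 2
```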
The tracebacks are now cleaned up to avoid printing several times the same error, and rich is integrated as an optional dependency.
- `subprocess` from the multi-gpu launcher by @muellerzr in #623
- `notebook_launcher` by @pacman100 in #695
- `init_empty_weights` to override tensor constructor by @thomasw21 in #699
- `grad_acc_steps` from accelerator obj by @pacman100 in #698
- `utils` readability fixups by @ryanrussell in #711
- `key_occurrence` readability fixup by @ryanrussell in #710
- `hooks` readability improvements by @ryanrussell in #712

The whole documentation has been revamped, just go look at it here!
When doing distributed evaluation, the dataloader loops back to the beginning of the dataset to make batches that are a round multiple of the number of processes. This causes the predictions to be slightly longer than the dataset, which used to require some truncating. This is all done behind the scenes now if you replace the `gather` you did in evaluation with `gather_for_metrics`.
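Behind the scenes, that amounts to something like this plain-Python sketch (illustrative only; the real gathering happens across processes with tensors):

```python
def gather_for_metrics(per_process_preds, dataset_len):
    """Concatenate predictions from all processes, then drop the
    duplicated samples that were added to round out the last batch."""
    gathered = [p for preds in per_process_preds for p in preds]
    return gathered[:dataset_len]

# 2 processes, dataset of 5 samples: the dataloader looped back and
# duplicated sample 0 so each process received 3 samples.
proc0, proc1 = [0, 1, 2], [3, 4, 0]
print(gather_for_metrics([proc0, proc1], 5))  # [0, 1, 2, 3, 4]
```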
When loading big models for inference, device_map="auto" used to fill the GPUs sequentially, making it hard to use a batch size > 1. It now balances the weights evenly on the GPUs so if you have more GPU space than the model size, you can do predictions with a bigger batch size!
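The balancing idea can be illustrated with a toy order-preserving placement sketch (hypothetical; the real algorithm also accounts for per-GPU memory limits and modules that cannot be split):

```python
def balanced_device_map(layer_sizes, num_gpus):
    """Walk layers in order, moving to the next GPU once the current one
    holds its fair share of the total weights, so GPUs end up with
    roughly equal memory use instead of filling sequentially."""
    total = sum(layer_sizes)
    budget = total / num_gpus  # fair share per GPU
    device_map, gpu, used = {}, 0, 0
    for i, size in enumerate(layer_sizes):
        if used + size > budget and gpu < num_gpus - 1:
            gpu, used = gpu + 1, 0  # spill over to the next GPU
        device_map[f"layer.{i}"] = gpu
        used += size
    return device_map

print(balanced_device_map([4, 4, 4, 4], 2))
# {'layer.0': 0, 'layer.1': 0, 'layer.2': 1, 'layer.3': 1}
```

With the weights spread evenly, the headroom left on each GPU is what enables batch sizes larger than 1 at inference time.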
Accelerate now supports M1 GPUs, to learn more about how to setup your environment, see the documentation.
- `mps` device integration by @pacman100 in #596
- `.run` in `WandBTracker` by @zh-plus in #605
- `set_module_tensor_to_device` by @sgugger in #576
- `datasets` by @lhoestq in #563
- `accelerate launch` by @muellerzr in #553
- 0.6.7 fix by @pacman100 in #544

The following contributors have made significant changes to the library over the last release:
Accelerate now handles gradient accumulation if you want: just pass along `gradient_accumulation_steps=xxx` when instantiating the `Accelerator` and put your training loop step under a `with accelerator.accumulate(model):` block. Accelerate will then handle the loss re-scaling and gradient accumulation for you (avoiding slowdowns in distributed training, since gradients only need to be synced when you want to step). More details in the documentation.
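The bookkeeping this takes care of can be sketched as follows (a plain-Python illustration of the loss re-scaling and deferred stepping, not accelerate's implementation):

```python
def train_steps(losses, accumulation_steps):
    """Scale each micro-batch loss by the accumulation factor and only
    'step' the optimizer every `accumulation_steps` micro-batches."""
    steps_taken, scaled = 0, []
    for i, loss in enumerate(losses, start=1):
        scaled.append(loss / accumulation_steps)  # loss re-scaling
        if i % accumulation_steps == 0:
            steps_taken += 1  # optimizer.step() + zero_grad() would go here
    return steps_taken, scaled

steps, scaled = train_steps([4.0, 2.0, 6.0, 8.0], 2)
print(steps)   # 2 optimizer steps for 4 micro-batches
print(scaled)  # [2.0, 1.0, 3.0, 4.0]
```

In distributed training the same schedule also decides when gradients are synced across processes, which is where the speedup comes from.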
Accelerate now supports SageMaker's specific brand of data parallelism.
- `split_batches=True` by @sgugger in #509
- `total_batch_size` attribute by @pacman100 in #493

This release adds two major new features: the DeepSpeed integration has been revamped to match the one in Transformers Trainer, with multiple new options unlocked, and the TPU integration has been sped up.
This version also officially stops supporting Python 3.6 and requires Python 3.7+
Users can now specify a DeepSpeed config file when they want to use DeepSpeed, which unlocks many new options. More details in the new documentation.
If you're using TPUs we have sped up the dataloaders and models quite a bit, on top of a few bug fixes.
- `no_sync` context wrapper + clean up some more warnings for DDP by @muellerzr in #428

This release offers no significant new API; it is just needed to have access to some utils in Transformers.
To handle very large models, new functionality has been added in Accelerate:
- `load_checkpoint_and_dispatch`

See more in the documentation.
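The core idea of checkpoint dispatching can be sketched in plain Python (a toy illustration, not accelerate's actual `load_checkpoint_and_dispatch`; the shard and device-map shapes here are invented for the example):

```python
def dispatch_checkpoint(shards, device_map):
    """Load each checkpoint shard and route every weight to the device
    its layer was assigned in `device_map`, so no single device ever
    has to hold the full model at once."""
    placed = {}
    for shard in shards:  # each shard: {param_name: weight}
        for name, weight in shard.items():
            layer = name.rsplit(".", 1)[0]  # "layer.0.weight" -> "layer.0"
            placed[name] = (device_map.get(layer, "cpu"), weight)
    return placed

shards = [{"layer.0.weight": [1.0]}, {"layer.1.weight": [2.0]}]
device_map = {"layer.0": 0, "layer.1": 1}
print(dispatch_checkpoint(shards, device_map))
# {'layer.0.weight': (0, [1.0]), 'layer.1.weight': (1, [2.0])}
```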