safetensors version 0.4.3; numpy 2.0.0.
The accelerate library will handle process-group cleanup automatically with accelerator.end_training(), or you can do it manually using PartialState().destroy_process_group().
transfer_to_npu, ensuring better performance and compatibility.
StatefulDataLoader from torchdata, allowing better handling of data loading states. Enable it by passing use_stateful_dataloader=True to the DataLoaderConfiguration; when calling load_state(), the DataLoader will automatically be resumed from its last step, with no more having to iterate through passed batches.
The prepare_data_loader() function is now independent of the Accelerator, giving you more flexibility in which API levels you would like to use.
DataLoader states, ensuring smoother training sessions.
set_epoch function for MpDeviceLoaderWrapper.
TransformerEngine FP8 training, including better defaults for the quantized FP8 weights.
Benchmark scripts verify the TransformerEngine integration works exactly as intended. These scripts run one half using 🤗 Accelerate's integration, the other with raw TransformerEngine, providing users with a nice example of what we do under the hood with accelerate, and a good sanity check to make sure nothing breaks down over time. Find them here.
A Docker image with TransformerEngine and accelerate as well. Use docker pull huggingface/accelerate@gpu-fp8-transformerengine to quickly get an environment going.
torchpippy no more, long live torch.distributed.pipelining
torchpippy is now fully integrated into torch core, and as a result we are exclusively supporting the PyTorch implementation from now on.
Inputs are now [1, n, n] rather than [2, n, n] as before.
pipelining no longer supports encoder/decoder models, so the t5 example has been removed.
Pin torchpippy potentially if needed.
You can now build the FullyShardedDataParallelPlugin yourself manually with no need for environment patching:
from accelerate import FullyShardedDataParallelPlugin
fsdp_plugin = FullyShardedDataParallelPlugin(...)
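The stateful-dataloader resumption described above can be sketched in plain Python. This is a toy illustration under assumed semantics, not torchdata's actual `StatefulDataLoader`: the loader records how many batches it has yielded, and `load_state_dict` fast-forwards past them so a restored run does not replay old batches.

```python
class ResumableLoader:
    """Toy stand-in for torchdata's StatefulDataLoader: tracks the number
    of batches already yielded so iteration can resume mid-epoch."""

    def __init__(self, batches):
        self.batches = list(batches)
        self.step = 0

    def state_dict(self):
        return {"step": self.step}

    def load_state_dict(self, state):
        self.step = state["step"]

    def __iter__(self):
        # Resume from the recorded step instead of replaying earlier batches
        for batch in self.batches[self.step:]:
            self.step += 1
            yield batch


loader = ResumableLoader([[0, 1], [2, 3], [4, 5]])
it = iter(loader)
next(it)                     # consume one batch
saved = loader.state_dict()  # checkpointed, e.g. as part of save_state()

fresh = ResumableLoader([[0, 1], [2, 3], [4, 5]])
fresh.load_state_dict(saved)  # restored, e.g. as part of load_state()
print(list(fresh))            # only the remaining batches are yielded
```

With `use_stateful_dataloader=True`, Accelerate's `load_state()` performs the equivalent restore for you.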
If you are not using accelerate launch and need to ensure the env variables are set up properly for model loading:
from accelerate.utils import enable_fsdp_ram_efficient_loading, disable_fsdp_ram_efficient_loading
enable_fsdp_ram_efficient_loading()
axolotl library, so very big kudos to their wonderful work
step when loading the state by @muellerzr in https://github.com/huggingface/accelerate/pull/2992
find_tied_params for models with shared layers by @qubvel in https://github.com/huggingface/accelerate/pull/2986
transformer_engine on import by @oraluben in https://github.com/huggingface/accelerate/pull/3056
skip_first_batches support for StatefulDataloader and fix all the tests by @muellerzr in https://github.com/huggingface/accelerate/pull/3068
step when loading the state by @muellerzr in https://github.com/huggingface/accelerate/pull/2992
find_tied_params for models with shared layers by @qubvel in https://github.com/huggingface/accelerate/pull/2986
end_training by @SunMarc in https://github.com/huggingface/accelerate/pull/3012
torchdata.stateful_dataloader.StatefulDataLoader within the Accelerator by @byi8220 in https://github.com/huggingface/accelerate/pull/2895
prepare_data_loader() from Accelerator by @siddk in https://github.com/huggingface/accelerate/pull/3047
transformer_engine on import by @oraluben in https://github.com/huggingface/accelerate/pull/3056
skip_first_batches support for StatefulDataloader and fix all the tests by @muellerzr in https://github.com/huggingface/accelerate/pull/3068
Small release this month, with key focuses on some added support for backends and bugs:
torch.float8_e4m3fn format dtype_byte_size by @SunMarc in https://github.com/huggingface/accelerate/pull/2945
device_map="auto" by @muellerzr in https://github.com/huggingface/accelerate/pull/2914
multi_gpu was being set and warning being printed even with num_processes=1 by @HarikrishnanBalagopal in https://github.com/huggingface/accelerate/pull/2921
pip caching in CI by @SauravMaheshkar in https://github.com/huggingface/accelerate/pull/2952
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.32.1...v0.33.0
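The `skip_first_batches` support referenced in the changelogs above amounts to dropping the batches a resumed epoch has already consumed. A minimal stdlib sketch, illustrative only and not Accelerate's implementation:

```python
from itertools import islice


def skip_first_batches(dataloader, num_batches):
    """Toy sketch of batch skipping: yield the remainder of an epoch
    after dropping the first `num_batches` batches already seen."""
    return islice(iter(dataloader), num_batches, None)


batches = [[0, 1], [2, 3], [4, 5], [6, 7]]
resumed = list(skip_first_batches(batches, 2))
print(resumed)  # [[4, 5], [6, 7]]
```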
huggingface_hub rather than our own implementation (https://github.com/huggingface/accelerate/pull/2795)
dispatch_model (https://github.com/huggingface/accelerate/pull/2855)
Accelerator.step number is now restored when using save_state and load_state (https://github.com/huggingface/accelerate/pull/2765)
import accelerate and any other major core import by 68%, now should be only slightly longer than doing import torch (https://github.com/huggingface/accelerate/pull/2845)
get_backend and added a clear_device_cache utility (https://github.com/huggingface/accelerate/pull/2857)
allreduce. (https://github.com/huggingface/accelerate/pull/2841)
log_line_prefix_template optional the notebook_launcher (https://github.com/huggingface/accelerate/pull/2888)
accelerate merge-weights, one will be automatically created (https://github.com/huggingface/accelerate/pull/2854)
.safetensors (https://github.com/huggingface/accelerate/pull/2853)
torch>=2.4 (https://github.com/huggingface/accelerate/pull/2825)
@require_triton test decorator and enable test_dynamo work on xpu (https://github.com/huggingface/accelerate/pull/2878)
load_state_dict not working on xpu and refine xpu safetensors version check (https://github.com/huggingface/accelerate/pull/2879)
accelerate launch (https://github.com/huggingface/accelerate/pull/2902)
dispatch_model by @panjd123 in https://github.com/huggingface/accelerate/pull/2855
test_tracking.ClearMLTest by @faaany in https://github.com/huggingface/accelerate/pull/2863
torch_device instead of 0 for device check by @faaany in https://github.com/huggingface/accelerate/pull/2861
test_zero3_integration by @faaany in https://github.com/huggingface/accelerate/pull/2864
log_line_prefix_template Optional in Elastic Launcher for Backward Compatibility by @yhna940 in https://github.com/huggingface/accelerate/pull/2888
require_triton and enable test_dynamo work on xpu by @faaany in https://github.com/huggingface/accelerate/pull/2878
load_state_dict for xpu and refine xpu safetensor version check by @faaany in https://github.com/huggingface/accelerate/pull/2879
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.31.0...v0.32.0
timeout default to PyTorch defaults based on backend by @muellerzr in https://github.com/huggingface/accelerate/pull/2758
notebook_launcher by @yhna940 in https://github.com/huggingface/accelerate/pull/2788
logging to log the actual user call site (instead of the call site inside the logger wrapper) of log functions by @luowyang in https://github.com/huggingface/accelerate/pull/2730
notebook_launcher by @yhna940 in https://github.com/huggingface/accelerate/pull/2788
get_balanced_memory by @faaany in https://github.com/huggingface/accelerate/pull/2826
stage3_prefetch_bucket_size value to an integer by @adk9 in https://github.com/huggingface/accelerate/pull/2814
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.30.1...v0.31.0
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.30.0...v0.30.1
The tqdm wrapper is now fully passthrough: no need for tqdm(main_process_only, *args), it is now just tqdm(*args) and you can pass in is_main_process as a kwarg.
cann version info to command accelerate env for NPU by @statelesshz in https://github.com/huggingface/accelerate/pull/2689
deepspeed-specific Docker image by @muellerzr in https://github.com/huggingface/accelerate/pull/2707. To use, pull the gpu-deepspeed tag: docker pull huggingface/accelerate:cuda-deepspeed-nightly
is_train_batch_min type in DeepSpeedPlugin by @yhna940 in https://github.com/huggingface/accelerate/pull/2646
free_memory to deal with garbage collection by @muellerzr in https://github.com/huggingface/accelerate/pull/2716
execution_device by @faaany in https://github.com/huggingface/accelerate/pull/2612
is_train_batch_min type in DeepSpeedPlugin by @yhna940 in https://github.com/huggingface/accelerate/pull/2646
Repository anymore by @Wauplin in https://github.com/huggingface/accelerate/pull/2658
tqdm: *args should come ahead of main_process_only by @rb-synth in https://github.com/huggingface/accelerate/pull/2654
free_memory to deal with garbage collection by @muellerzr in https://github.com/huggingface/accelerate/pull/2716
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.29.3...v0.30.0
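The passthrough tqdm signature described above can be illustrated with a toy wrapper. The wrapper below is hypothetical (it does not draw a real progress bar); only the argument shape, positional args first and `is_main_process` as a keyword, mirrors the release note:

```python
def tqdm(*args, is_main_process=True, **kwargs):
    """Toy sketch of the passthrough signature: positional args go
    straight through, and `is_main_process` is an ordinary kwarg
    rather than a mandatory first positional argument."""
    iterable = args[0] if args else ()
    if not is_main_process:
        # Non-main processes iterate silently, with no progress bar
        return iter(iterable)
    # A real wrapper would hand *args/**kwargs to tqdm.tqdm here
    return iter(iterable)


print(list(tqdm(range(3))))                         # [0, 1, 2]
print(list(tqdm(range(3), is_main_process=False)))  # [0, 1, 2]
```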
load_checkpoint_and_dispatch needs a strict argument
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.29.2...v0.29.3
Fixed an import which would cause running accelerate CLI to fail if pytest wasn't installed
Enable it in accelerate config, set the ACCELERATE_CPU_AFFINITY=1 env variable, or set it manually using the following:
from accelerate.utils import set_numa_affinity

# For GPU 0
set_numa_affinity(0)
Big thanks to @stas00 for the recommendation, request, and feedback during development
set_seed by @muellerzr in https://github.com/huggingface/accelerate/pull/2569
BatchSamplerShard by @universuen in https://github.com/huggingface/accelerate/pull/2584
notebook_launcher can use multiple GPUs in Google Colab if using a custom instance that supports multiple GPUs by @StefanTodoran in https://github.com/huggingface/accelerate/pull/2561
load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in https://github.com/huggingface/accelerate/pull/2588
main_process_ip and master_addr when not using standard as deepspeed launcher by @asdfry in https://github.com/huggingface/accelerate/pull/2495
deepspeed, set it with the DS_ENV_FILE environmental variable by @muellerzr in https://github.com/huggingface/accelerate/pull/2566
main_process_ip and master_addr when not using standard as deepspeed launcher by @asdfry in https://github.com/huggingface/accelerate/pull/2495
load_checkpoint_in_model behavior when unexpected keys are in the checkpoint by @fxmarty in https://github.com/huggingface/accelerate/pull/2588
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.28.0...v0.29.0
DataLoaderConfiguration and begin deprecation of arguments in the Accelerator:
+from accelerate import DataLoaderConfiguration
+dl_config = DataLoaderConfiguration(split_batches=True, dispatch_batches=True)
-accelerator = Accelerator(split_batches=True, dispatch_batches=True)
+accelerator = Accelerator(dataloader_config=dl_config)
from accelerate import GradientAccumulationPlugin

plugin = GradientAccumulationPlugin(
    num_steps=2,
    sync_each_batch=True,
)
accelerator = Accelerator(gradient_accumulation_plugin=plugin)
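Numerically, gradient accumulation with num_steps=2 means the optimizer only steps every second batch, using the gradients of both batches; sync_each_batch controls whether gradients are synchronized across processes on every batch or only at the boundary step. A stdlib sketch of the accumulation bookkeeping (illustrative only, not Accelerate's implementation):

```python
def accumulate_updates(grads, num_steps):
    """Toy sketch of gradient accumulation: parameters are only updated
    every `num_steps` batches, using the accumulated (summed) gradient."""
    updates = []
    running = 0.0
    for i, g in enumerate(grads, start=1):
        running += g
        if i % num_steps == 0:  # boundary step: apply the update and reset
            updates.append(running)
            running = 0.0
    return updates


# Four per-batch gradients, accumulated two at a time -> two optimizer steps
print(accumulate_updates([1.0, 2.0, 3.0, 4.0], num_steps=2))  # [3.0, 7.0]
```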
launch changes
mpirun for multi-cpu training by @dmsuehir in https://github.com/huggingface/accelerate/pull/2493
is_torch_tensor over hasattr for torch.compile. by @PhilJd in https://github.com/huggingface/accelerate/pull/2387
DataLoaderConfig by @muellerzr in https://github.com/huggingface/accelerate/pull/2441
is_namedtuple implementation by @fxmarty in https://github.com/huggingface/accelerate/pull/2475
os.path.sep.join path manipulations with a helper by @akx in https://github.com/huggingface/accelerate/pull/2446
XLA device type by @will-cromar in https://github.com/huggingface/accelerate/pull/2467
Accelerator to detect distributed type from the "LOCAL_RANK" env variable for XPU by @faaany in https://github.com/huggingface/accelerate/pull/2473
accelerate launch by @muellerzr in https://github.com/huggingface/accelerate/pull/2498
----main_process_port to --main_process_port) by @DerrickWang005 in https://github.com/huggingface/accelerate/pull/2516
PYTORCH_NVML_BASED_CUDA_CHECK when calling accelerate.utils.imports.is_cuda_available() by @luiscape in https://github.com/huggingface/accelerate/pull/2524
env=os.environ.copy()s by @akx in https://github.com/huggingface/accelerate/pull/2449
zero_grad(set_to_none=None) to align with PyTorch by @yongchanghao in https://github.com/huggingface/accelerate/pull/2472
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.27.2...v0.28.0
With the latest release of PyTorch 2.2.0, we've ensured that Accelerate has no breaking changes with it
With this release we are excited to announce support for pipeline-parallel inference by integrating PyTorch's PiPPy framework (so no need to use Megatron or DeepSpeed)! This supports automatic model-weight splitting to each device using a similar API to device_map="auto". This is still under heavy development, however the inference side is stable enough that we are ready for a release. Read more about it in our docs and check out the example zoo.
Requires pippy of version 0.2.0 or later (pip install torchpippy -U)
Example usage (combined with accelerate launch or torchrun):
import torch
from transformers import AutoModelForSequenceClassification

from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
# Build an example input first so it can be traced through the model
input = torch.randint(0, 50257, (1, 16))
model = prepare_pippy(model, split_points="auto", example_args=(input,))
input = input.to("cuda:0")
with torch.no_grad():
    output = model(input)
# The outputs are only on the final process by default
# You can pass in `gather_outputs=True` to prepare_pippy to
# make them available on all processes
if PartialState().is_last_process:
    output = torch.stack(tuple(output[0]))
    print(output.shape)
This release provides support for utilizing DeepSpeed on XPU devices thanks to @faaany
dispatch_model, and in forward with offloading by @fxmarty in https://github.com/huggingface/accelerate/pull/2330
accelerate config by @faaany in https://github.com/huggingface/accelerate/pull/2346
block_size picking in megatron_lm_gpt_pretraining example. by @nilq in https://github.com/huggingface/accelerate/pull/2342
FP8RecipeKwargs by @sudhakarsingh27 in https://github.com/huggingface/accelerate/pull/2355
add_hook_to_module and remove_hook_from_module compatibility with fx.GraphModule by @fxmarty in https://github.com/huggingface/accelerate/pull/2369
requires_grad to kwargs when registering empty parameters. by @BlackSamorez in https://github.com/huggingface/accelerate/pull/2376
adapter_only option to save_fsdp_model and load_fsdp_model to only save/load PEFT weights by @AjayP13 in https://github.com/huggingface/accelerate/pull/2321
split_batches by @izhx in https://github.com/huggingface/accelerate/pull/2344
nproc_per_node in the multi gpu test by @faaany in https://github.com/huggingface/accelerate/pull/2422
Accelerator to prepare models in eval mode for XPU&CPU by @faaany in https://github.com/huggingface/accelerate/pull/2426
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.26.1...v0.27.0
dispatch_batches=True by @SunMarc in https://github.com/huggingface/accelerate/pull/2325
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.26.0...v0.26.1
This release adds support for the MS-AMP (Microsoft Automatic Mixed Precision Library) into Accelerate as an alternative backend for doing FP8 training on appropriate hardware. It is the default backend of choice. Read more in the docs here. Introduced in https://github.com/huggingface/accelerate/pull/2232 by @muellerzr
In the prior release a new sampler for the DataLoader was introduced that while across seeds does not show statistical differences in the results, repeating the same seed would result in a different end-accuracy that was scary to some users. We have now disabled this behavior by default as it required some additional setup, and brought back the original implementation. To have the new sampling technique (which can provide more accurate repeated results) pass use_seedable_sampler=True to the Accelerator. We will be propagating this up to the Trainer soon.
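The seedable-sampler idea above, shuffling as a pure function of (seed, epoch) so that repeated runs with the same seed reproduce exactly while each epoch still reshuffles, can be sketched with the standard library. This is illustrative only, not the actual SeedableRandomSampler:

```python
import random


def seedable_permutation(n, seed, epoch):
    """Toy sketch of a seedable sampler: the shuffle order is a pure
    function of (seed, epoch), so re-running with the same seed
    reproduces the same ordering, while each epoch reshuffles."""
    order = list(range(n))
    random.Random(seed + epoch).shuffle(order)
    return order


# Same seed + epoch -> identical order across runs
print(seedable_permutation(8, seed=42, epoch=0))
print(seedable_permutation(8, seed=42, epoch=1))
```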
For device_map, we've made it possible to not return grouped key results if desired in https://github.com/huggingface/accelerate/pull/2233
device_map="cuda" etc thanks to @younesbelkada in https://github.com/huggingface/accelerate/pull/2254
Many improvements to the docs have been made thanks to @stas00. Along with this we've made it easier to adjust the config for the sharding strategy and other config values thanks to @pacman100 in https://github.com/huggingface/accelerate/pull/2288
A regression in Accelerate 0.23.0 occurred that showed learning is much slower on multi-GPU setups compared to a single GPU. https://github.com/huggingface/accelerate/pull/2304 has now fixed this thanks to @pacman100
The DeepSpeed integration now also handles auto values better when making a configuration in https://github.com/huggingface/accelerate/pull/2313
Params4bit added to bnb classes in set_module_tensor_to_device() by @poedator in https://github.com/huggingface/accelerate/pull/2315
For developers, we've made it much easier to run the tests on different devices with no change to the code thanks to @statelesshz in https://github.com/huggingface/accelerate/pull/2123 and https://github.com/huggingface/accelerate/pull/2235
offload_state_dict=True and dtype is specified by @fxmarty in https://github.com/huggingface/accelerate/pull/2116
auto values for comm buffers by @stas00 in https://github.com/huggingface/accelerate/pull/2295
offload_state_dict=True and dtype is specified by @fxmarty in https://github.com/huggingface/accelerate/pull/2116
[Big-Modeling] Harmonize device check to handle corner cases by @younesbelkada in https://github.com/huggingface/accelerate/pull/2254
log_images for aim tracker by @Justin900429 in https://github.com/huggingface/accelerate/pull/2257
check_tied_parameters_on_same_device by @SunMarc in https://github.com/huggingface/accelerate/pull/2218
auto values for comm buffers by @stas00 in https://github.com/huggingface/accelerate/pull/2295
prepare_data_loader by @izhx in https://github.com/huggingface/accelerate/pull/2310
Params4bit added to bnb classes in set_module_tensor_to_device() by @poedator in https://github.com/huggingface/accelerate/pull/2315
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.25.0...v0.26.0
As of this release, safetensors will be the default format saved when applicable! To read more about safetensors and why it's best to use it for safety (and not pickle/torch.save), check it out here
This release has two new experiment trackers, ClearML and DVCLive!
To use them, just pass clear_ml or dvclive to log_with in the Accelerator init. h/t to @eugen-ajechiloae-clearml and @dberenbaum
FSDP had a huge refactoring so that the interface when using FSDP is the exact same as every other scenario when using accelerate. No more needing to call accelerator.prepare() twice!
We now try to disable P2P communications on consumer GPUs for the 3090 series and beyond. Without this, users were seeing timeout issues and the like as NVIDIA dropped P2P support. If using accelerate launch we will automatically disable it, and if we sense that P2P is still enabled on distributed setups using 3090s or newer, we will raise an error.
When doing .gather(), if tensors are on different devices we explicitly will raise an error (for now only valid on CUDA)
shuffle=True when using multiple GPUs and the new SeedableRandomSampler.
save as False by @muellerzr in https://github.com/huggingface/accelerate/pull/2138
launch, and pick up in state if a user will face issues. by @muellerzr in https://github.com/huggingface/accelerate/pull/2195
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.24.1...v0.25.0
One critical issue with Accelerate was that training runs differed when using an iterable dataset, no matter what seeds were set. v0.24.0 introduces the dataloader.set_epoch() function to all Accelerate DataLoaders, where if the underlying dataset (or sampler) has the ability to set the epoch for reproducibility it will do so. This is similar to the implementation already existing in transformers. To use:
dataloader = accelerator.prepare(dataloader)
# Say we want to resume at epoch/iteration 2
dataloader.set_epoch(2)
For more information see this PR, we will update the docs on a subsequent release with more information on this API.
save and save_state via the ProjectConfiguration dataclass. See #1953 for more info.
bfloat16 mixed precision via torch.autocast
all_gather_into_tensor is now used as the main gather operation, reducing memory in the cases of big tensors
drop_last=True will now properly have the desired effect when performing Accelerator().gather_for_metrics()
dispatch_model by @austinapatel in https://github.com/huggingface/accelerate/pull/1971
save and save_state via ProjectConfiguration by @muellerzr in https://github.com/huggingface/accelerate/pull/1953
torch.autocast for bfloat16 mixed precision by @brcps12 in https://github.com/huggingface/accelerate/pull/2033
all_gather_into_tensor by @muellerzr in https://github.com/huggingface/accelerate/pull/1968
gather_for_metrics by @muellerzr in https://github.com/huggingface/accelerate/pull/2048
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.23.0...v0.24.0
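The gather_for_metrics behavior above relates to how distributed gathering pads the final incomplete batch: processes duplicate samples so shapes match, and after the all-gather the result is truncated back to the true dataset length so the duplicates never skew metrics. A toy sketch of that truncation (not Accelerate's code):

```python
def trim_padding(gathered, dataset_length):
    """Toy sketch of gather_for_metrics' final step: after all processes'
    results are gathered, truncate back to the true dataset length so
    duplicated padding samples from the last batch are dropped."""
    return gathered[:dataset_length]


# 10 real samples; the last two entries are padding duplicates of 8 and 9
gathered = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 9]
print(trim_padding(gathered, dataset_length=10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```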
A new model estimation tool to help calculate how much memory is needed for inference has been added. This does not download the pretrained weights, and utilizes init_empty_weights to stay memory efficient during the calculation.
Usage directions:
accelerate estimate-memory {model_name} --library {library_name} --dtypes fp16 int8
Or:
from accelerate.commands.estimate import estimate_command_parser, estimate_command, gather_data
parser = estimate_command_parser()
args = parser.parse_args(["bert-base-cased", "--dtypes", "float32"])
output = gather_data(args)
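Under the hood, the estimator's headline number is simple arithmetic: parameter count times bytes per dtype element. A back-of-the-envelope sketch — the per-dtype byte sizes are standard, but the parameter count below is approximate and used only for illustration:

```python
def estimate_memory_bytes(num_params, dtype):
    """Back-of-the-envelope arithmetic behind memory estimation:
    parameters x bytes per element for the given dtype."""
    bytes_per_param = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}
    return num_params * bytes_per_param[dtype]


# ~110M parameters (roughly bert-base-cased) in float32:
gib = estimate_memory_bytes(110_000_000, "float32") / 2**30
print(f"{gib:.2f} GiB")  # about 0.41 GiB for the weights alone
```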
We've made the huggingface_hub library a first-class citizen of the framework! While this is mainly for the model estimation tool, this opens the doors for further integrations should they be wanted
Accelerator Enhancements:
gather_for_metrics will now also de-dupe for non-tensor objects. See #1937
mixed_precision="bf16" support on NPU devices. See #1949
breakpoint API to help when dealing with trying to break from a condition on a single process. See #1940
torch.compile support was fixed. See #1919
gradient_accumulation_steps to "auto" in your deepspeed config, and Accelerate will use the one passed to Accelerator instead (#1901)
accelerate config on npu by @statelesshz in https://github.com/huggingface/accelerate/pull/1895
[Tests] Finish all todos by @younesbelkada in https://github.com/huggingface/accelerate/pull/1957
force_hooks to dispatch_model by @austinapatel in https://github.com/huggingface/accelerate/pull/1969
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.22.0...v0.23.0
A new framework has been introduced which can help catch timeout errors caused by distributed operations failing before they occur. As this adds a tiny bit of overhead, it is an opt-in scenario. Simply run your code with ACCELERATE_DEBUG_MODE="1" to enable this. Read more in the docs, introduced via https://github.com/huggingface/accelerate/pull/1756
Accelerator.load_state can now load the most recent checkpoint automatically. If a ProjectConfiguration has been made, using accelerator.load_state() (without any arguments passed) can now automatically find and load the latest checkpoint used, introduced via https://github.com/huggingface/accelerate/pull/1741
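The automatic-latest-checkpoint behavior can be sketched as scanning the project directory for the highest-numbered checkpoint folder. The `checkpoint_<n>` directory naming below is an assumption for illustration, not necessarily Accelerate's exact layout:

```python
import os
import re
import tempfile


def latest_checkpoint(project_dir):
    """Toy sketch of automatic checkpoint discovery: return the
    checkpoint_<n> subdirectory with the highest step number."""
    best, best_step = None, -1
    for name in os.listdir(project_dir):
        m = re.fullmatch(r"checkpoint_(\d+)", name)
        if m and int(m.group(1)) > best_step:
            best, best_step = name, int(m.group(1))
    return best


with tempfile.TemporaryDirectory() as d:
    for step in (0, 5, 12):
        os.mkdir(os.path.join(d, f"checkpoint_{step}"))
    print(latest_checkpoint(d))  # checkpoint_12
```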
In this release multiple new enhancements to distributed gradient accumulation have been added.
accelerator.accumulate() now supports passing in multiple models, introduced via https://github.com/huggingface/accelerate/pull/1708
.backward() via https://github.com/huggingface/accelerate/pull/1726
DataLoaderDispatcher added via https://github.com/huggingface/accelerate/pull/1846
no_sync by @NouamaneTazi in https://github.com/huggingface/accelerate/pull/1726
get_scale() by patching the step method of optimizer. by @yuxinyuan in https://github.com/huggingface/accelerate/pull/1720
set_module_tensor_to_device. by @Narsil in https://github.com/huggingface/accelerate/pull/1731
__repr__ of AlignDevicesHook by @KacperWyrwal in https://github.com/huggingface/accelerate/pull/1735
KwargsHandler.to_kwargs not working with os.environ initialization in __post_init__ by @CyCle1024 in https://github.com/huggingface/accelerate/pull/1738
autocast kwargs and simplify autocast wrapper by @muellerzr in https://github.com/huggingface/accelerate/pull/1740
Accelerator.save_state using multi-gpu by @CyCle1024 in https://github.com/huggingface/accelerate/pull/1760
max_memory argument is in unexpected order by @ranchlai in https://github.com/huggingface/accelerate/pull/1759
is_aim_available() function to not match aim >= 4.0.0 by @alberttorosyan in https://github.com/huggingface/accelerate/pull/1769
load_fsdp_optimizer by @awgu in https://github.com/huggingface/accelerate/pull/1755
torch.distributed is disabled by @natsukium in https://github.com/huggingface/accelerate/pull/1800
get_balanced_memory to avoid OOM by @ranchlai in https://github.com/huggingface/accelerate/pull/1798
convert_file_size_to_int by @ranchlai in https://github.com/huggingface/accelerate/pull/1799
allow_val_change by @SumanthRH in https://github.com/huggingface/accelerate/pull/1796
gather_for_metrics by @dleve123 in https://github.com/huggingface/accelerate/pull/1784
load_and_quantize_model arg by @JonathanRayner in https://github.com/huggingface/accelerate/pull/1822
init_on_device by @shingjan in https://github.com/huggingface/accelerate/pull/1826
unwrap_model and keep_fp32_wrapper=False by @BenjaminBossan in https://github.com/huggingface/accelerate/pull/1838
verify_device_map by @Rexhaif in https://github.com/huggingface/accelerate/pull/1842
gpu_ids (Rel. Issue #1848) by @devymex in https://github.com/huggingface/accelerate/pull/1850
fsdp_with_peak_mem_tracking.py by @pacman100 in https://github.com/huggingface/accelerate/pull/1856
init_on_device by @shingjan in https://github.com/huggingface/accelerate/pull/1852
DataLoaderDispatcher by @thevasudevgupta in https://github.com/huggingface/accelerate/pull/1846
The following contributors have made significant changes to the library over the last release:
Accelerator.accumulate() (#1708)
no_sync (#1726)
DataLoaderDispatcher (#1846)
Full Changelog: https://github.com/huggingface/accelerate/compare/v0.21.0...v0.22.0
You can now quantize any model (not just Transformer models) using Accelerate. This is mainly for models having a lot of linear layers. See the documentation for more information!
Accelerate now supports Ascend NPUs.
Accelerate now requires Python 3.8+ and PyTorch 1.10+ :
🚨🚨🚨 Spring cleaning: Python 3.8 🚨🚨🚨 by @muellerzr in #1661
🚨🚨🚨 Spring cleaning: PyTorch 1.10 🚨🚨🚨 by @muellerzr in #1662
[doc build] Use secrets by @mishig25 in #1551
Update launch.mdx by @LiamSwayne in #1553
Avoid double wrapping of all accelerate.prepare objects by @muellerzr in #1555
Update README.md by @LiamSwayne in #1556
Fix load_state_dict when there is one device and disk by @sgugger in #1557
Fix tests not being ran on multi-GPU nightly by @muellerzr in #1558
fix the typo when setting the "_accelerator_prepared" attribute by @Yura52 in #1560
[core] Fix possibility to pass NoneType objects in prepare by @younesbelkada in #1561
Reset dataloader end_of_datalaoder at each iter by @sgugger in #1562
Update big_modeling.mdx by @LiamSwayne in #1564
[bnb] Fix failing int8 tests by @younesbelkada in #1567
Update gradient sync docs to reflect importance of optimizer.step() by @dleve123 in #1565
Update mixed precision integrations in README by @sgugger in #1569
Raise error instead of warn by @muellerzr in #1568
Introduce listify, fix tensorboard silently failing by @muellerzr in #1570
Check for bak and expand docs on directory structure by @muellerzr in #1571
Perminant solution by @muellerzr in #1577
fix the bug in xpu by @mingxiaoh in #1508
Make sure that we only set is_accelerator_prepared on items accelerate actually prepares by @muellerzr in #1578
Expand prepare() doc by @muellerzr in #1580
Get Torch version using importlib instead of pkg_resources by @catwell in #1585
improve oob performance when use mpirun to start DDP finetune without accelerate launch by @sywangyi in #1575
Update training_tpu.mdx by @LiamSwayne in #1582
Return false if CUDA available by @muellerzr in #1581
fix logger level by @caopulan in #1579
Fix test by @muellerzr in #1586
Update checkpoint.mdx by @LiamSwayne in #1587
FSDP updates by @pacman100 in #1576
Update modeling.py by @ain-soph in #1595
Integration tests by @muellerzr in #1593
Add triggers for CI workflow by @muellerzr in #1597
Remove asking xpu plugin for non xpu devices by @abhilash1910 in #1594
Remove GPU safetensors env variable by @sgugger in #1603
reset end_of_dataloader for dataloader_dispatcher by @megavaz in #1609
fix for arc gpus by @abhilash1910 in #1615
Ignore low_zero option when only device is available by @sgugger in #1617
Fix failing multinode tests by @muellerzr in #1616
Doc to md by @sgugger in #1618
Fix tb issue by @muellerzr in #1623
Fix workflow by @muellerzr in #1625
Fix transformers sync bug with accumulate by @muellerzr in #1624
fixes offload dtype by @SunMarc in #1631
fix: Megatron is not installed. please build it from source. by @yuanwu2017 in #1636
deepspeed z2/z1 state_dict bloating fix by @pacman100 in #1638
Swap disable rich by @muellerzr in #1640
fix autocasting bug by @pacman100 in #1637
fix modeling low zero by @abhilash1910 in #1634
Add skorch to runners by @muellerzr in #1646
add save model by @SunMarc in #1641
Change dispatch_model when we have only one device by @SunMarc in #1648
Doc save model by @SunMarc in #1650
Fix device_map by @SunMarc in #1651
Check for port usage before launch by @muellerzr in #1656
[BigModeling] Add missing check for quantized models by @younesbelkada in #1652
Bump integration by @muellerzr in #1658
TIL by @muellerzr in #1657
docker cpu py version by @muellerzr in #1659
[BigModeling] Final fix for dispatch int8 and fp4 models by @younesbelkada in #1660
remove safetensor dep on shard_checkpoint by @SunMarc in #1664
change the import place to avoid import error by @pacman100 in #1653
Update broken Runhouse link in examples/README.md by @dongreenberg in #1668
Bnb quantization by @SunMarc in #1626
replace save funct in doc by @SunMarc in #1672
Doc big model inference by @SunMarc in #1670
Add docs for saving Transformers models by @deppen8 in #1671
fix bnb tests by @SunMarc in #1679
Fix workflow CI by @muellerzr in #1690
remove duplicate class by @SunMarc in #1691
update readme in examples by @statelesshz in #1678
Fix nightly tests by @muellerzr in #1696
Fixup docs by @muellerzr in #1697
Improve quality errors by @muellerzr in #1698
Move mixed precision wrapping ahead of DDP/FSDP wrapping by @ChenWu98 in #1682
Add offload for 8-bit model by @SunMarc in #1699
Deepcopy on Accelerator to return self by @muellerzr in #1694
Update tracking.md by @stevhliu in #1702
Skip tests when bnb isn't available by @muellerzr in #1706
Fix launcher validation by @abhilash1910 in #1705
Fixes for issue #1683: failed to run accelerate config in colab by @Erickrus in #1692
Fix the bug where DataLoaderDispatcher gets stuck in an infinite wait when the dataset is an IterDataPipe during multi-process training. by @yuxinyuan in #1709
add multi_gpu decorator by @SunMarc in #1712
Modify loading checkpoint behavior by @SunMarc in #1715
fix version by @SunMarc in #1701
Keep old behavior by @muellerzr in #1716
Optimize get_scale to reduce async calls by @muellerzr in #1718
Remove duplicate code by @muellerzr in #1717
New tactic by @muellerzr in #1719
add Comfy-UI by @pacman100 in #1723
add compatibility with peft by @SunMarc in #1725
The following contributors have made significant changes to the library over the last release: