v0.11.0: Gradient Accumulation and SageMaker Data Parallelism

Gradient Accumulation

Accelerate can now handle gradient accumulation for you: pass gradient_accumulation_steps=xxx when instantiating the Accelerator and put each training loop step under a with accelerator.accumulate(model): block. Accelerate then handles the loss re-scaling and gradient accumulation, avoiding slowdowns in distributed training since gradients only need to be synced when you actually want to step. More details in the documentation.

  • Add gradient accumulation doc by @muellerzr in #511
  • Make gradient accumulation work with dispatched dataloaders by @muellerzr in #510
  • Introduce automatic gradient accumulation wrapper + fix a few test issues by @muellerzr in #484
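The bookkeeping behind the accumulate() wrapper can be sketched without any framework: re-scale each micro-batch loss by 1/gradient_accumulation_steps so the accumulated gradient matches a single big-batch gradient, and only step (and sync gradients) on the boundary micro-batch. The GradientAccumulator class below is an illustrative toy, not Accelerate's actual implementation.

```python
class GradientAccumulator:
    """Toy sketch of gradient-accumulation bookkeeping.

    Illustrative only -- Accelerate's accumulate() context manager does
    this (plus gradient-sync control) internally.
    """

    def __init__(self, gradient_accumulation_steps):
        self.steps = gradient_accumulation_steps
        self.count = 0

    def scale_loss(self, loss):
        # Average the loss over the accumulation window so that summing
        # N scaled gradients equals one gradient over the full batch.
        return loss / self.steps

    def should_step(self):
        # True only on the boundary micro-batch -- the one point where
        # distributed training needs to sync gradients across workers.
        self.count += 1
        return self.count % self.steps == 0


# Simulate 8 micro-batches with an accumulation window of 4:
acc = GradientAccumulator(gradient_accumulation_steps=4)
stepped = [acc.should_step() for _ in range(8)]
# The optimizer would step only on micro-batches 4 and 8.
```

In a real training loop the should_step() decision is what accelerator.accumulate(model) makes for you, which is why optimizer.step() can be called unconditionally inside the block.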

Support for SageMaker Data parallelism

Accelerate now supports SageMaker's own flavor of data parallelism.

  • SageMaker enhancements to allow custom docker image, input channels referring to s3/remote data locations and metrics logging by @pacman100 in #504
  • SageMaker DP Support by @pacman100 in #494

What's new?

  • Fix accelerate tests command by @sgugger in #528
  • FSDP integration enhancements and fixes by @pacman100 in #522
  • Warn user if no trackers are installed by @muellerzr in #524
  • Fixup all example CI tests and properly fail by @muellerzr in #517
  • fixing deepspeed multi-node launcher by @pacman100 in #514
  • Add special Parameters modules support by @younesbelkada in #519
  • Don't unwrap in save_state() by @cccntu in #489
  • Fix a bug when reduce a tensor. by @wwhio in #513
  • Add benchmarks by @sgugger in #506
  • Fix DispatchDataLoader length when split_batches=True by @sgugger in #509
  • Fix scheduler in gradient accumulation example by @muellerzr in #500
  • update dataloader wrappers to have total_batch_size attribute by @pacman100 in #493
  • add use_distributed property by @ZhiyuanChen in #487
  • fixing fsdp autowrap functionality by @pacman100 in #475
  • Use datasets 2.2.0 for now by @muellerzr in #481
  • Rm gradient accumulation on TPU by @muellerzr in #479
  • Revert "Pin datasets for now" (#477) by @muellerzr
  • Pin datasets for now by @muellerzr in #477
  • Some typos and cosmetic fixes by @douwekiela in #472
  • Fix when TPU device check is ran by @muellerzr in #469
  • Refactor Utility Documentation by @muellerzr in #467
  • Add docbuilder to quality by @muellerzr in #468
  • Expose some is_*_available utils in docs by @muellerzr in #466
  • Cleanup CI Warnings by @muellerzr in #465
  • Link CI slow runners to the commit by @muellerzr in #464
  • Fix subtle bug in BF16 by @muellerzr in #463
  • Include bf16 support for TPUs and CPUs, and a better check for if a CUDA device supports BF16 by @muellerzr in #462
  • Handle bfloat16 weights in disk offload without adding memory overhead by @noamwies in #460
  • Handle bfloat16 weights in disk offload by @sgugger in #460
  • Raise a clear warning if a user tries to modify the AcceleratorState by @muellerzr in #458
  • Right step point by @muellerzr in #459
  • Better checks for if a TPU device exists by @muellerzr in #456
  • Offload and modules with unused submodules by @sgugger in #442
