
$ npx -y @buildinternet/releases show rel_3Go4HmsCNCSbae6ZgmmfF

v0.4.0 Experimental DeepSpeed support

This release adds experimental support for DeepSpeed. While the basics are there to support ZeRO-2 and ZeRO-3, as well as CPU and NVMe offload, the API may evolve a bit as we polish it in the near future.
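For reference, ZeRO stages and offload are configured on the DeepSpeed side through its JSON config file. A minimal sketch of what enabling ZeRO-3 with CPU optimizer offload and NVMe parameter offload can look like (key names follow DeepSpeed's documented config schema; the batch size and nvme_path are illustrative placeholders):

```json
{
  "train_micro_batch_size_per_gpu": 8,
  "zero_optimization": {
    "stage": 3,
    "offload_optimizer": {
      "device": "cpu"
    },
    "offload_param": {
      "device": "nvme",
      "nvme_path": "/local_nvme"
    }
  }
}
```

Setting "stage" to 2 and dropping "offload_param" gives the ZeRO-2 variant mentioned above.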

It also adds support for multi-node CPU training. In both cases, just filling out the questionnaire produced by accelerate config and then launching your script with accelerate launch is enough; there are no changes in the main API.
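In practice the workflow is two commands (a sketch, shown rather than executed here; the questionnaire answers depend on your hardware, and train.py with its arguments is a hypothetical placeholder for your own script):

```shell
# Answer the interactive questionnaire once; it writes a default config file
accelerate config

# Launch the unchanged training script through the launcher,
# which reads that config to pick DeepSpeed, multi-CPU, etc.
accelerate launch train.py --your-args
```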

DeepSpeed support

  • Add DeepSpeed support #82 (@vasudevgupta7)
  • DeepSpeed documentation #140 (@sgugger)

Multinode CPU support

  • Add distributed multi-node cpu only support (MULTI_CPU) #63 (@ddkalamk)

Various fixes

  • Fix batch_sampler error for IterableDataset #62 (@ddkalamk)
  • Honor namedtuples in inputs/outputs #67 (@sgugger)
  • Fix examples README #70 (@cccntu)
  • TPU not available in kaggle #73 (@yuangan)
  • Pass args in notebook_launcher for multi-GPU #78 (@sgugger)
  • Fix accelerate test with no config file #79 (@cccntu)
  • Use optimizer for consistency #81 (@kumapo)
  • Update README.md #87 (@Separius)
  • Add unscale_gradients method. #88 (@sgugger)
  • Add Accelerator.free_memory #89 (@sgugger)
  • [Feature] Add context manager to allow main process first. #98 (@Guillem96)
  • Pass along kwargs to backward #104 (@sgugger)
  • Add course banner #107 (@sgugger)
  • added closure argument to optimizer.step() #105 (@pmelchior)
  • Fix import error for torch 1.4.0 #108 (@sgugger)
  • Unwrap optimizer before unscaling #115 (@sgugger)
  • Fix DataLoader length when split_batches=True #121 (@sgugger)
  • Fix OptimWrapper init #127 (@sgugger)
  • Fix fp16 by converting outputs back to FP32 #134 (@sgugger)
  • Add caveat on weight-tying on TPUs #138 (@sgugger)
  • Add optimizer not stepped property #139 (@sgugger)
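Several of the fixes above surface new Accelerator methods (#88 unscale_gradients, #89 free_memory, #98 the main-process-first context manager). A minimal sketch of how they can be combined in a training step, assuming accelerate and torch are installed and using the method names from the current accelerate API:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()
model = torch.nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = accelerator.prepare(model, optimizer)

# #98: let the main process run first, e.g. to download or cache a dataset
# before the other processes read it
with accelerator.main_process_first():
    pass  # dataset download / preprocessing would go here

# #88: unscale gradients before clipping so clipping sees true magnitudes
# (a no-op when mixed precision is disabled)
loss = model(torch.randn(8, 4)).sum()
accelerator.backward(loss)
accelerator.unscale_gradients()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
optimizer.zero_grad()

# #89: release internal references and run the garbage collector
# between training phases
accelerator.free_memory()
```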

Fetched April 7, 2026