DeepSpeed Ulysses/ALST integration
DeepSpeed Ulysses/ALST makes training on long sequences efficient by combining sequence parallelism with attention head parallelism. You can learn more about it in the paper https://arxiv.org/abs/2506.13996 or in the DeepSpeed tutorial https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/.
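As a rough illustration of the two kinds of parallelism mentioned above, here is a toy sketch in plain Python. It is not the DeepSpeed implementation: lists stand in for tensors, the helper names `shard_sequence` and `heads_for_rank` are hypothetical, and it assumes contiguous sequence chunks and a head count divisible by the number of ranks.

```python
# Toy illustration only: plain lists stand in for tensors and GPU ranks.
# shard_sequence and heads_for_rank are hypothetical helpers, not library APIs.

def shard_sequence(tokens, sp_size):
    """Split a sequence into sp_size contiguous chunks, one per rank (assumed layout)."""
    if len(tokens) % sp_size != 0:
        raise ValueError("sequence length must be divisible by sp_size in this sketch")
    chunk = len(tokens) // sp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(sp_size)]


def heads_for_rank(num_heads, sp_size, rank):
    """Attention heads a rank would own while attention is being computed."""
    if num_heads % sp_size != 0:
        raise ValueError("num_heads must be divisible by sp_size in this sketch")
    per_rank = num_heads // sp_size
    return list(range(rank * per_rank, (rank + 1) * per_rank))


tokens = list(range(8))
shards = shard_sequence(tokens, sp_size=2)

# Outside attention: each rank holds a slice of the sequence, with all heads.
for rank, shard in enumerate(shards):
    print(f"rank {rank} holds tokens {shard}")

# Inside attention: each rank sees the full sequence, but only a slice of the heads.
for rank in range(2):
    print(f"rank {rank} attends over all {len(tokens)} tokens with heads {heads_for_rank(4, 2, rank)}")
```

In the real system the switch between the two layouts is done with all-to-all communication between the ranks; the sketch only shows which data each rank would own in each phase.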
<img width="2368" height="1250" alt="0d8bd9e0" src="https://github.com/user-attachments/assets/b94e90c9-4368-4711-ad57-58de3c714ebc" />

To enable DeepSpeed Ulysses, first create a ParallelismConfig and set the sequence-parallelism (sp) arguments:
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)
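Then, make sure to compute the loss correctly, as described in our docs. Each rank only sees its own shard of the sequence, so the per-rank losses have to be gathered and weighted by the number of non-ignored tokens on each rank: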
...
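# Gather the per-rank losses (this all_gather is differentiable)
losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
# Count the tokens on this rank that are not ignored (label != -100)
good_tokens = (shift_labels != -100).view(-1).sum()
good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
# Weight each rank's loss by its valid-token count, skipping ranks with none
total_loss = sum(
    losses_per_rank[rank] * good_tokens_per_rank[rank]
    for rank in range(sp_world_size)
    if good_tokens_per_rank[rank] > 0
)
total_good_tokens = sum(good_tokens_per_rank)
# Normalize by the total valid-token count, guarding against division by zero
loss = total_loss / max(total_good_tokens, 1)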
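To see why the weighting matters, here is a small worked example in plain Python. It mirrors the arithmetic of the snippet above with ordinary floats instead of tensors and without any communication; `aggregate_sp_loss` is a hypothetical helper name used only for this illustration.

```python
# Plain-Python sketch of the weighted aggregation; no torch, no process groups.
# aggregate_sp_loss is a hypothetical name used only for this illustration.

def aggregate_sp_loss(losses_per_rank, good_tokens_per_rank):
    """Combine per-rank mean losses into one mean over all valid tokens."""
    total_loss = sum(
        loss * count
        for loss, count in zip(losses_per_rank, good_tokens_per_rank)
        if count > 0  # skip ranks with no valid tokens; their loss may be NaN
    )
    total_good_tokens = sum(good_tokens_per_rank)
    return total_loss / max(total_good_tokens, 1)


# Rank 0 has 3 valid tokens with mean loss 2.0; rank 1 has 1 valid token with loss 4.0.
losses = [2.0, 4.0]
counts = [3, 1]

weighted = aggregate_sp_loss(losses, counts)  # (2.0 * 3 + 4.0 * 1) / 4
naive = sum(losses) / len(losses)             # treats both ranks as equally sized

print(weighted, naive)

# A rank whose shard contains only ignored labels is left out of the sum entirely.
only_rank0 = aggregate_sp_loss([2.0, float("nan")], [4, 0])
print(only_rank0)
```

The naive average over-weights the rank with fewer valid tokens, which is why the per-rank token counts are gathered alongside the losses.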
Thanks to @S1ro1 for starting this work and to @stas00 for finishing it. Thanks also to @kashif for adding docs and for reviewing and testing this PR!
This feature will also be available in HF Trainer thanks to this PR from @stas00: https://github.com/huggingface/transformers/pull/41832
Minor changes
- Remove warning for cpu_ram_efficient_loading by @SunMarc in https://github.com/huggingface/accelerate/pull/3816
- update typo in bnb quantisation 4bit flag docstring by @hbraith in https://github.com/huggingface/accelerate/pull/3828
- ArXiv -> HF Papers by @qgallouedec in https://github.com/huggingface/accelerate/pull/3834
- Fix typo in broadcast_object_list docstring by @wsntxxn in https://github.com/huggingface/accelerate/pull/3823
- [Bug] Update torch.optim.Optimizer parameter states after tensor parallelism by @naomili0924 in https://github.com/huggingface/accelerate/pull/3835
- use self hosted runner by @SunMarc in https://github.com/huggingface/accelerate/pull/3841
- device type helper by @kashif in https://github.com/huggingface/accelerate/pull/3843
New Contributors
- @hbraith made their first contribution in https://github.com/huggingface/accelerate/pull/3828
- @wsntxxn made their first contribution in https://github.com/huggingface/accelerate/pull/3823
- @naomili0924 made their first contribution in https://github.com/huggingface/accelerate/pull/3835
Full Changelog: https://github.com/huggingface/accelerate/compare/v1.11.0...v1.12.0
