v1.12.0: Deepspeed Ulysses/ALST
Deepspeed Ulysses/ALST (Arctic Long Sequence Training) is an efficient way to train on long sequences by combining sequence parallelism with attention head parallelism. You can learn more about this technology in the paper https://arxiv.org/abs/2506.13996 or the DeepSpeed tutorial https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/.
<img width="2368" height="1250" alt="0d8bd9e0" src="https://github.com/user-attachments/assets/b94e90c9-4368-4711-ad57-58de3c714ebc" />

To enable Deepspeed Ulysses, you first need to create a `ParallelismConfig` and set the sequence-parallelism (SP) related args:

```python
# import ParallelismConfig and DeepSpeedSequenceParallelConfig from accelerate
# (exact import paths depend on your accelerate version)
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)
```
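For context, here is a minimal sketch of how the config might be wired into a training script, assuming the `parallelism_config` keyword on `Accelerator` (the `model`, `optimizer`, and `dataloader` objects are placeholders for your own):

```python
from accelerate import Accelerator

# `parallelism_config` is the ParallelismConfig built above
accelerator = Accelerator(parallelism_config=parallelism_config)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```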
Then, you need to make sure to compute the correct loss as described in our docs:

```python
...
# gather the per-rank mean losses across the sequence-parallel group
losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
# count the non-ignored (label != -100) tokens on this rank
good_tokens = (shift_labels != -100).view(-1).sum()
good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
# weight each rank's loss by its number of valid tokens
total_loss = sum(
    losses_per_rank[rank] * good_tokens_per_rank[rank]
    for rank in range(sp_world_size)
    if good_tokens_per_rank[rank] > 0
)
total_good_tokens = sum(good_tokens_per_rank)
loss = total_loss / max(total_good_tokens, 1)
```
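To see why the token-weighted reduction matters, here is a toy, non-distributed sanity check with made-up per-rank numbers; a naive mean of the per-rank losses would over-weight ranks that saw fewer valid tokens:

```python
# Hypothetical values: rank 0 averaged loss 2.0 over 10 valid tokens,
# rank 1 averaged loss 1.0 over 30 valid tokens.
losses_per_rank = [2.0, 1.0]
good_tokens_per_rank = [10, 30]

total_loss = sum(
    losses_per_rank[rank] * good_tokens_per_rank[rank]
    for rank in range(len(losses_per_rank))
    if good_tokens_per_rank[rank] > 0
)
loss = total_loss / max(sum(good_tokens_per_rank), 1)
print(loss)  # 1.25 -- not the naive mean 1.5, which would over-weight the short shard
```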
Thanks to @S1ro1 for starting this work and to @stas00 for finishing it. Also thanks to @kashif for adding docs and reviewing/testing this PR!
This feature will also be available in HF Trainer thanks to this PR from @stas00: https://github.com/huggingface/transformers/pull/41832
* cpu_ram_efficient_loading by @SunMarc in https://github.com/huggingface/accelerate/pull/3816

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.11.0...v1.12.0