v1.12.0: Deepspeed Ulysses/ALST
Deepspeed Ulysses/ALST (Arctic Long Sequence Training) is an efficient way to train on long sequences by combining sequence parallelism with attention head parallelism. You can learn more about this technology in the paper https://arxiv.org/abs/2506.13996 or the DeepSpeed tutorial https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/.
<img width="2368" height="1250" alt="0d8bd9e0" src="https://github.com/user-attachments/assets/b94e90c9-4368-4711-ad57-58de3c714ebc" />

To enable Deepspeed Ulysses, you first need to create a `ParallelismConfig` and set the sequence-parallelism (SP) related args:

```python
# import ParallelismConfig and DeepSpeedSequenceParallelConfig from accelerate
# (exact import paths depend on your accelerate version)
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)
```
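For context, here is a minimal sketch of how the config might be wired into a training script, assuming the `parallelism_config` keyword on `Accelerator` (the `model`, `optimizer`, and `dataloader` objects are placeholders for your own):

```python
from accelerate import Accelerator

# `parallelism_config` is the ParallelismConfig built above
accelerator = Accelerator(parallelism_config=parallelism_config)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```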
Then, you need to make sure to compute the correct loss as described in our docs:

```python
...
# gather the per-rank mean losses across the sequence-parallel group
losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
# count the non-ignored (label != -100) tokens on this rank
good_tokens = (shift_labels != -100).view(-1).sum()
good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
# weight each rank's loss by its number of valid tokens
total_loss = sum(
    losses_per_rank[rank] * good_tokens_per_rank[rank]
    for rank in range(sp_world_size)
    if good_tokens_per_rank[rank] > 0
)
total_good_tokens = sum(good_tokens_per_rank)
loss = total_loss / max(total_good_tokens, 1)
```
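To see why the token-weighted reduction matters, here is a toy, non-distributed sanity check with made-up per-rank numbers; a naive mean of the per-rank losses would over-weight ranks that saw fewer valid tokens:

```python
# Hypothetical values: rank 0 averaged loss 2.0 over 10 valid tokens,
# rank 1 averaged loss 1.0 over 30 valid tokens.
losses_per_rank = [2.0, 1.0]
good_tokens_per_rank = [10, 30]

total_loss = sum(
    losses_per_rank[rank] * good_tokens_per_rank[rank]
    for rank in range(len(losses_per_rank))
    if good_tokens_per_rank[rank] > 0
)
loss = total_loss / max(sum(good_tokens_per_rank), 1)
print(loss)  # 1.25 -- not the naive mean 1.5, which would over-weight the short shard
```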
Thanks to @S1ro1 for starting this work and to @stas00 for finishing it. Also thanks to @kashif for adding docs and reviewing/testing this PR!
This feature will also be available in HF Trainer thanks to this PR from @stas00: https://github.com/huggingface/transformers/pull/41832
* cpu_ram_efficient_loading by @SunMarc in https://github.com/huggingface/accelerate/pull/3816

Full Changelog: https://github.com/huggingface/accelerate/compare/v1.11.0...v1.12.0