Parallel implementation of to_tf_dataset() by @Rocketknight1 in https://github.com/huggingface/datasets/pull/5377
Pass num_workers= to .to_tf_dataset() to speed up your dataset with multiprocessing
Distributed support by @lhoestq in https://github.com/huggingface/datasets/pull/5369
Works with both Dataset and IterableDataset (e.g. in streaming mode)
import os
from datasets.distributed import split_dataset_by_node
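# RANK and WORLD_SIZE are set by the distributed launcher (e.g. torchrun)
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
# ds is a Dataset or IterableDataset loaded beforehand; each node keeps only its own share
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)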
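As a rough illustration of what splitting by node means for a streamed dataset, the sketch below gives each rank every world_size-th example. It is a simplified stand-in using only the standard library, not the datasets implementation: split_iterable_by_node is a hypothetical name, and the real split_dataset_by_node may instead assign whole shards or contiguous chunks to each node.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def split_iterable_by_node(examples: Iterable[T], rank: int, world_size: int) -> Iterator[T]:
    """Hypothetical helper: yield every world_size-th example, starting at index rank."""
    if world_size < 1 or not 0 <= rank < world_size:
        raise ValueError(f"need 0 <= rank < world_size, got rank={rank}, world_size={world_size}")
    # islice(iterable, start, stop, step) strides over the stream lazily,
    # so nothing is loaded into memory ahead of time.
    return islice(examples, rank, None, world_size)


# Simulate a 3-node job inside one process: each "node" takes its share
# of the same 10-example stream.
world_size = 3
shares: List[List[int]] = [
    list(split_iterable_by_node(range(10), rank, world_size))
    for rank in range(world_size)
]

for rank, share in enumerate(shares):
    print(f"rank {rank}: {share}")
```

The shares are disjoint and together cover the whole stream, which is the property a distributed training loop relies on so that no example is seen twice in an epoch.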
Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in https://github.com/huggingface/datasets/pull/5400
Tqdm progress bar for to_parquet by @zanussbaum in https://github.com/huggingface/datasets/pull/5456
ZIP files support in iter_archive with better compression type check by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3379
Support other formats than uint8 for image arrays by @vigsterkr in https://github.com/huggingface/datasets/pull/5365
fs.open resource leaks by @tkukurin in https://github.com/huggingface/datasets/pull/5358
cast_to_python_objects by @mariosasko in https://github.com/huggingface/datasets/pull/5384
load_dataset docstring by @mariosasko in https://github.com/huggingface/datasets/pull/5389
shard_size arg from .push_to_hub() by @polinaeterna in https://github.com/huggingface/datasets/pull/5469
Full Changelog: https://github.com/huggingface/datasets/compare/2.8.0...2.9.0