Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770
from datasets import IterableDataset
from torch.utils.data import DataLoader
ids = IterableDataset.from_spark(df)
ids = ids.map(...).filter(...).with_format("torch")
for batch in DataLoader(ids, batch_size=16, num_workers=4):
    ...
IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:
from datasets import load_dataset
ids = load_dataset("c4", "en", split="train", streaming=True)
ids = ids.map(...).with_format("torch") # to get PyTorch tensors - also works with tf, np, jax etc.
Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893
from datasets import IterableDataset
ids = IterableDataset.from_file("path/to/data.arrow")
Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944
from datasets import load_dataset
ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
stopping_strategy of shuffled interleaved dataset (random cycling case) by @mariosasko in https://github.com/huggingface/datasets/pull/5816
BuilderConfig by @Laurent2916 in https://github.com/huggingface/datasets/pull/5824
accelerate as metric's test dependency to fix CI error by @mariosasko in https://github.com/huggingface/datasets/pull/5848
date_format param to the CSV reader by @mariosasko in https://github.com/huggingface/datasets/pull/5845
fn_kwargs to map and filter of IterableDataset and IterableDatasetDict by @yuukicammy in https://github.com/huggingface/datasets/pull/5810
FixedSizeListArray casting by @mariosasko in https://github.com/huggingface/datasets/pull/5897
DatasetBuilder.as_dataset when file_format is not "arrow" by @mariosasko in https://github.com/huggingface/datasets/pull/5915
flatten_indices to DatasetDict by @maximxlss in https://github.com/huggingface/datasets/pull/5907
batch_size optional, and minor improvements in Dataset.to_tf_dataset by @alvarobartt in https://github.com/huggingface/datasets/pull/5883
to_numpy when None values in the sequence by @qgallouedec in https://github.com/huggingface/datasets/pull/5933
Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...zef
Fetched April 7, 2026