Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595
# Build streaming data pipelines in a few lines of code!
from datasets import load_dataset
ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606
# Faster push to the Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
New Column object
# Syntax:
ds["column_name"] # datasets.Column([...]) or datasets.IterableColumn(...)
# Iterate on a column:
for text in ds["text"]:
    ...
# Load one cell without bringing the full column in memory
first_text = ds["text"][0] # equivalent to ds[0]["text"]
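To make the "one cell without the full column" idea concrete, here is a toy sketch in plain Python. `LazyColumnView` is a hypothetical class written for illustration only; it is not the `datasets.Column` implementation and only mimics the access pattern shown above (indexing one cell, iterating cell by cell).

```python
class LazyColumnView:
    """Toy stand-in for a column view (hypothetical, not datasets.Column)."""

    def __init__(self, rows, name):
        self._rows = rows  # any indexable collection of row dicts
        self._name = name  # the column to expose

    def __len__(self):
        return len(self._rows)

    def __getitem__(self, index):
        # Touch only the requested row instead of building the full column.
        return self._rows[index][self._name]

    def __iter__(self):
        # Yield cells one at a time, never materializing a full list.
        for row in self._rows:
            yield row[self._name]


rows = [{"text": "hello", "label": 0}, {"text": "world", "label": 1}]
column = LazyColumnView(rows, "text")

first_text = column[0]    # same value as rows[0]["text"]
all_texts = list(column)  # materialize only when explicitly asked
```

The design point is that indexing and iteration are deferred to the underlying rows, so nothing is copied until a caller asks for it.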
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset
ds = load_dataset(..., streaming=True)
for example in ds:
    video = example["video"]
    frames = video.get_frames_in_range(start=0, stop=6, step=1) # only stream certain frames
Requires torch>=2.7.0 and FFmpeg >= 4; otherwise, use datasets<4.0.
AudioDecoder:
audio = dataset[0]["audio"] # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples() # or use get_samples_played_in_range(...)
samples.data # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]])
samples.sample_rate # 16000
# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
VideoDecoder:
video = dataset[0]["video"] # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape # (3, 240, 320)
first_frame.pts_seconds # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape # torch.Size([6, 3, 240, 320]), since the range [0, 6) with step 1 covers 6 frames
Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
List type:
from datasets import Features, List, Value
features = Features({
"texts": List(Value("string")),
"four_paragraphs": List(Value("string"), length=4)
})
Sequence was a legacy type from TensorFlow Datasets that converted a list of dicts into a dict of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature:
from datasets import Sequence
Sequence(Value("string")) # List(Value("string"))
Sequence({"texts": Value("string")}) # {"texts": List(Value("string"))}
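The list-of-dicts to dict-of-lists conversion that the legacy Sequence applied can be illustrated in plain Python. This is only a sketch of the data transformation, not code from the library: the helper name `list_of_dicts_to_dict_of_lists` is made up for illustration, and it assumes every row has the same keys.

```python
def list_of_dicts_to_dict_of_lists(rows):
    """Transpose [{"a": 1}, {"a": 2}] into {"a": [1, 2]}."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            # Create the per-key list on first sight, then append in row order.
            columns.setdefault(key, []).append(value)
    return columns


rows = [{"texts": "hello"}, {"texts": "world"}]
columns = list_of_dicts_to_dict_of_lists(rows)
print(columns)  # {'texts': ['hello', 'world']}
```

This mirrors the second Sequence line above: a dict subfeature yields a dict whose values are lists, rather than a list whose items are dicts.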
Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
RepeatExamplesIterable by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
_dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
Full Changelog: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0