v4.6.0

Dataset Features

Support Image, Video and Audio types in Lance datasets

Infer types from lance blobs by @lhoestq in https://github.com/huggingface/datasets/pull/7966

>>> from datasets import load_dataset
>>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
>>> ds.features
{'video_blob': Video(),
 'video_path': Value('string'),
 'caption': Value('string'),
 'aesthetic_score': Value('float64'),
 'motion_score': Value('float64'),
 'temporal_consistency_score': Value('float64'),
 'camera_motion': Value('string'),
 'frame': Value('int64'),
 'fps': Value('float64'),
 'seconds': Value('float64'),
 'embedding': List(Value('float32'), length=1024)}

Push to hub now supports Video types

push_to_hub() for videos by @lhoestq in https://github.com/huggingface/datasets/pull/7971

>>> from datasets import Dataset, Video
>>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
>>> ds = ds.cast_column("video", Video())
>>> ds.push_to_hub("username/my-video-dataset")

Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7976
- This enables cross-format Xet deduplication for image/audio/video data, e.g. videos are deduplicated across Lance, WebDataset, Parquet files and plain video files, which makes uploads to and downloads from Hugging Face faster
- For example, if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload is much faster because the videos don't need to be re-uploaded. Under the hood, Xet storage reuses the binary chunks of the videos stored in Lance format for the videos stored in Parquet format
- See more info here: https://huggingface.co/docs/hub/en/xet/deduplication
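
To make the chunk reuse described above concrete, here is a toy, standard-library-only sketch of content-defined chunking with a content-addressed store. It is not Xet's actual algorithm or API: the window size, the divisor, the `chunks` and `upload` helpers and the two fake "containers" are all assumptions made for illustration. The point is only that when the same raw video bytes are written as is inside two different file formats, most chunks of the second upload are already in the store.

```python
import hashlib
import random
import zlib

WINDOW = 16   # bytes inspected to decide whether to cut a chunk (assumed value)
DIVISOR = 64  # a cut happens on average every ~64 bytes (assumed value)


def chunks(data: bytes) -> list[bytes]:
    """Split data at positions chosen from the local content, not from offsets."""
    out, start = [], 0
    for i in range(WINDOW, len(data)):
        # The boundary decision depends only on the previous WINDOW bytes, so the
        # same byte run produces the same cuts wherever it sits in a file.
        if zlib.crc32(data[i - WINDOW:i]) % DIVISOR == 0:
            out.append(data[start:i])
            start = i
    out.append(data[start:])
    return out


def upload(data: bytes, store: dict[str, bytes]) -> int:
    """Add unseen chunks to the store and return how many new bytes were stored."""
    new_bytes = 0
    for chunk in chunks(data):
        key = hashlib.sha256(chunk).hexdigest()
        if key not in store:
            store[key] = chunk
            new_bytes += len(chunk)
    return new_bytes


rng = random.Random(0)


def random_bytes(n: int) -> bytes:
    return bytes(rng.getrandbits(8) for _ in range(n))


# Hypothetical payload: the same raw video embedded in two different containers.
video = random_bytes(20_000)
lance_like = random_bytes(100) + video + random_bytes(50)
parquet_like = random_bytes(300) + video + random_bytes(80)

store: dict[str, bytes] = {}
first = upload(lance_like, store)     # nearly everything is new
second = upload(parquet_like, store)  # mostly reuses chunks from the first upload
print(f"first upload stored {first} bytes, second stored {second} bytes")
```

In this sketch the second upload only stores the chunks that touch the differing header and footer. This also hints at why the blobs need to be written as is: if the container compressed or re-encoded the video bytes, the chunks would no longer match.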

Add IterableDataset.reshard() by @lhoestq in https://github.com/huggingface/datasets/pull/7992

Reshard the dataset if possible, i.e. split the current shards further into more shards. The resulting dataset has num_shards >= previous_num_shards, with equality when no shard can be split further.

The resharding mechanism depends on the dataset file format:
- Parquet: shard per row group instead of per file
- Other: not implemented yet (contributions are welcome!)
```
>>> from datasets import load_dataset
>>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
>>> ds
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 4
})
>>> ds.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})
```
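
The Parquet rule above (one shard per row group instead of one per file) can be modelled with a small standard-library sketch. This is not the library's implementation: the `Shard` class, the `shards_per_file` and `reshard` helpers and the file names are hypothetical, and each Parquet file is reduced to its number of row groups.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Shard:
    file: str                    # hypothetical Parquet file name
    row_groups: tuple[int, ...]  # indices of the row groups this shard reads


def shards_per_file(files: dict[str, int]) -> list[Shard]:
    """Default layout in this toy model: one shard per file, covering all its row groups."""
    return [Shard(name, tuple(range(n))) for name, n in files.items()]


def reshard(shards: list[Shard]) -> list[Shard]:
    """Split every shard that spans several row groups into one shard per row group."""
    out: list[Shard] = []
    for shard in shards:
        if len(shard.row_groups) <= 1:
            out.append(shard)  # cannot be split further
        else:
            out.extend(Shard(shard.file, (rg,)) for rg in shard.row_groups)
    return out


# Hypothetical dataset: two files with 3 and 1 row groups.
files = {"train-00000.parquet": 3, "train-00001.parquet": 1}
before = shards_per_file(files)
after = reshard(before)
print(len(before), len(after))
```

Calling `reshard` again on `after` returns the same number of shards, which is the equality case: every shard already holds a single row group. More shards generally help when the data is distributed across many workers or nodes, since each one needs at least one shard to read from.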

What's Changed

Fix load_from_disk progress bar with redirected stdout by @omarfarhoud in https://github.com/huggingface/datasets/pull/7919
Revert "feat: avoid some copies in torch formatter (#7787)" by @lhoestq in https://github.com/huggingface/datasets/pull/7961
docs: fix grammar and add type hints in splits.py by @Edge-Explorer in https://github.com/huggingface/datasets/pull/7960
Fix interleave_datasets with all_exhausted_without_replacement strategy by @prathamk-tw in https://github.com/huggingface/datasets/pull/7955
Add examples for Lance datasets by @prrao87 in https://github.com/huggingface/datasets/pull/7950
Support null in json string cols by @lhoestq in https://github.com/huggingface/datasets/pull/7963
handle blob lance by @lhoestq in https://github.com/huggingface/datasets/pull/7964
Count examples in lance by @lhoestq in https://github.com/huggingface/datasets/pull/7969
Use temp files in push_to_hub to save memory by @lhoestq in https://github.com/huggingface/datasets/pull/7979
Drop python 3.9 by @lhoestq in https://github.com/huggingface/datasets/pull/7980
Support pandas 3 by @lhoestq in https://github.com/huggingface/datasets/pull/7981
Remove unused data files optims by @lhoestq in https://github.com/huggingface/datasets/pull/7985
Remove pre-release workaround in CI for transformers v5 and huggingface_hub v1 by @hanouticelina in https://github.com/huggingface/datasets/pull/7989
very basic support for more hf urls by @lhoestq in https://github.com/huggingface/datasets/pull/8003
Bump fsspec upper bound to 2026.2.0 (fixes #7994) by @jayzuccarelli in https://github.com/huggingface/datasets/pull/7995
Fix: make environment variable naming consistent (issue #7998) by @AnkitAhlawat7742 in https://github.com/huggingface/datasets/pull/8000
More IterableDataset.from_x methods and docs and polars.Lazyframe support by @lhoestq in https://github.com/huggingface/datasets/pull/8009
Support empty shard in from_generator by @lhoestq in https://github.com/huggingface/datasets/pull/8023
Allow import polars in map() by @lhoestq in https://github.com/huggingface/datasets/pull/8024

New Contributors

@omarfarhoud made their first contribution in https://github.com/huggingface/datasets/pull/7919
@Edge-Explorer made their first contribution in https://github.com/huggingface/datasets/pull/7960
@prathamk-tw made their first contribution in https://github.com/huggingface/datasets/pull/7955
@prrao87 made their first contribution in https://github.com/huggingface/datasets/pull/7950
@hanouticelina made their first contribution in https://github.com/huggingface/datasets/pull/7989
@jayzuccarelli made their first contribution in https://github.com/huggingface/datasets/pull/7995
@AnkitAhlawat7742 made their first contribution in https://github.com/huggingface/datasets/pull/8000

Full Changelog: https://github.com/huggingface/datasets/compare/4.5.0...4.6.0
