Support Image, Video and Audio types in Lance datasets
>>> from datasets import load_dataset
>>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
>>> ds.features
{'video_blob': Video(),
 'video_path': Value('string'),
 'caption': Value('string'),
 'aesthetic_score': Value('float64'),
 'motion_score': Value('float64'),
 'temporal_consistency_score': Value('float64'),
 'camera_motion': Value('string'),
 'frame': Value('int64'),
 'fps': Value('float64'),
 'seconds': Value('float64'),
 'embedding': List(Value('float32'), length=1024)}
Push to hub now supports Video types
>>> from datasets import Dataset, Video
>>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
>>> ds = ds.cast_column("video", Video())
>>> ds.push_to_hub("username/my-video-dataset")
Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7976
Add IterableDataset.reshard() by @lhoestq in https://github.com/huggingface/datasets/pull/7992
Reshard the dataset if possible, i.e. split the current shards further into more shards. The resulting dataset has num_shards >= previous_num_shards; the two are equal only when no shard can be split further.
The resharding mechanism depends on the dataset file format:
>>> from datasets import load_dataset
>>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
>>> ds
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 4
})
>>> ds.reshard()
IterableDataset({
    features: ['label', 'title', 'content'],
    num_shards: 3600
})
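To make the num_shards >= previous_num_shards guarantee concrete, here is a minimal pure-Python sketch of the idea. It does not use or reproduce the `datasets` implementation: the `Shard` class and the `split_shards` helper are hypothetical names, and it assumes a file format made of independently readable chunks (Parquet row groups are one such case).

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Shard:
    """Hypothetical stand-in for one unit of work in a streaming dataset."""
    file: str
    chunks: Tuple[int, ...]  # indices of the chunks (e.g. row groups) it reads


def split_shards(shards: List[Shard]) -> List[Shard]:
    """Split each shard into one shard per chunk (illustrative only)."""
    result = []
    for shard in shards:
        if len(shard.chunks) <= 1:
            # This shard cannot be split further, so it is kept unchanged.
            result.append(shard)
        else:
            result.extend(Shard(shard.file, (c,)) for c in shard.chunks)
    return result


before = [Shard("a.parquet", (0, 1, 2)), Shard("b.parquet", (0,))]
after = split_shards(before)
print(len(before), "->", len(after))  # prints "2 -> 4"
```

Calling `split_shards(after)` again returns the same four shards, which is the equality case described above.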
transformers v5 and huggingface_hub v1 by @hanouticelina in https://github.com/huggingface/datasets/pull/7989
Full Changelog: https://github.com/huggingface/datasets/compare/4.5.0...4.6.0