
4.6.0

February 25, 2026 | Datasets

Dataset Features

  • Support Image, Video and Audio types in Lance datasets

    >>> from datasets import load_dataset
    >>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
    >>> ds.features
    {'video_blob': Video(),
     'video_path': Value('string'),
     'caption': Value('string'),
     'aesthetic_score': Value('float64'),
     'motion_score': Value('float64'),
     'temporal_consistency_score': Value('float64'),
     'camera_motion': Value('string'),
     'frame': Value('int64'),
     'fps': Value('float64'),
     'seconds': Value('float64'),
     'embedding': List(Value('float32'), length=1024)}
    
  • Push to hub now supports Video types

    >>> from datasets import Dataset, Video
    >>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
    >>> ds = ds.cast_column("video", Video())
    >>> ds.push_to_hub("username/my-video-dataset")
    
  • Write image/audio/video blobs as-is in Parquet (PLAIN encoding) in push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7976

    • This enables cross-format Xet deduplication for image/audio/video data. For example, videos can be deduplicated across Lance, WebDataset, and Parquet files as well as plain video files, which makes uploads to and downloads from Hugging Face faster.
    • For example, if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload is much faster because the videos don't need to be re-uploaded. Under the hood, Xet storage reuses the binary chunks of the videos stored in Lance format for the videos in Parquet format.
    • See more info here: https://huggingface.co/docs/hub/en/xet/deduplication
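
The chunk reuse described above can be illustrated with a toy content-defined chunker. This is only a sketch under stated assumptions, not Xet's actual algorithm or parameters: the Gear-style hash table, the 256-byte average chunk size, the `ChunkStore` class, and the fake "container" header and footer are all hypothetical. It shows why storing the media bytes unmodified matters: the same byte run yields the same chunks regardless of which file wraps it.

```python
import hashlib
import random

# Fixed random table for a Gear-style rolling hash.
# Illustrative values only; these are NOT Xet's real parameters.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]
BOUNDARY_BITS = 8  # a boundary roughly every 2**8 = 256 bytes on average


def chunk(data: bytes) -> list:
    """Split data at content-defined boundaries.

    The 32-bit hash only depends on the last 32 bytes seen, so the same
    byte run produces the same boundaries wherever it sits inside a file.
    """
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h >> (32 - BOUNDARY_BITS) == 0:  # top bits all zero: cut here
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


class ChunkStore:
    """Hypothetical chunk-addressed storage keyed by SHA-256 digest."""

    def __init__(self):
        self.chunks = {}

    def upload(self, data: bytes) -> int:
        """Store data and return how many bytes actually had to be sent."""
        sent = 0
        for piece in chunk(data):
            key = hashlib.sha256(piece).hexdigest()
            if key not in self.chunks:
                self.chunks[key] = piece
                sent += len(piece)
        return sent


# Stand-in for raw video bytes (seeded pseudo-random data).
data_rng = random.Random(1)
video = bytes(data_rng.getrandbits(8) for _ in range(64 * 1024))

plain_file = video
# Hypothetical container that embeds the video bytes unmodified.
container_file = b"HYPOTHETICAL-HEADER" * 5 + video + b"FOOTER"

store = ChunkStore()
sent_plain = store.upload(plain_file)
sent_container = store.upload(container_file)
print(f"plain file: sent {sent_plain} of {len(plain_file)} bytes")
print(f"container:  sent {sent_container} of {len(container_file)} bytes")
```

In this sketch the second upload only sends the chunks that touch the header, the footer, and the edges of the embedded video. If the container re-encoded or compressed the video bytes instead, no chunks would match.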
  • Add IterableDataset.reshard() by @lhoestq in https://github.com/huggingface/datasets/pull/7992

    Reshard the dataset if possible, i.e. split the current shards into more, smaller shards. The resulting dataset has num_shards >= previous_num_shards; the two are equal when no shard can be split further.

    The resharding mechanism depends on the dataset file format:

    • Parquet: shard per row group instead of per file
    • Other: not implemented yet (contributions are welcome!)

    >>> from datasets import load_dataset
    >>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
    >>> ds
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 4
    })
    >>> ds.reshard()
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 3600
    })
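
A rough way to picture why per-row-group shards help: when streaming with several data loader workers, each shard is read by one worker, so workers beyond num_shards have nothing to do. The sketch below is plain Python, not the library's implementation. The file names, the even split of 900 row groups per file, the 8 workers, and the round-robin assignment are assumptions chosen to match the 4 and 3600 shard counts in the example above.

```python
# Hypothetical layout: 4 Parquet files with 900 row groups each.
row_groups_per_file = {
    f"train-{i:05d}-of-00004.parquet": 900 for i in range(4)
}

# One shard per file (before resharding).
file_shards = list(row_groups_per_file)

# One shard per (file, row group) pair (after resharding).
row_group_shards = [
    (name, row_group)
    for name, count in row_groups_per_file.items()
    for row_group in range(count)
]


def assign_shards(num_shards: int, num_workers: int) -> list:
    """Round-robin shard indices over workers (an illustrative policy)."""
    return [
        list(range(worker, num_shards, num_workers))
        for worker in range(num_workers)
    ]


def idle_workers(num_shards: int, num_workers: int) -> int:
    """Count workers that receive no shard at all."""
    return sum(1 for shards in assign_shards(num_shards, num_workers) if not shards)


num_workers = 8  # hypothetical data loader setting
before = idle_workers(len(file_shards), num_workers)
after = idle_workers(len(row_group_shards), num_workers)
print(before, after)  # → 4 0
```

With 4 shards, half of the 8 hypothetical workers sit idle; with 3600 shards, every worker gets 450.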
    

Full Changelog: https://github.com/huggingface/datasets/compare/4.5.0...4.6.0

Fetched April 7, 2026