
4.1.0

September 15, 2025

Dataset Features

  • feat: use content defined chunking by @kszucs in https://github.com/huggingface/datasets/pull/7589

    • Parquet datasets are now Optimized Parquet!

      <img width="462" height="103" alt="image" src="https://github.com/user-attachments/assets/43703a47-0964-421b-8f01-1a790305de79" />
    • internally uses use_content_defined_chunking=True when writing Parquet files

    • this enables fast deduped uploads to Hugging Face!

    ```python
    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    ```
    • this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content defined chunking defines Parquet page boundaries based on the content of the data, so duplicate data is easy to detect.
    • with this change, the new default row group size for Parquet is set to 100MB
    • write_page_index=True is also used to enable fast random access for the Dataset Viewer and tools that need it
  • Concurrent push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7708

  • Concurrent IterableDataset push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7710

  • HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7690

    • load HDF5 datasets in one line of code

    ```python
    ds = load_dataset("username/dataset-with-hdf5-files")
    ```
    • each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
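The content defined chunking idea behind the Parquet changes above can be sketched in a few lines. This is a toy illustration of the concept, not the actual algorithm used by pyarrow or Xet; the rolling hash, mask, and window values are made up for the demo:

```python
# Toy content defined chunking (CDC): chunk boundaries come from the data
# itself, so inserting bytes only disturbs nearby chunks and the rest can
# be deduplicated. Hash/mask/window here are illustrative, not pyarrow's.

def cdc_chunks(data: bytes, mask: int = 0x3F, window: int = 8) -> list[bytes]:
    """Cut a chunk wherever a rolling hash of the content hits the mask."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) ^ b) & 0xFFFFFFFF  # toy rolling hash of recent bytes
        if i - start >= window and (h & mask) == 0:
            chunks.append(data[start : i + 1])
            start, h = i + 1, 0
    chunks.append(data[start:])
    return chunks

original = bytes(range(256)) * 8
edited = original[:100] + b"XYZ" + original[100:]  # insert 3 bytes mid-file

shared = set(cdc_chunks(original)) & set(cdc_chunks(edited))
print(f"{len(shared)} chunks are byte-identical and need no re-upload")
```

With fixed-size chunking, the 3-byte insertion would shift every later boundary and invalidate all following chunks; with CDC, boundaries resynchronize because they depend on content, which is what makes deduped uploads cheap.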
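The field-to-column mapping for HDF5 can be pictured with h5py directly (this is a hypothetical sketch of the rule described above, not the datasets library's internals; the dotted naming for nested fields is an assumption for the demo):

```python
# Hypothetical sketch: every (possibly nested) HDF5 dataset becomes one
# column, and the first dimension indexes rows. Uses h5py directly.
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.h5")

# Build a small HDF5 file with a flat field and a nested one.
with h5py.File(path, "w") as f:
    f.create_dataset("ids", data=np.arange(4))
    f.create_dataset("features/embedding", data=np.zeros((4, 3)))

# Collect each dataset as a column; nested groups get dotted names here.
columns = {}
with h5py.File(path, "r") as f:
    def collect(name, obj):
        if isinstance(obj, h5py.Dataset):
            columns[name.replace("/", ".")] = obj[()]  # first dim = rows
    f.visititems(collect)

print(sorted(columns))  # ['features.embedding', 'ids'], 4 rows each
```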

Other improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.0.0...4.1.0
