feat: use content defined chunking by @kszucs in https://github.com/huggingface/datasets/pull/7589
Parquet datasets are now Optimized Parquet!
<img width="462" height="103" alt="image" src="https://github.com/user-attachments/assets/43703a47-0964-421b-8f01-1a790305de79" />
Parquet files are now written with use_content_defined_chunking=True internally.
This enables fast deduplicated uploads to Hugging Face!
# Now faster thanks to content defined chunking
ds.push_to_hub("username/dataset_name")
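For intuition on why content defined chunking helps deduplication, here is a toy sketch in plain Python. It is not the algorithm used by the Parquet writer or by the Hub; the names `cdc_chunks`, `fixed_chunks`, `reused_fraction` and all constants are hypothetical and chosen only for illustration. It also hashes each sliding window from scratch instead of using a true rolling hash, which is slower but easier to read. The idea it shows: when chunk boundaries depend on the bytes themselves, a small insertion only changes the chunks near the edit, so most chunks of the new file already exist in the old one.

```python
import hashlib
import random

WINDOW = 16          # bytes of local context that decide a boundary
MIN_SIZE = 16        # minimum chunk length; must be >= WINDOW
MASK = (1 << 6) - 1  # cut when the low 6 hash bits are zero (about 1 in 64)


def cdc_chunks(data: bytes) -> list:
    """Split data at positions chosen by the content itself (toy version)."""
    chunks, start = [], 0
    for end in range(1, len(data) + 1):
        if end - start < MIN_SIZE:
            continue  # too close to the previous cut
        window = data[end - WINDOW:end]
        digest = hashlib.blake2b(window, digest_size=4).digest()
        if int.from_bytes(digest, "big") & MASK == 0:
            chunks.append(data[start:end])
            start = end
    if start < len(data):
        chunks.append(data[start:])  # trailing remainder
    return chunks


def fixed_chunks(data: bytes, size: int = 80) -> list:
    """Baseline: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def reused_fraction(old_chunks, new_chunks) -> float:
    """Fraction of new chunks that already exist among the old ones."""
    known = set(old_chunks)
    return sum(c in known for c in new_chunks) / len(new_chunks)


# Hypothetical data: random bytes, then the same bytes with a small insertion.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20_000))
edited = original[:5_000] + b"INSERTED" + original[5_000:]

cdc_reuse = reused_fraction(cdc_chunks(original), cdc_chunks(edited))
fixed_reuse = reused_fraction(fixed_chunks(original), fixed_chunks(edited))
print(f"content-defined: {cdc_reuse:.0%} reused, fixed-size: {fixed_reuse:.0%} reused")
```

With fixed-size chunks, every chunk after the insertion point is shifted and no longer matches, so only the chunks before the edit are reused. With content-defined boundaries the chunking realigns shortly after the edit, which is what lets an upload skip data that is already stored.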
write_page_index=True is also used to enable fast random access for the Dataset Viewer and tools that need it.
Concurrent push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7708
Concurrent IterableDataset push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7710
HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7690
ds = load_dataset("username/dataset-with-hdf5-files")
train_test_split by @qgallouedec in https://github.com/huggingface/datasets/pull/7736
Full Changelog: https://github.com/huggingface/datasets/compare/4.0.0...4.1.0
Fetched April 7, 2026