v4.1.0
Dataset Features
- feat: use content defined chunking by @kszucs in https://github.com/huggingface/datasets/pull/7589
  - Parquet datasets are now Optimized Parquet!
    <img width="462" height="103" alt="image" src="https://github.com/user-attachments/assets/43703a47-0964-421b-8f01-1a790305de79" />
  - internally uses use_content_defined_chunking=True when writing Parquet files
  - this enables fast deduped uploads to Hugging Face!

        # Now faster thanks to content defined chunking
        ds.push_to_hub("username/dataset_name")

  - this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. Data that already exists somewhere on HF (in another file or version, for example) does not have to be uploaded again. Parquet content defined chunking sets Parquet page boundaries based on the content of the data, which makes duplicate data easy to detect.
  - with this change, the new default row group size for Parquet is set to 100MB
  - write_page_index=True is also used to enable fast random access for the Dataset Viewer and tools that need it
- Concurrent push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7708
- Concurrent IterableDataset push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7710
- HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7690
  - load HDF5 datasets in one line of code

        ds = load_dataset("username/dataset-with-hdf5-files")

  - each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows
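To illustrate why content defined chunking helps a dedupe-based storage backend, here is a minimal byte-level sketch. It is not the Parquet or Xet implementation: Parquet applies the idea to page boundaries, while this toy splits a raw byte string. The gear-hash table, the boundary mask, and the helper names (`cdc_chunks`, `fixed_chunks`, `reused_fraction`) are all hypothetical choices made for illustration.

```python
import hashlib
import random

# Hypothetical parameters for illustration only; not the values used by Parquet or Xet.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]
BOUNDARY_MASK = (1 << 8) - 1  # a boundary roughly every 256 bytes on average


def cdc_chunks(data: bytes) -> list[bytes]:
    """Cut a chunk wherever a rolling hash of the most recent bytes hits a fixed pattern."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Each step shifts older contributions left, so the low 8 bits tested
        # below depend only on the last 8 bytes, not on the absolute position.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        if h & BOUNDARY_MASK == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 256) -> list[bytes]:
    """Baseline: cut a chunk every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def reused_fraction(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of the new version's bytes that sit in chunks already stored for the old one."""
    stored = {hashlib.sha256(c).digest() for c in old}
    reused = sum(len(c) for c in new if hashlib.sha256(c).digest() in stored)
    return reused / sum(len(c) for c in new)


original = random.Random(1).randbytes(50_000)
edited = b"a few new bytes" + original  # prepending shifts every later byte

fixed_reuse = reused_fraction(fixed_chunks(original), fixed_chunks(edited))
cdc_reuse = reused_fraction(cdc_chunks(original), cdc_chunks(edited))
print(f"fixed-size reuse: {fixed_reuse:.0%}, content-defined reuse: {cdc_reuse:.0%}")
```

With fixed-size chunks, a small insertion shifts every later boundary, so almost nothing matches what is already stored. With content-defined boundaries, the chunks realign shortly after the edit and most of the data can be skipped on upload.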
Other improvements and bug fixes
- Convert to string when needed + faster .zstd by @lhoestq in https://github.com/huggingface/datasets/pull/7683
- fix audio cast storage from array + sampling_rate by @lhoestq in https://github.com/huggingface/datasets/pull/7684
- Fix misleading add_column() usage example in docstring by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7648
- Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in https://github.com/huggingface/datasets/pull/7438
- Update fsspec max version to current release 2025.7.0 by @rootAvish in https://github.com/huggingface/datasets/pull/7701
- Update dataset_dict push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7711
- Retry intermediate commits too by @lhoestq in https://github.com/huggingface/datasets/pull/7712
- num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in https://github.com/huggingface/datasets/pull/7702
- Update cli.mdx to refer to the new "hf" CLI by @evalstate in https://github.com/huggingface/datasets/pull/7713
- fix num_proc=1 ci test by @lhoestq in https://github.com/huggingface/datasets/pull/7714
- Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in https://github.com/huggingface/datasets/pull/7715
- typo by @lhoestq in https://github.com/huggingface/datasets/pull/7716
- fix largelist repr by @lhoestq in https://github.com/huggingface/datasets/pull/7735
- Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in https://github.com/huggingface/datasets/pull/7730
- Fix type hint train_test_split by @qgallouedec in https://github.com/huggingface/datasets/pull/7736
- fix(webdataset): don't .lower() field_name by @YassineYousfi in https://github.com/huggingface/datasets/pull/7726
- Refactor HDF5 and preserve tree structure by @klamike in https://github.com/huggingface/datasets/pull/7743
- docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in https://github.com/huggingface/datasets/pull/7737
- Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in https://github.com/huggingface/datasets/pull/7761
- Support pathlib.Path for feature input by @Joshua-Chin in https://github.com/huggingface/datasets/pull/7755
- add support for pyarrow string view in features by @onursatici in https://github.com/huggingface/datasets/pull/7718
- Fix typo in error message for cache directory deletion by @brchristian in https://github.com/huggingface/datasets/pull/7749
- update torchcodec in ci by @lhoestq in https://github.com/huggingface/datasets/pull/7764
- Bump dill to 0.4.0 by @Bomme in https://github.com/huggingface/datasets/pull/7763
New Contributors
- @DavidRConnell made their first contribution in https://github.com/huggingface/datasets/pull/7438
- @rootAvish made their first contribution in https://github.com/huggingface/datasets/pull/7701
- @tanuj-rai made their first contribution in https://github.com/huggingface/datasets/pull/7702
- @evalstate made their first contribution in https://github.com/huggingface/datasets/pull/7713
- @brchristian made their first contribution in https://github.com/huggingface/datasets/pull/7730
- @klamike made their first contribution in https://github.com/huggingface/datasets/pull/7690
- @YassineYousfi made their first contribution in https://github.com/huggingface/datasets/pull/7726
- @Sanjaykumar030 made their first contribution in https://github.com/huggingface/datasets/pull/7737
- @kszucs made their first contribution in https://github.com/huggingface/datasets/pull/7589
- @Joshua-Chin made their first contribution in https://github.com/huggingface/datasets/pull/7755
- @onursatici made their first contribution in https://github.com/huggingface/datasets/pull/7718
- @Bomme made their first contribution in https://github.com/huggingface/datasets/pull/7763
Full Changelog: https://github.com/huggingface/datasets/compare/4.0.0...4.1.0
Fetched April 7, 2026
