v4.1.0

Dataset Features

feat: use content defined chunking by @kszucs in https://github.com/huggingface/datasets/pull/7589
- Parquet datasets are now Optimized Parquet!
  <img width="462" height="103" alt="image" src="https://github.com/user-attachments/assets/43703a47-0964-421b-8f01-1a790305de79" />
- internally uses use_content_defined_chunking=True when writing Parquet files
- this enables fast, deduplicated uploads to Hugging Face!
```
# Now faster thanks to content defined chunking
ds.push_to_hub("username/dataset_name")
```
- this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. Data that already exists somewhere on HF (in another file or version, for example) does not have to be uploaded again. Content defined chunking sets Parquet page boundaries based on the content of the data, which makes duplicate data easy to detect.
- with this change, the default row group size for Parquet is now 100 MB
- write_page_index=True is also used to enable fast random access for the Dataset Viewer and tools that need it
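The idea behind content defined chunking can be illustrated with a toy sketch that uses only the standard library. This is not the Parquet or Xet implementation: the rolling hash, the `mask` and `min_size` values, and every function name below are assumptions chosen for illustration. It compares how many chunks stay identical after a few bytes are prepended, once with content-defined boundaries and once with fixed-size boundaries.

```python
import hashlib
import random


def cdc_chunks(data: bytes, mask: int = 0x3F, min_size: int = 16) -> list[bytes]:
    """Toy chunker: cut after any byte where the masked rolling hash is zero."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        # Shifting left on every step means the masked low bits depend only
        # on the last few bytes, so boundaries follow the content itself.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        if i + 1 - start >= min_size and (h & mask) == 0:
            chunks.append(data[start:i + 1])
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int = 64) -> list[bytes]:
    """Baseline chunker: cut every `size` bytes regardless of content."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def shared_fraction(old: list[bytes], new: list[bytes]) -> float:
    """Fraction of chunks in `new` whose hash already appears in `old`."""
    old_hashes = {hashlib.sha256(c).digest() for c in old}
    reused = sum(hashlib.sha256(c).digest() in old_hashes for c in new)
    return reused / len(new)


# Hypothetical data: random bytes, then the same bytes with a short prefix added.
rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(4096))
edited = b"new rows: " + original

cdc_reuse = shared_fraction(cdc_chunks(original), cdc_chunks(edited))
fixed_reuse = shared_fraction(fixed_chunks(original), fixed_chunks(edited))
print(f"content-defined: {cdc_reuse:.0%} of chunks reused")
print(f"fixed-size:      {fixed_reuse:.0%} of chunks reused")
```

With fixed-size boundaries, the prefix shifts every later chunk, so none of them match the old ones. With content-defined boundaries, the cut points realign shortly after the edit, so most chunks are byte-identical to ones already stored and would not need to be uploaded again.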
Concurrent push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7708
Concurrent IterableDataset push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7710
HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7690
- load HDF5 datasets in one line of code
```
ds = load_dataset("username/dataset-with-hdf5-files")
```
- each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows

Other improvements and bug fixes

Convert to string when needed + faster .zstd by @lhoestq in https://github.com/huggingface/datasets/pull/7683
fix audio cast storage from array + sampling_rate by @lhoestq in https://github.com/huggingface/datasets/pull/7684
Fix misleading add_column() usage example in docstring by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7648
Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in https://github.com/huggingface/datasets/pull/7438
Update fsspec max version to current release 2025.7.0 by @rootAvish in https://github.com/huggingface/datasets/pull/7701
Update dataset_dict push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7711
Retry intermediate commits too by @lhoestq in https://github.com/huggingface/datasets/pull/7712
num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in https://github.com/huggingface/datasets/pull/7702
Update cli.mdx to refer to the new "hf" CLI by @evalstate in https://github.com/huggingface/datasets/pull/7713
fix num_proc=1 ci test by @lhoestq in https://github.com/huggingface/datasets/pull/7714
Docs: Use Image(mode="F") for PNG/JPEG depth maps by @lhoestq in https://github.com/huggingface/datasets/pull/7715
typo by @lhoestq in https://github.com/huggingface/datasets/pull/7716
fix largelist repr by @lhoestq in https://github.com/huggingface/datasets/pull/7735
Grammar fix: correct "showed" to "shown" in fingerprint.py by @brchristian in https://github.com/huggingface/datasets/pull/7730
Fix type hint train_test_split by @qgallouedec in https://github.com/huggingface/datasets/pull/7736
fix(webdataset): don't .lower() field_name by @YassineYousfi in https://github.com/huggingface/datasets/pull/7726
Refactor HDF5 and preserve tree structure by @klamike in https://github.com/huggingface/datasets/pull/7743
docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in https://github.com/huggingface/datasets/pull/7737
Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in https://github.com/huggingface/datasets/pull/7761
Support pathlib.Path for feature input by @Joshua-Chin in https://github.com/huggingface/datasets/pull/7755
add support for pyarrow string view in features by @onursatici in https://github.com/huggingface/datasets/pull/7718
Fix typo in error message for cache directory deletion by @brchristian in https://github.com/huggingface/datasets/pull/7749
update torchcodec in ci by @lhoestq in https://github.com/huggingface/datasets/pull/7764
Bump dill to 0.4.0 by @Bomme in https://github.com/huggingface/datasets/pull/7763

New Contributors

@DavidRConnell made their first contribution in https://github.com/huggingface/datasets/pull/7438
@rootAvish made their first contribution in https://github.com/huggingface/datasets/pull/7701
@tanuj-rai made their first contribution in https://github.com/huggingface/datasets/pull/7702
@evalstate made their first contribution in https://github.com/huggingface/datasets/pull/7713
@brchristian made their first contribution in https://github.com/huggingface/datasets/pull/7730
@klamike made their first contribution in https://github.com/huggingface/datasets/pull/7690
@YassineYousfi made their first contribution in https://github.com/huggingface/datasets/pull/7726
@Sanjaykumar030 made their first contribution in https://github.com/huggingface/datasets/pull/7737
@kszucs made their first contribution in https://github.com/huggingface/datasets/pull/7589
@Joshua-Chin made their first contribution in https://github.com/huggingface/datasets/pull/7755
@onursatici made their first contribution in https://github.com/huggingface/datasets/pull/7718
@Bomme made their first contribution in https://github.com/huggingface/datasets/pull/7763

Full Changelog: https://github.com/huggingface/datasets/compare/4.0.0...4.1.0
