Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in https://github.com/huggingface/datasets/pull/8064
```python
from datasets import load_dataset

# load raw data from a Storage Bucket on HF
ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
# or manually, using hf:// paths
ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])

# process and filter
ds = ds.map(...).filter(...)

# publish the AI-ready dataset
ds.push_to_hub("username/my-dataset-ready-for-training")
```
This also fixes multiprocessed `push_to_hub` on macOS, which was causing segfaults (it now uses `spawn` instead of `fork`).
It also bumps the `dill` and `multiprocess` versions to support Python 3.14.
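The `spawn`-vs-`fork` distinction is standard Python `multiprocessing`; a minimal stdlib sketch (not `datasets` code) of opting into `spawn`, which starts fresh interpreter processes instead of forking:

```python
import multiprocessing as mp

def square(x):
    # must be importable at module top level so spawned workers can find it
    return x * x

if __name__ == "__main__":
    # "spawn" avoids the fork-related crashes seen on macOS, where forking a
    # process that already has threads (or uses system frameworks) is unsafe
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

The trade-off is that `spawn` workers re-import the main module, so worker functions must be picklable and defined at module scope.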
Streaming and `IterableDataset` improvements and fixes by @Michael-RDev in https://github.com/huggingface/datasets/pull/8068
- Added `max_shard_size` to `IterableDataset.push_to_hub` (but this requires iterating over the full dataset twice - improvements are welcome)
- Streaming from archives via chained URLs, e.g. `zip://*.jsonl::hf://datasets/username/dataset-name/data.zip`

Full Changelog: https://github.com/huggingface/datasets/compare/4.7.0...4.8.0
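The chained-URL syntax (`member::archive-location`) comes from `fsspec`, which `datasets` uses for file access. A minimal local sketch, assuming `fsspec` is installed, with a `file://` archive standing in for the `hf://` remote:

```python
import os
import tempfile
import zipfile

import fsspec  # assumption: available, as datasets depends on it

# build a small local zip with one JSONL member
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "data.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("part1.jsonl", '{"a": 1}\n')

# chained URL: the part before "::" selects a member inside the archive,
# the part after "::" says where the archive itself lives (file://, hf://, ...)
with fsspec.open(f"zip://part1.jsonl::file://{zip_path}", "rt") as f:
    content = f.read()
print(content)
```

Swapping the `file://` segment for an `hf://datasets/...` path streams the archive member straight from the Hub instead of a local disk.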