Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in https://github.com/huggingface/datasets/pull/8064
```python
from datasets import load_dataset

# load raw data from a Storage Bucket on HF
ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
# or manually, using hf:// paths
ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])

# process and filter
ds = ds.map(...).filter(...)

# publish the AI-ready dataset
ds.push_to_hub("username/my-dataset-ready-for-training")
```
This also fixes multiprocessed `push_to_hub` on macOS, which was causing segfaults (it now uses `spawn` instead of `fork`).
It also bumps the `dill` and `multiprocess` versions to support Python 3.14.
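The `spawn`-vs-`fork` distinction is standard Python `multiprocessing`; a minimal stdlib sketch (not `datasets` code) of opting into `spawn`, which starts fresh interpreter processes instead of forking:

```python
import multiprocessing as mp

def square(x):
    # must be importable at module top level so spawned workers can find it
    return x * x

if __name__ == "__main__":
    # "spawn" avoids the fork-related crashes seen on macOS, where forking a
    # process that already has threads (or uses system frameworks) is unsafe
    ctx = mp.get_context("spawn")
    with ctx.Pool(2) as pool:
        print(pool.map(square, [1, 2, 3]))  # [1, 4, 9]
```

The trade-off is that `spawn` workers re-import the main module, so worker functions must be picklable and defined at module scope.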
Streaming and `IterableDataset` improvements and fixes by @Michael-RDev in https://github.com/huggingface/datasets/pull/8068
- Added `max_shard_size` to `IterableDataset.push_to_hub` (but this requires iterating over the full dataset twice - improvements are welcome)
- Streaming from archives via chained URLs, e.g. `zip://*.jsonl::hf://datasets/username/dataset-name/data.zip`

Full Changelog: https://github.com/huggingface/datasets/compare/4.7.0...4.8.0
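The chained-URL syntax (`member::archive-location`) comes from `fsspec`, which `datasets` uses for file access. A minimal local sketch, assuming `fsspec` is installed, with a `file://` archive standing in for the `hf://` remote:

```python
import os
import tempfile
import zipfile

import fsspec  # assumption: available, as datasets depends on it

# build a small local zip with one JSONL member
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, "data.zip")
with zipfile.ZipFile(zip_path, "w") as zf:
    zf.writestr("part1.jsonl", '{"a": 1}\n')

# chained URL: the part before "::" selects a member inside the archive,
# the part after "::" says where the archive itself lives (file://, hf://, ...)
with fsspec.open(f"zip://part1.jsonl::file://{zip_path}", "rt") as f:
    content = f.read()
print(content)
```

Swapping the `file://` segment for an `hf://datasets/...` path streams the archive member straight from the Hub instead of a local disk.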