4.8.0

Dataset Features

  • Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in https://github.com/huggingface/datasets/pull/8064

    from datasets import load_dataset
    # load raw data from a Storage Bucket on HF
    ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
    # or manually, using hf:// paths
    ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])
    # process, filter
    ds = ds.map(...).filter(...)
    # publish the AI-ready dataset
    ds.push_to_hub("username/my-dataset-ready-for-training")
    

    This also fixes a segfault in multiprocessed push_to_hub on macOS (it now uses spawn instead of fork), and bumps the dill and multiprocess versions to support Python 3.14.
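To illustrate the spawn-versus-fork distinction mentioned above, here is a minimal sketch using the standard library's multiprocessing module. It is not the code used inside datasets (which relies on the multiprocess package); the names `square`, `spawn_context`, and `run_with_spawn` are hypothetical and exist only for this example.

```python
import multiprocessing as mp


def square(x: int) -> int:
    # Defined at module level so that spawned workers can import it by name.
    return x * x


def spawn_context():
    # Request the "spawn" start method explicitly instead of relying on
    # the platform default. Spawn starts a fresh interpreter per worker,
    # whereas fork copies the parent process, including its state.
    return mp.get_context("spawn")


def run_with_spawn(values):
    # Run `square` over the inputs in a small pool of spawned workers.
    ctx = spawn_context()
    with ctx.Pool(processes=2) as pool:
        return pool.map(square, values)
```

When running this as a script, call `run_with_spawn([1, 2, 3])` from under an `if __name__ == "__main__":` guard, because spawned workers re-import the main module.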

  • Datasets streaming iterable packaged improvements and fixes by @Michael-RDev in https://github.com/huggingface/datasets/pull/8068

    • added max_shard_size to IterableDataset.push_to_hub (this requires iterating over the dataset twice, since the first pass is needed to know the full dataset size; improvements are welcome)
    • more arrow-native iterable operations for IterableDataset
    • better support for glob patterns in archives, e.g. zip://*.jsonl::hf://datasets/username/dataset-name/data.zip
    • fixes for to_pandas, videofolder, load_dataset_builder kwargs
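The archive glob example above uses a chained URL, where `::` separates layers: the left layer is a glob applied inside the archive, and the right layer locates the archive itself. The sketch below only shows how such a string decomposes. `split_chained_url` is a hypothetical helper written for this illustration, not a function from datasets or fsspec.

```python
def split_chained_url(url: str) -> list[tuple[str, str]]:
    """Split a chained URL into (protocol, path) layers, outermost first."""
    layers = []
    for part in url.split("::"):
        protocol, sep, path = part.partition("://")
        if not sep:
            # No explicit protocol in this layer: treat it as a bare path.
            protocol, path = "", part
        layers.append((protocol, path))
    return layers


layers = split_chained_url(
    "zip://*.jsonl::hf://datasets/username/dataset-name/data.zip"
)
for protocol, path in layers:
    # First layer: the glob inside the zip; second layer: the zip on the Hub.
    print(f"{protocol}: {path}")
```

Read from left to right, the example matches every .jsonl file inside data.zip, which itself lives in the dataset repository addressed by the hf:// path.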

Full Changelog: https://github.com/huggingface/datasets/compare/4.7.0...4.8.0

Fetched April 7, 2026