2.13.0

Dataset Features

  • Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770

    • Stream the data from your Spark DataFrame directly to your training pipeline:

    from datasets import IterableDataset
    from torch.utils.data import DataLoader
    
    # df is an existing pyspark.sql.DataFrame
    ids = IterableDataset.from_spark(df)
    # map(...) and filter(...) stand for your own preprocessing functions
    ids = ids.map(...).filter(...).with_format("torch")
    for batch in DataLoader(ids, batch_size=16, num_workers=4):
        ...
    
  • IterableDataset formatting for PyTorch, TensorFlow, JAX, NumPy and Arrow:

    from datasets import load_dataset
    
    ids = load_dataset("c4", "en", split="train", streaming=True)
    ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
    
  • Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893

    from datasets import IterableDataset
    
    ids = IterableDataset.from_file("path/to/data.arrow")
    
  • Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944

    from datasets import load_dataset
    
    ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
    

Experimental

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...2.13.0

Fetched April 7, 2026