2.13.0 — Datasets — releases.sh

Dataset Features

Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770

Stream the data from your Spark DataFrame directly to your training pipeline

```
from datasets import IterableDataset
from torch.utils.data import DataLoader

ids = IterableDataset.from_spark(df)
ids = ids.map(...).filter(...).with_format("torch")
for batch in DataLoader(ids, batch_size=16, num_workers=4):
    ...
```

IterableDataset formatting for PyTorch, TensorFlow, JAX, NumPy and Arrow:
- IterableDataset Arrow formatting by @lhoestq in https://github.com/huggingface/datasets/pull/5821
- Iterable torch formatting by @lhoestq in https://github.com/huggingface/datasets/pull/5852
```
from datasets import load_dataset

ids = load_dataset("c4", "en", split="train", streaming=True)
ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
```
Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893
```
from datasets import IterableDataset

ids = IterableDataset.from_file("path/to/data.arrow")
```
Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944
```
from datasets import load_dataset

ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
```

Experimental

Add parallel module using joblib for Spark by @es94129 in https://github.com/huggingface/datasets/pull/5924

General improvements and bug fixes

Preserve stopping_strategy of shuffled interleaved dataset (random cycling case) by @mariosasko in https://github.com/huggingface/datasets/pull/5816
Fix incomplete docstring for BuilderConfig by @Laurent2916 in https://github.com/huggingface/datasets/pull/5824
[docs] Custom decoding transforms by @stevhliu in https://github.com/huggingface/datasets/pull/5836
Add accelerate as metric's test dependency to fix CI error by @mariosasko in https://github.com/huggingface/datasets/pull/5848
Add date_format param to the CSV reader by @mariosasko in https://github.com/huggingface/datasets/pull/5845
[docs] Redirects, migrated from nginx by @julien-c in https://github.com/huggingface/datasets/pull/5853
Fix infer module for uppercase extensions by @albertvillanova in https://github.com/huggingface/datasets/pull/5872
Minor tqdm optim by @lhoestq in https://github.com/huggingface/datasets/pull/5860
Always set nullable fields in the writer by @lhoestq in https://github.com/huggingface/datasets/pull/5835
Add fn_kwargs to map and filter of IterableDataset and IterableDatasetDict by @yuukicammy in https://github.com/huggingface/datasets/pull/5810
Better error message when combining dataset dicts instead of datasets by @lhoestq in https://github.com/huggingface/datasets/pull/5861
Force overwrite existing filesystem protocol by @baskrahmer in https://github.com/huggingface/datasets/pull/5894
Support working_dir in from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5826
Raise TypeError when indexing a dataset with bool by @albertvillanova in https://github.com/huggingface/datasets/pull/5859
Fix minor typo in docs loading.mdx by @albertvillanova in https://github.com/huggingface/datasets/pull/5900
Fix FixedSizeListArray casting by @mariosasko in https://github.com/huggingface/datasets/pull/5897
Unpin responses by @mariosasko in https://github.com/huggingface/datasets/pull/5916
Validate name parameter in make_file_instructions by @albertvillanova in https://github.com/huggingface/datasets/pull/5904
Raise error in DatasetBuilder.as_dataset when file_format is not "arrow" by @mariosasko in https://github.com/huggingface/datasets/pull/5915
Refactor extensions by @albertvillanova in https://github.com/huggingface/datasets/pull/5917
Use more efficient and idiomatic way to construct list. by @ttsugriy in https://github.com/huggingface/datasets/pull/5909
Add flatten_indices to DatasetDict by @maximxlss in https://github.com/huggingface/datasets/pull/5907
Optimize IterableDataset.from_file using ArrowExamplesIterable by @lhoestq in https://github.com/huggingface/datasets/pull/5920
Make prepare_split more robust if errors in metadata dataset_info splits by @albertvillanova in https://github.com/huggingface/datasets/pull/5901
Fix streaming parquet with image feature in schema by @lhoestq in https://github.com/huggingface/datasets/pull/5921
canonicalize data dir in config ID hash by @kylrth in https://github.com/huggingface/datasets/pull/5899
Fix link to quickstart docs in README.md by @mariosasko in https://github.com/huggingface/datasets/pull/5928
Fix string-encoding, make batch_size optional, and minor improvements in Dataset.to_tf_dataset by @alvarobartt in https://github.com/huggingface/datasets/pull/5883
Use a new low-memory approach for tf dataset index shuffling by @Rocketknight1 in https://github.com/huggingface/datasets/pull/5863
[doc build] Use secrets by @mishig25 in https://github.com/huggingface/datasets/pull/5932
Fix to_numpy when None values in the sequence by @qgallouedec in https://github.com/huggingface/datasets/pull/5933
Better row group size in push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/5935
Avoid parallel redownload in cache by @albertvillanova in https://github.com/huggingface/datasets/pull/5937
Better filenotfound for gated by @lhoestq in https://github.com/huggingface/datasets/pull/5954
Make get_from_cache use custom temp filename that is locked by @albertvillanova in https://github.com/huggingface/datasets/pull/5938
Fix ArrowExamplesIterable.shard_data_sources by @lhoestq in https://github.com/huggingface/datasets/pull/5956
Add Arrow builder docs by @lhoestq in https://github.com/huggingface/datasets/pull/5952
Fix sequence of array support for most dtype by @qgallouedec in https://github.com/huggingface/datasets/pull/5948

New Contributors

@Laurent2916 made their first contribution in https://github.com/huggingface/datasets/pull/5824
@yuukicammy made their first contribution in https://github.com/huggingface/datasets/pull/5810
@baskrahmer made their first contribution in https://github.com/huggingface/datasets/pull/5894
@ttsugriy made their first contribution in https://github.com/huggingface/datasets/pull/5909
@maximxlss made their first contribution in https://github.com/huggingface/datasets/pull/5907
@mariusz-jachimowicz-83 made their first contribution in https://github.com/huggingface/datasets/pull/5893
@kylrth made their first contribution in https://github.com/huggingface/datasets/pull/5899
@qgallouedec made their first contribution in https://github.com/huggingface/datasets/pull/5933
@es94129 made their first contribution in https://github.com/huggingface/datasets/pull/5924

Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...2.13.0