2.12.0 — Datasets — releases.sh

Datasets Features

Add Dataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5701
- Get a Dataset from a Spark DataFrame:
```
>>> from datasets import Dataset
>>> # df is an existing pyspark.sql.DataFrame
>>> ds = Dataset.from_spark(df)
```

Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in https://github.com/huggingface/datasets/pull/5689

Stream data from Wikipedia:

```
>>> from datasets import load_dataset
>>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
>>> next(iter(ds["train"]))
{'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}
```

Implement sharding on merged iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5735

Use interleaved datasets in a distributed setup or with a DataLoader:

```
>>> from datasets import load_dataset, interleave_datasets
>>> from torch.utils.data import DataLoader
>>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
>>> c4 = load_dataset("c4", "en", split="train", streaming=True)
>>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
>>> dataloader = DataLoader(merged, num_workers=4)
```
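For readers unfamiliar with `probabilities` and `stopping_strategy="all_exhausted"`, here is a simplified pure-Python sketch of the sampling idea. It is not the library's implementation: `interleave_all_exhausted` is a hypothetical helper, it works on plain lists instead of streaming datasets, and it ignores sharding entirely. It assumes every source is non-empty.

```python
import random


def interleave_all_exhausted(sources, probabilities, seed):
    """Yield items by picking a source at random (weighted) at each step.

    Hypothetical illustration only. A source that runs out is restarted
    (oversampling); iteration stops once every source has run out at
    least once. Assumes every source is non-empty.
    """
    rng = random.Random(seed)  # fixed seed gives a reproducible order
    iterators = [iter(source) for source in sources]
    exhausted = [False] * len(sources)
    while True:
        # Weighted choice of which source to draw from next.
        index = rng.choices(range(len(sources)), weights=probabilities)[0]
        try:
            item = next(iterators[index])
        except StopIteration:
            exhausted[index] = True
            if all(exhausted):
                return
            # Restart the finished source so the others can still catch up.
            iterators[index] = iter(sources[index])
            item = next(iterators[index])
        yield item


small = ["a1", "a2"]
large = ["b1", "b2", "b3"]
mixed = list(interleave_all_exhausted([small, large], [0.1, 0.9], seed=42))
```

With this strategy every item of every source appears at least once in `mixed`, and items from the more frequently sampled source are typically repeated.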

Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in https://github.com/huggingface/datasets/pull/5751
- Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
- Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
- Allow converting the variable-shaped ArrayND to Pandas
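The NumPy rule in the second bullet can be illustrated with plain NumPy. This is a sketch of the described behavior, not the code from the PR: `to_numpy_sketch` is a hypothetical helper, and it reads "the offsets are equal" as "every row has the same shape", which is an assumption.

```python
import numpy as np


def to_numpy_sketch(rows):
    """Hypothetical helper mirroring the conversion rule described above.

    Rows with identical shapes are stacked into one numeric array;
    rows with differing shapes are stored in a 1-D object array.
    """
    arrays = [np.asarray(row) for row in rows]
    shapes = {array.shape for array in arrays}
    if len(shapes) == 1:
        # All rows share a shape: a regular numeric array is possible.
        return np.stack(arrays)
    # Shapes differ (or there are no rows): fall back to an object array.
    result = np.empty(len(arrays), dtype=object)
    for position, array in enumerate(arrays):
        result[position] = array
    return result


fixed = to_numpy_sketch([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])
ragged = to_numpy_sketch([[[1, 2]], [[3, 4], [5, 6]]])
```

Here `fixed` is a single numeric array, while `ragged` is an object array whose elements are the individual per-row arrays.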

General improvements and bug fixes

Fix a description error for interleave_datasets. by @QizhiPei in https://github.com/huggingface/datasets/pull/5680
[docs] Split pattern search order by @stevhliu in https://github.com/huggingface/datasets/pull/5693
Raise an error on missing distributed seed by @lhoestq in https://github.com/huggingface/datasets/pull/5697
Fix xnumpy_load for .npz files by @albertvillanova in https://github.com/huggingface/datasets/pull/5714
Temporarily pin fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/5731
Unpin fsspec by @albertvillanova in https://github.com/huggingface/datasets/pull/5733
Fix CI warnings by @albertvillanova in https://github.com/huggingface/datasets/pull/5741
Fix CI mock filesystem fixtures by @albertvillanova in https://github.com/huggingface/datasets/pull/5740
Fix link in docs by @bbbxyz in https://github.com/huggingface/datasets/pull/5746
fix typo: "mow" -> "now" by @csris in https://github.com/huggingface/datasets/pull/5763
[docs] Compress data files by @stevhliu in https://github.com/huggingface/datasets/pull/5691
Fix style by @lhoestq in https://github.com/huggingface/datasets/pull/5774
Minor tqdm fixes by @mariosasko in https://github.com/huggingface/datasets/pull/5754
Fixes #5757 by @eli-osherovich in https://github.com/huggingface/datasets/pull/5758
Fix JSON builder when missing keys in first row by @albertvillanova in https://github.com/huggingface/datasets/pull/5772
Warning specifying future change in to_tf_dataset behaviour by @amyeroberts in https://github.com/huggingface/datasets/pull/5742
Prepare tests for hfh 0.14 by @Wauplin in https://github.com/huggingface/datasets/pull/5788
Call fs.makedirs in save_to_disk by @lhoestq in https://github.com/huggingface/datasets/pull/5779
Allow to run CI on push to ci-branch by @albertvillanova in https://github.com/huggingface/datasets/pull/5790
Fix nondeterministic sharded data split order by @albertvillanova in https://github.com/huggingface/datasets/pull/5729
Raise subprocesses traceback when interrupting by @lhoestq in https://github.com/huggingface/datasets/pull/5784
Fix spark imports by @lhoestq in https://github.com/huggingface/datasets/pull/5795
Change downloaded file permission based on umask by @albertvillanova in https://github.com/huggingface/datasets/pull/5800
Fix inferring module for unsupported data files by @albertvillanova in https://github.com/huggingface/datasets/pull/5787
Reorder default data splits to have validation before test by @albertvillanova in https://github.com/huggingface/datasets/pull/5718
Validate non-empty data_files by @albertvillanova in https://github.com/huggingface/datasets/pull/5802
Spark docs by @lhoestq in https://github.com/huggingface/datasets/pull/5796
Release: 2.12.0 by @lhoestq in https://github.com/huggingface/datasets/pull/5803

New Contributors

@QizhiPei made their first contribution in https://github.com/huggingface/datasets/pull/5680
@bbbxyz made their first contribution in https://github.com/huggingface/datasets/pull/5746
@csris made their first contribution in https://github.com/huggingface/datasets/pull/5763
@eli-osherovich made their first contribution in https://github.com/huggingface/datasets/pull/5758
@maddiedawson made their first contribution in https://github.com/huggingface/datasets/pull/5701

Full Changelog: https://github.com/huggingface/datasets/compare/2.11.0...2.12.0