2.9.0 — Datasets — releases.sh

Datasets Features

Parallel implementation of to_tf_dataset() by @Rocketknight1 in https://github.com/huggingface/datasets/pull/5377
- Pass num_workers= to .to_tf_dataset() to load your dataset with multiple worker processes, which can speed up iteration
Distributed support by @lhoestq in https://github.com/huggingface/datasets/pull/5369
- Split your dataset across nodes for distributed training, so that each node works on its own portion
- It supports both Dataset and IterableDataset (e.g. in streaming mode)
- See the documentation for more details
```
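import os
from datasets.distributed import split_dataset_by_node

# `ds` is an existing Dataset or IterableDataset.
# RANK and WORLD_SIZE are read from the environment (launchers such as torchrun set them).
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)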
```
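To give an intuition for what the split does, here is a rough pure-Python sketch of the behaviour the documentation describes. A map-style Dataset gives each node one contiguous chunk. An IterableDataset assigns whole shards to nodes when the number of shards is divisible by world_size, and otherwise has each node keep 1 example out of every world_size. This is not the library's implementation: the function names `split_map_style` and `split_iterable` are hypothetical, plain lists stand in for datasets and shards, and the strided shard assignment is an assumption of the sketch.

```python
from typing import List, Sequence


def split_map_style(examples: Sequence, rank: int, world_size: int) -> List:
    """Give each rank one contiguous chunk; lower ranks absorb the remainder."""
    base, extra = divmod(len(examples), world_size)
    start = rank * base + min(rank, extra)
    end = start + base + (1 if rank < extra else 0)
    return list(examples[start:end])


def split_iterable(shards: Sequence[Sequence], rank: int, world_size: int) -> List:
    """Assign whole shards when they divide evenly, else keep 1 example in world_size."""
    if len(shards) % world_size == 0:
        # Assumption of this sketch: shards are handed out in a strided pattern.
        own_shards = shards[rank::world_size]
        return [example for shard in own_shards for example in shard]
    # Fallback: every rank walks the full stream and keeps only its own examples.
    stream = [example for shard in shards for example in shard]
    return stream[rank::world_size]


if __name__ == "__main__":
    for rank in range(3):
        print(rank, split_map_style(list(range(10)), rank, world_size=3))
```

The fallback path is the less efficient one, since every node still reads all the data and discards most of it. Choosing a shard count that is a multiple of world_size avoids this.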
Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in https://github.com/huggingface/datasets/pull/5400
Tqdm progress bar for to_parquet by @zanussbaum in https://github.com/huggingface/datasets/pull/5456
ZIP files support in iter_archive with better compression type check by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3379
Support other formats than uint8 for image arrays by @vigsterkr in https://github.com/huggingface/datasets/pull/5365

Documentation

Depth estimation dataset guide by @sayakpaul in https://github.com/huggingface/datasets/pull/5379
- see https://huggingface.co/docs/datasets/main/en/depth_estimation
Imagefolder docs: mention support of CSV and ZIP by @lhoestq in https://github.com/huggingface/datasets/pull/5463
- see https://huggingface.co/docs/datasets/main/en/image_load#imagefolder
Update docs of S3 filesystem with async aiobotocore by @maheshpec in https://github.com/huggingface/datasets/pull/5411
- see https://huggingface.co/docs/datasets/main/en/filesystems#amazon-s3

General improvements and bug fixes

Raise error if ClassLabel names is not python list by @freddyheppell in https://github.com/huggingface/datasets/pull/5359
Temporarily pin pydantic test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/5395
Unpin pydantic test dependency by @albertvillanova in https://github.com/huggingface/datasets/pull/5397
Replace one letter import in docs by @MKhalusova in https://github.com/huggingface/datasets/pull/5403
Fix Colab notebook link by @albertvillanova in https://github.com/huggingface/datasets/pull/5392
Fix fs.open resource leaks by @tkukurin in https://github.com/huggingface/datasets/pull/5358
Fix deprecation warning when use_auth_token passed to download_and_prepare by @albertvillanova in https://github.com/huggingface/datasets/pull/5409
Fix streaming pandas.read_excel by @albertvillanova in https://github.com/huggingface/datasets/pull/5372
ci: 🎡 remove two obsolete issue templates by @severo in https://github.com/huggingface/datasets/pull/5420
Handle 0-dim tensors in cast_to_python_objects by @mariosasko in https://github.com/huggingface/datasets/pull/5384
Fix CI by temporarily pinning apache-beam < 2.44.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5429
Fix CI benchmarks by temporarily pinning Docker image version by @albertvillanova in https://github.com/huggingface/datasets/pull/5432
Revert container image pin in CI benchmarks by @0x2b3bfa0 in https://github.com/huggingface/datasets/pull/5436
Finish deprecating the fs argument by @dconathan in https://github.com/huggingface/datasets/pull/5393
Update actions/checkout in CD Conda release by @albertvillanova in https://github.com/huggingface/datasets/pull/5438
Fix RuntimeError: Sharding is ambiguous for this dataset by @albertvillanova in https://github.com/huggingface/datasets/pull/5416
Fix documentation about batch samplers by @thomasw21 in https://github.com/huggingface/datasets/pull/5440
Fix CI by temporarily pinning fsspec < 2023.1.0 by @albertvillanova in https://github.com/huggingface/datasets/pull/5447
Support fsspec 2023.1.0 in CI by @albertvillanova in https://github.com/huggingface/datasets/pull/5449
Update share tutorial by @stevhliu in https://github.com/huggingface/datasets/pull/5443
Swap log messages for symbolic/hard links in tar extractor by @albertvillanova in https://github.com/huggingface/datasets/pull/5452
Fix base directory while extracting insecure TAR files by @albertvillanova in https://github.com/huggingface/datasets/pull/5453
Fix link in load_dataset docstring by @mariosasko in https://github.com/huggingface/datasets/pull/5389
Document that removing all the columns returns an empty document and the num_row is lost by @thomasw21 in https://github.com/huggingface/datasets/pull/5460
Concatenate on axis=1 with misaligned blocks by @lhoestq in https://github.com/huggingface/datasets/pull/5462
Raise from disconnect error in xopen by @lhoestq in https://github.com/huggingface/datasets/pull/5382
remove pathlib.Path with URIs by @jonny-cyberhaven in https://github.com/huggingface/datasets/pull/5466
Remove deprecated shard_size arg from .push_to_hub() by @polinaeterna in https://github.com/huggingface/datasets/pull/5469

New Contributors

@freddyheppell made their first contribution in https://github.com/huggingface/datasets/pull/5359
@MKhalusova made their first contribution in https://github.com/huggingface/datasets/pull/5403
@tkukurin made their first contribution in https://github.com/huggingface/datasets/pull/5358
@0x2b3bfa0 made their first contribution in https://github.com/huggingface/datasets/pull/5436
@maheshpec made their first contribution in https://github.com/huggingface/datasets/pull/5411
@dconathan made their first contribution in https://github.com/huggingface/datasets/pull/5393
@zanussbaum made their first contribution in https://github.com/huggingface/datasets/pull/5456
@jonny-cyberhaven made their first contribution in https://github.com/huggingface/datasets/pull/5466

Full Changelog: https://github.com/huggingface/datasets/compare/2.8.0...2.9.0