2.10.0 — Datasets — releases.sh

Important

Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in https://github.com/huggingface/datasets/pull/5542
- Big speed improvements for .flatten_indices() (x2) and save_to_disk/load_from_disk (x100) on selected/shuffled datasets
Skip dataset verifications by default by @mariosasko in https://github.com/huggingface/datasets/pull/5303
- introduces multiple `verification_mode` values you can pass to `load_dataset()`
- the new default verification steps are much faster (no need to compute expensive checksums)

Datasets features

Single TQDM bar in multi-proc map by @mariosasko in https://github.com/huggingface/datasets/pull/5455
- No more stacked TQDM bars when calling .map() in multiprocessing
Map-style Dataset to IterableDataset by @lhoestq in https://github.com/huggingface/datasets/pull/5410
- introduces .to_iterable_dataset() to get an IterableDataset from a Dataset
- see all the advantages of IterableDataset in the documentation about the differences between Dataset and IterableDataset
Select columns of Dataset or DatasetDict by @daskol in https://github.com/huggingface/datasets/pull/5480
- introduces .select_columns() to return a dataset containing only the requested columns
Added functionality: sort datasets by multiple keys by @MichlF in https://github.com/huggingface/datasets/pull/5502
- introduces ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
Add JAX device selection when formatting by @alvarobartt in https://github.com/huggingface/datasets/pull/5547
- introduces ds = ds.with_format("jax", device=device)
Reload features from Parquet metadata by @MFreidank in https://github.com/huggingface/datasets/pull/5516
Speed up batched PyTorch DataLoader by @lhoestq in https://github.com/huggingface/datasets/pull/5512

Documentation

Add section in tutorial for IterableDataset by @stevhliu in https://github.com/huggingface/datasets/pull/5485
- https://huggingface.co/docs/datasets/main/en/access#iterabledataset
Tutorial for creating a dataset by @stevhliu in https://github.com/huggingface/datasets/pull/5540
- https://huggingface.co/docs/datasets/main/en/create_dataset
Add JAX-formatting documentation by @alvarobartt in https://github.com/huggingface/datasets/pull/5535
- https://huggingface.co/docs/datasets/main/en/use_with_jax

General improvements and bug fixes

Pin sqlalchemy by @lhoestq in https://github.com/huggingface/datasets/pull/5476
Update dataset card creation by @stevhliu in https://github.com/huggingface/datasets/pull/5470
Add num_test_batches option by @amyeroberts in https://github.com/huggingface/datasets/pull/5471
Tip for recomputing metadata by @stevhliu in https://github.com/huggingface/datasets/pull/5478
Disable aiohttp requoting of redirection URL by @albertvillanova in https://github.com/huggingface/datasets/pull/5459
[MINOR] Typo by @cakiki in https://github.com/huggingface/datasets/pull/5491
Pin dill lower version by @albertvillanova in https://github.com/huggingface/datasets/pull/5489
Improved error message for gated/private repos by @osanseviero in https://github.com/huggingface/datasets/pull/5497
Update docs for nyu_depth_v2 dataset by @awsaf49 in https://github.com/huggingface/datasets/pull/5484
don't zero copy timestamps by @dwyatte in https://github.com/huggingface/datasets/pull/5504
Remove unused load_from_cache_file arg from Dataset.shard() docstring by @polinaeterna in https://github.com/huggingface/datasets/pull/5493
Do not add index column by default when exporting to CSV by @albertvillanova in https://github.com/huggingface/datasets/pull/5490
Fix bug when casting empty array to class labels by @marioga in https://github.com/huggingface/datasets/pull/5521
Fix benchmarks CI - pin protobuf by @lhoestq in https://github.com/huggingface/datasets/pull/5527
Remove py.typed by @mariosasko in https://github.com/huggingface/datasets/pull/5518
Add missing license in NumpyFormatter by @alvarobartt in https://github.com/huggingface/datasets/pull/5530
Unify load_from_cache_file type and logic by @HallerPatrick in https://github.com/huggingface/datasets/pull/5515
Format code with ruff by @mariosasko in https://github.com/huggingface/datasets/pull/5519
Minor changes in JAX-formatting docstrings & type-hints by @alvarobartt in https://github.com/huggingface/datasets/pull/5522
Resolve four broken refs in the docs by @tomaarsen in https://github.com/huggingface/datasets/pull/5550
Use default audio resampling type by @lhoestq in https://github.com/huggingface/datasets/pull/5556
- resampy is no longer needed to resample audio data
improved message error row formatting by @Plutone11011 in https://github.com/huggingface/datasets/pull/5553
Make tiktoken tokenizers hashable by @mariosasko in https://github.com/huggingface/datasets/pull/5552
Suggest scikit-learn instead of sklearn by @osbm in https://github.com/huggingface/datasets/pull/5551
Add filter desc by @lhoestq in https://github.com/huggingface/datasets/pull/5557
Fix map suffix_template by @lhoestq in https://github.com/huggingface/datasets/pull/5559
Ensure last tqdm update in map by @mariosasko in https://github.com/huggingface/datasets/pull/5560

New Contributors

@amyeroberts made their first contribution in https://github.com/huggingface/datasets/pull/5471
@awsaf49 made their first contribution in https://github.com/huggingface/datasets/pull/5484
@dwyatte made their first contribution in https://github.com/huggingface/datasets/pull/5504
@marioga made their first contribution in https://github.com/huggingface/datasets/pull/5521
@MFreidank made their first contribution in https://github.com/huggingface/datasets/pull/5516
@daskol made their first contribution in https://github.com/huggingface/datasets/pull/5480
@Plutone11011 made their first contribution in https://github.com/huggingface/datasets/pull/5553
@osbm made their first contribution in https://github.com/huggingface/datasets/pull/5551
@MichlF made their first contribution in https://github.com/huggingface/datasets/pull/5502

Full Changelog: https://github.com/huggingface/datasets/compare/2.9.0...ef