- dl_manager.iter_files when they are given as input by @mariosasko in https://github.com/huggingface/datasets/pull/6230
- audio.py by @mariosasko in https://github.com/huggingface/datasets/pull/6241
- apache_beam import in BeamBasedBuilder._save_info by @mariosasko in https://github.com/huggingface/datasets/pull/6265
- tensorflow maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6301
- jax maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6300
- push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/6269
- fsspec version to the datasets-cli env command output by @mariosasko in https://github.com/huggingface/datasets/pull/6356
- Dataset.map docstring by @bryant1410 in https://github.com/huggingface/datasets/pull/6373
- Image by @mariosasko in https://github.com/huggingface/datasets/pull/6379

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.7...2.15.0
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.6...2.14.7
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.5...2.14.6
- iter_files for hidden files by @mariosasko in https://github.com/huggingface/datasets/pull/6092
- columns by @mariosasko in https://github.com/huggingface/datasets/pull/6160
- datasets_info.json but no README by @clefourrier in https://github.com/huggingface/datasets/pull/6164
- revision argument by @qgallouedec in https://github.com/huggingface/datasets/pull/6191
- Dataset.export by @mariosasko in https://github.com/huggingface/datasets/pull/6081
- download_custom by @mariosasko in https://github.com/huggingface/datasets/pull/6093
- select_columns to guide by @unifyh in https://github.com/huggingface/datasets/pull/6119
- to_iterable_dataset by @stevhliu in https://github.com/huggingface/datasets/pull/6158
- image_load doc by @mariosasko in https://github.com/huggingface/datasets/pull/6181
- huggingface/documentation-images by @mariosasko in https://github.com/huggingface/datasets/pull/6177
- hf-internal-testing repos for hosting test dataset repos by @mariosasko in https://github.com/huggingface/datasets/pull/6180

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5
Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.13.2
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.3...2.14.4
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.2...2.14.3
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.1...2.14.2
- Overview.ipynb & detach Jupyter Notebooks from datasets repository by @alvarobartt in https://github.com/huggingface/datasets/pull/5902

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.0...2.14.1
datasets>=2.14.0 may not be reloaded from cache using an older version of datasets (and are therefore re-downloaded).

Support for multiple configs via metadata yaml info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331
---
configs:
- config_name: default
data_files:
- split: train
path: data.csv
- split: test
path: holdout.csv
---
---
configs:
- config_name: main_data
data_files: main_data.csv
- config_name: additional_data
data_files: additional_data.csv
---
push_to_hub() additional dataset configurations:
ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")
Support returning dataframe in map transform by @mariosasko in https://github.com/huggingface/datasets/pull/5995
- errors param in favor of encoding_errors in text builder by @mariosasko in https://github.com/huggingface/datasets/pull/5974
- huggingface_hub's RepoCard API by @mariosasko in https://github.com/huggingface/datasets/pull/5949
- joblib to avoid joblibspark test failures by @mariosasko in https://github.com/huggingface/datasets/pull/6000
- column_names type check with type hint in sort by @mariosasko in https://github.com/huggingface/datasets/pull/6001
- use_auth_token in favor of token by @mariosasko in https://github.com/huggingface/datasets/pull/5996
- ClassLabel min max check for None values by @mariosasko in https://github.com/huggingface/datasets/pull/6023
- task_templates in IterableDataset when they are no longer valid by @mariosasko in https://github.com/huggingface/datasets/pull/6027
- HfFileSystem and deprecate S3FileSystem by @mariosasko in https://github.com/huggingface/datasets/pull/6052
- Dataset.from_list docstring by @mariosasko in https://github.com/huggingface/datasets/pull/6062
- features are specified by @mariosasko in https://github.com/huggingface/datasets/pull/6045

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.14.0
- list_datasets by @mariosasko in https://github.com/huggingface/datasets/pull/5964
- encoding and errors params to JSON loader by @mariosasko in https://github.com/huggingface/datasets/pull/5969

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.0...2.13.1
Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770
from datasets import IterableDataset
from torch.utils.data import DataLoader
ids = IterableDataset.from_spark(df)
ids = ids.map(...).filter(...).with_format("torch")
for batch in DataLoader(ids, batch_size=16, num_workers=4):
...
IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:
from datasets import load_dataset
ids = load_dataset("c4", "en", split="train", streaming=True)
ids = ids.map(...).with_format("torch") # to get PyTorch tensors - also works with tf, np, jax etc.
Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893
from datasets import IterableDataset
ids = IterableDataset.from_file("path/to/data.arrow")
Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944
from datasets import load_dataset
ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
- stopping_strategy of shuffled interleaved dataset (random cycling case) by @mariosasko in https://github.com/huggingface/datasets/pull/5816
- BuilderConfig by @Laurent2916 in https://github.com/huggingface/datasets/pull/5824
- accelerate as metric's test dependency to fix CI error by @mariosasko in https://github.com/huggingface/datasets/pull/5848
- date_format param to the CSV reader by @mariosasko in https://github.com/huggingface/datasets/pull/5845
- fn_kwargs to map and filter of IterableDataset and IterableDatasetDict by @yuukicammy in https://github.com/huggingface/datasets/pull/5810
- FixedSizeListArray casting by @mariosasko in https://github.com/huggingface/datasets/pull/5897
- DatasetBuilder.as_dataset when file_format is not "arrow" by @mariosasko in https://github.com/huggingface/datasets/pull/5915
- flatten_indices to DatasetDict by @maximxlss in https://github.com/huggingface/datasets/pull/5907
- batch_size optional, and minor improvements in Dataset.to_tf_dataset by @alvarobartt in https://github.com/huggingface/datasets/pull/5883
- to_numpy when None values in the sequence by @qgallouedec in https://github.com/huggingface/datasets/pull/5933

Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...2.13.0
Add Dataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5701
>>> from datasets import Dataset
>>> ds = Dataset.from_spark(df)
Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in https://github.com/huggingface/datasets/pull/5689
>>> from datasets import load_dataset
>>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
>>> next(iter(ds["train"]))
{'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}
Implement sharding on merged iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5735
>>> from datasets import load_dataset, interleave_datasets
>>> from torch.utils.data import DataLoader
>>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
>>> c4 = load_dataset("c4", "en", split="train", streaming=True)
>>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
>>> dataloader = DataLoader(merged, num_workers=4)
Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in https://github.com/huggingface/datasets/pull/5751
Full Changelog: https://github.com/huggingface/datasets/compare/2.11.0...2.12.0
- batch_size on Dataset.to_dict()
- download_and_prepare() a dataset
- load_dataset()
- from_dict by @mariosasko in https://github.com/huggingface/datasets/pull/5643
- ffmpeg system package installation on Colab by @polinaeterna in https://github.com/huggingface/datasets/pull/5558
- datasets.load_from_disk, DatasetDict.load_from_disk and Dataset.load_from_disk by @alvarobartt in https://github.com/huggingface/datasets/pull/5529
- huggingface_hub version to env cli command by @mariosasko in https://github.com/huggingface/datasets/pull/5578
- save_to_disk by @mariosasko in https://github.com/huggingface/datasets/pull/5588
- sort with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5587
- datasets-cli test by @lhoestq in https://github.com/huggingface/datasets/pull/5603
- verification_mode values by @polinaeterna in https://github.com/huggingface/datasets/pull/5607
- ruff by @polinaeterna in https://github.com/huggingface/datasets/pull/5636
- Features by @mariosasko in https://github.com/huggingface/datasets/pull/5646
- fsspec.open when using an HTTP proxy by @bryant1410 in https://github.com/huggingface/datasets/pull/5656

Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.11.0
- IndexError when doing ds.filter(...).sort(...) or ds.select(...).sort(...)

Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.10.1
- .flatten_indices() (x2) + save/load_from_disk (x100) on selected/shuffled datasets
- verification_mode you can pass to load_dataset()
- .map() in multiprocessing
- .to_iterable_dataset() to get an IterableDataset from a Dataset
- IterableDataset in the documentation about the differences between Dataset and IterableDataset
- select_columns() to return a dataset only containing the requested columns
- ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
- ds = ds.with_format("jax", device=device)
- nyu_depth_v2 dataset by @awsaf49 in https://github.com/huggingface/datasets/pull/5484
- load_from_cache_file arg from Dataset.shard() docstring by @polinaeterna in https://github.com/huggingface/datasets/pull/5493
- NumpyFormatter by @alvarobartt in https://github.com/huggingface/datasets/pull/5530
- load_from_cache_file type and logic by @HallerPatrick in https://github.com/huggingface/datasets/pull/5515
- ruff by @mariosasko in https://github.com/huggingface/datasets/pull/5519

Full Changelog: https://github.com/huggingface/datasets/compare/2.9.0...2.10.0
Parallel implementation of to_tf_dataset() by @Rocketknight1 in https://github.com/huggingface/datasets/pull/5377
num_workers= to .to_tf_dataset() to make your dataset faster with multiprocessing

Distributed support by @lhoestq in https://github.com/huggingface/datasets/pull/5369
Dataset and IterableDataset (e.g. in streaming mode):
import os
from datasets.distributed import split_dataset_by_node
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in https://github.com/huggingface/datasets/pull/5400
Tqdm progress bar for to_parquet by @zanussbaum in https://github.com/huggingface/datasets/pull/5456
ZIP files support in iter_archive with better compression type check by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3379
Support other formats than uint8 for image arrays by @vigsterkr in https://github.com/huggingface/datasets/pull/5365
- fs.open resource leaks by @tkukurin in https://github.com/huggingface/datasets/pull/5358
- cast_to_python_objects by @mariosasko in https://github.com/huggingface/datasets/pull/5384
- load_dataset docstring by @mariosasko in https://github.com/huggingface/datasets/pull/5389
- shard_size arg from .push_to_hub() by @polinaeterna in https://github.com/huggingface/datasets/pull/5469

Full Changelog: https://github.com/huggingface/datasets/compare/2.8.0...2.9.0
datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.

- IterableDataset.map that led to features=None by @alvarobartt in https://github.com/huggingface/datasets/pull/5287
- features after column renaming or removal
- features param to IterableDataset.map by @alvarobartt in https://github.com/huggingface/datasets/pull/5311
- num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
- num_proc to use multiprocessing

from datasets import load_dataset
ds = load_dataset("c4", "en", streaming=True, split="train")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)
- max_shard_size docs by @lhoestq in https://github.com/huggingface/datasets/pull/5267
- from_generator docs by @mariosasko in https://github.com/huggingface/datasets/pull/5307
- wikipedia or natural_questions
- ArrowWriter.finalize before inference error by @mariosasko in https://github.com/huggingface/datasets/pull/5309
- num_proc for dataset download and generation by @mariosasko in https://github.com/huggingface/datasets/pull/5300
- IterableDataset.map param batch_size typing as optional by @alvarobartt in https://github.com/huggingface/datasets/pull/5336
- topdown parameter in xwalk by @mariosasko in https://github.com/huggingface/datasets/pull/5308
- use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in https://github.com/huggingface/datasets/pull/5302
- .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in https://github.com/huggingface/datasets/pull/5322

Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.8.0
Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.6.2
Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.7.1