Full Changelog: https://github.com/huggingface/datasets/compare/3.4.0...3.4.1
Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in https://github.com/huggingface/datasets/pull/7424
Replaced decord with torchvision to read videos, since decord is not maintained anymore and isn't available for recent python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version.

```python
from datasets import load_dataset, Video

dataset = load_dataset("path/to/video/folder", split="train")
dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
```
Support `metadata.parquet` in addition to `metadata.csv` or `metadata.jsonl` for the metadata of the image/audio/video files.
Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450

```python
dataset = dataset.decode(num_threads=num_threads)
```
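The idea behind `num_threads` can be pictured with a plain thread pool. This is a hedged sketch of the concept, not the library's implementation; `decode_one` is a hypothetical stand-in for decoding a single image/audio/video file:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for decoding one media file; here it just squares
# a number so the example is self-contained and runnable.
def decode_one(encoded):
    return encoded * encoded

def decode_batch(batch, num_threads=4):
    # Decode a batch with a thread pool: the same idea num_threads enables,
    # overlapping I/O-bound decoding work across threads.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(decode_one, batch))

print(decode_batch([1, 2, 3]))  # [1, 4, 9]
```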
Add with_split to DatasetDict.map by @jp1924 in https://github.com/huggingface/datasets/pull/7368
`string_to_dict` to return None if there is no match instead of raising ValueError by @ringohoffman in https://github.com/huggingface/datasets/pull/7435
`ds.set_epoch(new_epoch)` by @lhoestq in https://github.com/huggingface/datasets/pull/7451
Full Changelog: https://github.com/huggingface/datasets/compare/3.3.2...3.4.0
Full Changelog: https://github.com/huggingface/datasets/compare/3.3.1...3.3.2
Full Changelog: https://github.com/huggingface/datasets/compare/3.3.0...3.3.1
Support async functions in map() by @lhoestq in https://github.com/huggingface/datasets/pull/7384
```python
prompt = "Answer the following question: {question}. You should think step by step."

async def ask_llm(example):
    return await query_model(prompt.format(question=example["question"]))

ds = ds.map(ask_llm)
```
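`query_model` above is left undefined, so here is a self-contained sketch of why async functions help: the awaits overlap instead of running one after another. The `query_model` below is a hypothetical stand-in that just sleeps:

```python
import asyncio

# Hypothetical async "model call": sleeps briefly and echoes the question.
async def query_model(prompt):
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def main():
    questions = ["q1", "q2", "q3"]
    # map() runs async functions concurrently; asyncio.gather shows the same
    # effect: the three awaits run overlapped rather than serially.
    return await asyncio.gather(*(query_model(q) for q in questions))

answers = asyncio.run(main())
print(answers[0])  # answer to: q1
```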
Add repeat method to datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7198
```python
ds = ds.repeat(10)
```
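A minimal sketch of what `repeat(10)` means, using a plain list as a stand-in dataset (an illustration, not the library's implementation):

```python
# repeat() concatenates the dataset with itself num_times times,
# preserving example order within each pass.
def repeat(examples, num_times):
    return examples * num_times

ds = [{"a": 0}, {"a": 1}]
print(len(repeat(ds, 10)))  # 20
```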
Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in https://github.com/huggingface/datasets/pull/7370
```python
import polars as pl
from datasets import load_dataset

ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
ds = ds.with_format("polars")
expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
ds = ds.map(lambda df: df.with_columns(expr), batched=True)
```
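For readers unfamiliar with the polars expression above, here is a pure-Python equivalent of the `boxed\{(.*)\}` extraction using `re` (an illustrative sketch, not part of the release):

```python
import re

# Pull the final answer out of a LaTeX-style \boxed{...} marker in a
# solution string, mirroring the polars str.extract expression.
def extract_boxed(solution):
    match = re.search(r"boxed\{(.*)\}", solution)
    return match.group(1) if match else None

print(extract_boxed(r"... so the answer is \boxed{42}"))  # 42
```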
Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7207
Full Changelog: https://github.com/huggingface/datasets/compare/3.2.0...3.3.0
```python
from datasets import load_dataset

filters = [('date', '>=', '2023')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
```
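The `filters` argument takes parquet-style predicate tuples. A hedged pure-Python sketch of what a tuple like `('date', '>=', '2023')` means when applied to rows (in practice the predicate is pushed down to the parquet reader, so non-matching row groups can be skipped entirely):

```python
import operator

# Map filter operator strings to Python comparison functions.
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def matches(row, filters):
    # A row passes when every (column, op, value) predicate holds.
    return all(OPS[op](row[col], value) for col, op, value in filters)

rows = [{"date": "2022-12-31"}, {"date": "2023-05-01"}]
kept = [r for r in rows if matches(r, [("date", ">=", "2023")])]
print(len(kept))  # 1
```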
`ClassLabel` by @sergiopaniego in https://github.com/huggingface/datasets/pull/7293
Full Changelog: https://github.com/huggingface/datasets/compare/3.1.0...3.2.0
```python
>>> from datasets import Dataset, Video, load_dataset
>>> ds = Dataset.from_dict({"video": ["path/to/Screen Recording.mov"]}).cast_column("video", Video())
>>> # or from the hub
>>> ds = load_dataset("username/dataset_name", split="train")
>>> ds[0]["video"]
<decord.video_reader.VideoReader at 0x105525c70>
```
```python
>>> from datasets import load_dataset
>>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
>>> full_ds.num_shards
2360
>>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
>>> ds.num_shards
1
>>> ds = full_ds.shard(num_shards=8, index=0)
>>> ds.num_shards
295
```
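The shard counts above follow from simple arithmetic: 2360 underlying files split into 8 contiguous groups gives 295 files per shard. A sketch of that assignment (an illustration, not the library's exact code):

```python
# Split num_files into num_shards contiguous groups; when the division is
# uneven, the first `mod` shards get one extra file.
def shard_sizes(num_files, num_shards):
    div, mod = divmod(num_files, num_shards)
    return [div + (1 if i < mod else 0) for i in range(num_shards)]

sizes = shard_sizes(2360, 8)
print(sizes[0])  # 295
```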
Full Changelog: https://github.com/huggingface/datasets/compare/3.0.2...3.1.0
Full Changelog: https://github.com/huggingface/datasets/compare/3.0.1...3.0.2
Full Changelog: https://github.com/huggingface/datasets/compare/3.0.0...3.0.1
Allow Polars as valid output type in `.map()` by @psmyth94 in https://github.com/huggingface/datasets/pull/6762
Example:
```python
>>> import polars as pl
>>> from datasets import load_dataset
>>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
>>> cols = [pl.col("content").str.len_bytes().alias("length")]
>>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
>>> ds_with_length[:5]
shape: (5, 5)
┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
│ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
│ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
│ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
│ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
│ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
│ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
│ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
│ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
└─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
```
huggingface_hub cache by @lhoestq in https://github.com/huggingface/datasets/pull/7105
`huggingface_hub` cache for files downloaded from HF, by default at `~/.cache/huggingface/hub`
`datasets` cache, by default at `~/.cache/huggingface/datasets`
Removed: `use_auth_token`, `fs` or `ignore_verifications`
Removed: `load_metric`, please use the `evaluate` library instead
Removed: the `task` argument in `load_dataset()`, the `.prepare_for_task()` method and the `datasets.tasks` module
`cache_dir` from `cache_file_name` by @ringohoffman in https://github.com/huggingface/datasets/pull/7096
Full Changelog: https://github.com/huggingface/datasets/compare/2.21.0...3.0.0
```python
import polars as pl
from datasets import Dataset

df1 = pl.from_dict({"col_1": [[1, 2], [3, 4]]})
df2 = Dataset.from_polars(df1).to_polars()
assert df1.equals(df2)
```
`HF_HUB_OFFLINE` instead of `HF_DATASETS_OFFLINE` by @Wauplin in https://github.com/huggingface/datasets/pull/6968
Full Changelog: https://github.com/huggingface/datasets/compare/2.20.0...2.21.0
`trust_remote_code=True` by @lhoestq in https://github.com/huggingface/datasets/pull/6954
Datasets with a loading script now require `trust_remote_code=True` to be used.
You can now checkpoint and resume an iterable dataset (e.g. when streaming):
```python
>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
>>> for idx, example in enumerate(iterable_dataset):
...     print(example)
...     if idx == 2:
...         state_dict = iterable_dataset.state_dict()
...         print("checkpoint")
...         break
>>> iterable_dataset.load_state_dict(state_dict)
>>> print("restart from checkpoint")
>>> for example in iterable_dataset:
...     print(example)
```
Returns:

```
{'a': 0}
{'a': 1}
{'a': 2}
checkpoint
restart from checkpoint
{'a': 3}
{'a': 4}
{'a': 5}
```
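The mechanics can be sketched in plain Python: the state is essentially how far the stream has been consumed, and resuming skips ahead. This is an illustrative toy, not the library's implementation:

```python
# Toy resumable stream: the checkpoint is just the consumed position.
class ResumableStream:
    def __init__(self, data):
        self.data = data
        self.position = 0

    def __iter__(self):
        while self.position < len(self.data):
            example = self.data[self.position]
            self.position += 1
            yield example

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]

stream = ResumableStream([{"a": i} for i in range(6)])
it = iter(stream)
consumed = [next(it) for _ in range(3)]   # consume 3 examples
state = stream.state_dict()               # checkpoint

fresh = ResumableStream([{"a": i} for i in range(6)])
fresh.load_state_dict(state)              # restart from checkpoint
resumed = list(fresh)
print(resumed[0])  # {'a': 3}
```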
`.pth` support for torch tensors by @lhoestq in https://github.com/huggingface/datasets/pull/6920
`dataset_module_factory` by @Wauplin in https://github.com/huggingface/datasets/pull/6959
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.20.0
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.1...2.19.2
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.19.1
`.to_polars()`:

```python
import polars as pl
from datasets import load_dataset

ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
ds.to_polars() \
    .group_by("topic") \
    .agg(pl.len(), pl.first()) \
    .sort("len", descending=True)

ds = ds.with_format("polars")
ds[:10].group_by("kind").len()
```
fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in https://github.com/huggingface/datasets/pull/6096
```python
ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
```
mode parameter to Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/6735
```python
dataset = dataset.cast_column("image", Image(mode="RGB"))
```
```shell
datasets-cli convert_to_parquet <dataset_id>
```
```python
ds = ds.take(10)  # take only the first 10 examples
```
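`take(n)` on a streaming dataset behaves like `itertools.islice`: it yields the first n examples without consuming the rest of the stream. A sketch of the semantics on a plain generator (illustrative only):

```python
from itertools import islice

# Yield at most n examples from an iterable, leaving the rest unread.
def take(iterable, n):
    return list(islice(iterable, n))

stream = ({"id": i} for i in range(1_000_000))
first_ten = take(stream, 10)
print(len(first_ten))  # 10
```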
`remove_columns`/`rename_columns` doc fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6772
`uv` in CI by @mariosasko in https://github.com/huggingface/datasets/pull/6779
`_check_legacy_cache2` by @lhoestq in https://github.com/huggingface/datasets/pull/6792
`DatasetBuilder._split_generators` incomplete type annotation by @JonasLoos in https://github.com/huggingface/datasets/pull/6799
`CachedDatasetModuleFactory` and `Cache` by @izhx in https://github.com/huggingface/datasets/pull/6754
`os.path.relpath` in `resolve_patterns` by @mariosasko in https://github.com/huggingface/datasets/pull/6815
`Dataset.__getitem__` by @mariosasko in https://github.com/huggingface/datasets/pull/6817
Full Changelog: https://github.com/huggingface/datasets/compare/2.18.0...2.19.0
`num_workers` could lead to incorrect shards assignments to workers and cause errors
`xlistdir` by @mariosasko in https://github.com/huggingface/datasets/pull/6698
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.1...2.18.0
`arrow_writer.py` from #6636 by @bryant1410 in https://github.com/huggingface/datasets/pull/6664
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.0...2.17.1
`drop_last_batch` in map after shuffling or sharding by @lhoestq in https://github.com/huggingface/datasets/pull/6575
`setup.cfg` to `pyproject.toml` by @mariosasko in https://github.com/huggingface/datasets/pull/6619
`tqdm` bars in non-interactive environments by @mariosasko in https://github.com/huggingface/datasets/pull/6627
`with_rank` param to `Dataset.filter` by @mariosasko in https://github.com/huggingface/datasets/pull/6608
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.1...2.17.0
`cache_dir` to `load_dataset`
`load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")`
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.0...2.16.1
https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument `trust_remote_code=True`.
`trust_remote_code=True` will be mandatory to load these datasets from the next major release of datasets.
With `HF_DATASETS_TRUST_REMOTE_CODE=0` you can already disable custom code by default without waiting for the next release of datasets.
https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
The `load_dataset` step that lists the data files of big repositories is faster (up to x100) but requires huggingface_hub 0.20 or newer.
Fix in `load_dataset` that used to reload data from cache even if the dataset was updated on Hugging Face.
`~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha`
Datasets cached with datasets 2.15 (using the old scheme) are still reloaded from cache.
`_get_data_files_patterns` by @lhoestq in https://github.com/huggingface/datasets/pull/6343
`usedforsecurity=False` in hashlib methods (FIPS compliance) by @Wauplin in https://github.com/huggingface/datasets/pull/6414
`ruff` for formatting by @mariosasko in https://github.com/huggingface/datasets/pull/6434
`tqdm` wrapper by @mariosasko in https://github.com/huggingface/datasets/pull/6433
`Table.__getstate__` and `Table.__setstate__` by @LZHgrla in https://github.com/huggingface/datasets/pull/6444
`filelock` package for file locking by @mariosasko in https://github.com/huggingface/datasets/pull/6445
`**` by @mariosasko in https://github.com/huggingface/datasets/pull/6449
`dill` logic by @mariosasko in https://github.com/huggingface/datasets/pull/6454
`push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6461
`__repr__` by @lhoestq in https://github.com/huggingface/datasets/pull/6480
`torch.Generator` objects by @mariosasko in https://github.com/huggingface/datasets/pull/6502
`list_files_info` with `list_repo_tree` in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6510
Full Changelog: https://github.com/huggingface/datasets/compare/2.15.0...2.16.0