Hugging Face Datasets: Release Notes
Dec 21, 2021

Dataset Changes

Dataset Features

Dataset cards

Dataset Tasks

Metric Changes

Docs

Additional improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/1.16.1...1.17.0

Nov 26, 2021

Bug fixes

Datasets Changes

Datasets Features

Dataset Cards

Metrics Changes

Documentation

Additional improvements and bug fixes

Citation

Deprecations

Full Changelog: https://github.com/huggingface/datasets/compare/1.15.1...1.16.0

Nov 2, 2021

Dependencies

Dataset Changes

Dataset Features

Dataset Cards

Metrics Changes

General improvements and bug fixes

Oct 19, 2021

Dataset changes

  • Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
  • Update: SUPERB - use Audio features #3101 (@anton-l)
  • Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)

Dataset features

  • Add iter_archive #3066 (@lhoestq)

General improvements and bug fixes

  • Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
  • Fix project description in PyPI #3103 (@albertvillanova)
  • Align tqdm control with cache control #3031 (@mariosasko)
  • Add paper BibTeX citation #3107 (@albertvillanova)
Oct 15, 2021

Dataset changes

  • Update: Adapt all audio datasets #3081 (@patrickvonplaten)

Bug fixes

  • Update BibTeX entry #3090 (@albertvillanova)
  • Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)
  • Fix Audio feature mp3 resampling #3096 (@albertvillanova)
Oct 14, 2021

Bug fixes

  • Fix error related to huggingface_hub timeout parameter #3082 (@albertvillanova)
  • Remove _resampler from Audio fields #3086 (@albertvillanova)

Bug fixes

  • Fix loading a metric with internal import #3077 (@albertvillanova)
Oct 13, 2021

Dataset changes

  • New: CaSiNo #2867 (@kushalchawla)
  • New: Mostly Basic Python Problems #2893 (@lvwerra)
  • New: OpenAI's HumanEval #2897 (@lvwerra)
  • New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
  • New: SEDE #2942 (@Hazoom)
  • New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
  • New: AMI #2853 (@cahya-wirawan)
  • New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
  • New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
  • New: KanHope #2985 (@adeepH)
  • New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
  • New: SwedMedNER #2940 (@bwang482)
  • New: SberQuAD #3039 (@Alenush)
  • New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
  • New: Greek Legal Code #2966 (@christospi)
  • New: Story Cloze Test #3067 (@zaidalyafeai)
  • Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
  • Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
  • Update: TriviaQA - add web and wiki config #2949 (@shirte)
  • Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
  • Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
  • Update: Biosses - fix column names #3054 (@bwang482)
  • Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
  • Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
  • Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
  • Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
  • Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
  • Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)

Metric changes

  • Update: meteor - update from nltk update #2946 (@lhoestq)
  • Update: accuracy, f1, glue, indic-glue, pearsonr, precision, recall, super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
  • Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
  • Fix meteor metric for version >= 3.6.4 #3056 (@albertvillanova)

Dataset features

  • Use with TensorFlow:
    • Adding to_tf_dataset method #2731 #2931 #2951 #2974 (@Rocketknight1)
  • Better support for ZIP files:
    • Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
    • Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
  • Streaming improvements:
    • Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
    • Add remove_columns to IterableDataset #3030 (@cccntu)
    • All the above ZIP features also work in streaming mode
  • New utilities:
    • Add get_dataset_split_names() to get a dataset config's split names #2906 (@severo)
  • Replace script_version with revision #2933 (@albertvillanova)
    • The script_version parameter in load_dataset is now deprecated, in favor of revision
  • Experimental - Create Audio feature type #2324 (@albertvillanova):
    • It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
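The Audio feature's decode-on-access behavior can be illustrated with a stdlib-only sketch. The `LazyAudio` class below is hypothetical (not the library's implementation): it keeps the raw bytes and decodes samples only the first time they are accessed.

```python
import array
import io
import wave

class LazyAudio:
    """Hypothetical sketch: keep raw WAV bytes, decode samples only on access."""
    def __init__(self, wav_bytes):
        self._raw = wav_bytes
        self._samples = None  # nothing decoded yet

    @property
    def samples(self):
        if self._samples is None:  # decode on first access only
            with wave.open(io.BytesIO(self._raw), "rb") as w:
                frames = w.readframes(w.getnframes())
            self._samples = array.array("h", frames)  # 16-bit PCM samples
        return self._samples

# Build a tiny 16-bit mono WAV in memory for the demonstration.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(array.array("h", [0, 1000, -1000, 0]).tobytes())

audio = LazyAudio(buf.getvalue())
print(list(audio.samples))  # decoded only here → [0, 1000, -1000, 0]
```

The point of the pattern is that constructing the example is cheap; the decoding cost is paid only for examples that are actually read.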

Dataset cards

  • Add arxiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)

Documentation

  • Add tutorial for no-code dataset upload #2925 (@stevhliu)

General improvements and bug fixes

  • Fix filter leaking #3019 (@lhoestq)
    • Calling filter several times in a row did not return the right results in 1.12.0 and 1.12.1
  • Update BibTeX entry #2928 (@albertvillanova)
  • Fix exception chaining #2911 (@albertvillanova)
  • Add regression test for null Sequence #2929 (@albertvillanova)
  • Don't use old, incompatible cache for the new filter #2947 (@lhoestq)
  • Fix fn kwargs in filter #2950 (@lhoestq)
  • Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
  • Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
  • Fix missing conda deps #2952 (@lhoestq)
  • Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
  • Support pandas 1.3 new read_csv parameters #2960 (@SBrandeis)
  • Fix CI doc build #2961 (@albertvillanova)
  • Run tests in parallel #2954 (@albertvillanova)
  • Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
  • Take namespace into account in caching #2938 (@lhoestq)
  • Make Dataset.map accept list of np.array #2990 (@albertvillanova)
  • Fix loading compressed CSV without streaming #2994 (@albertvillanova)
  • Fix json loader when conversion not implemented #3000 (@lhoestq)
  • Remove all query parameters when extracting protocol #2996 (@albertvillanova)
  • Correct a typo #3007 (@Yann21)
  • Fix Windows test suite #3025 (@albertvillanova)
  • Remove unused parameter in xdirname #3017 (@albertvillanova)
  • Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
  • Fix typo #3023 (@qqaatw)
  • Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
  • Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
  • Use cache folder for lockfile #2887 (@Dref360)
  • Fix streaming: catch Timeout error #3050 (@borisdayma)
  • Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
  • Fix task reloading from cache #3059 (@lhoestq)
  • Fix test command after refac #3065 (@lhoestq)
  • Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
  • Update summary on PyPi beyond NLP #3062 (@thomwolf)
  • Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
  • feat: increase streaming retry config #3068 (@borisdayma)
  • Fix pathlib patches for streaming #3072 (@lhoestq)

Breaking changes:

  • Due to the big refactoring in #2986, the prepare_module function no longer supports the return_resolved_file_path and return_associated_base_path parameters. As an alternative, use dataset_module_factory instead.
Sep 15, 2021

Bug fixes

  • Fix fsspec AbstractFileSystem access #2915 (@pierre-godard)
  • Fix unwanted tqdm bar when accessing examples #2920 (@lhoestq)
  • Fix conversion of multidim arrays in list to arrow #2922 (@lhoestq):
    • this fixes the ArrowInvalid: Can only convert 1-dimensional array values errors
Sep 13, 2021

New documentation

  • New documentation structure #2718 (@stevhliu):
    • New: Tutorials
    • New: How-to guides
    • New: Conceptual guides
    • Update: Reference

See the new documentation here!

Datasets changes

  • New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
  • New: The Pile books3 #2801 (@richarddwang)
  • New: The Pile stack exchange #2803 (@richarddwang)
  • New: The Pile openwebtext2 #2802 (@richarddwang)
  • New: Food-101 #2804 (@nateraw)
  • New: Beans #2809 (@nateraw)
  • New: cedr #2796 (@naumov-al)
  • New: cats_vs_dogs #2807 (@nateraw)
  • New: MultiEURLEX #2865 (@iliaschalkidis)
  • New: BIOSSES #2881 (@bwang482)
  • Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
  • Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
  • Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
  • Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
  • Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
  • Update: SUPERB - Add SD task #2661 (@albertvillanova)
  • Update: SUPERB - Add KS task #2783 (@anton-l)
  • Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
  • Update: Openwebtext - update size #2857 (@lhoestq)
  • Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
  • Fix: journalists_questions - fix key by recreating metadata JSON #2744 (@albertvillanova)
  • Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
  • Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
  • Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
  • Fix: linnaeus - fix url #2852 (@lhoestq)
  • Fix ToTTo - fix data URL #2864 (@albertvillanova)
  • Fix: wikicorpus - fix keys #2844 (@lhoestq)
  • Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
  • Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)

Datasets features

  • Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
  • Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
  • add multi-proc in to_json #2747 (@bhavitvyamalik)
  • Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
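The filter optimization avoids copying the kept rows by computing only the indices to keep. A stdlib-only sketch of that idea on a plain dict-of-columns (`filter_indices` is a hypothetical helper, not the library's code):

```python
# Sketch of the idea behind the filter optimization: compute which row
# indices pass the predicate instead of materializing a new table.
data = {"text": ["good", "bad", "great", "poor"], "label": [1, 0, 1, 0]}

def filter_indices(columns, predicate):
    n = len(next(iter(columns.values())))
    rows = ({k: v[i] for k, v in columns.items()} for i in range(n))
    return [i for i, row in enumerate(rows) if predicate(row)]

keep = filter_indices(data, lambda row: row["label"] == 1)
print(keep)                             # → [0, 2]
print([data["text"][i] for i in keep])  # → ['good', 'great']
```

Keeping an indices mapping instead of copied data is what makes filter cheap: the underlying storage is untouched and rows are resolved through the mapping on access.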

Dataset streaming - better support for compression:

  • Fix streaming zip files #2798 (@albertvillanova)
  • Support streaming tar files #2800 (@albertvillanova)
  • Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
  • Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
  • Add url prefix convention for many compression formats #2822 (@lhoestq)
  • Support streaming datasets that use pathlib #2874 (@albertvillanova)
  • Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
  • Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
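The URL prefix convention chains a compression protocol in front of the underlying URL, fsspec-style, e.g. `zip://train.csv::https://host/data.zip`. An illustrative parser for that chained form (assuming the `::`-separated layout; this is not the library's code):

```python
def split_chained_url(url):
    """Illustrative parser for fsspec-style chained URLs such as
    'zip://train.csv::https://host/data.zip' (not the library's actual code)."""
    parsed = []
    for link in url.split("::"):
        protocol, _, path = link.partition("://")
        # A link without '://' is treated as a plain local path.
        parsed.append((protocol if path else "file", path or link))
    return parsed

print(split_chained_url("zip://train.csv::https://host/data.zip"))
# → [('zip', 'train.csv'), ('https', 'host/data.zip')]
```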

Metrics changes

  • Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
  • Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)

Dataset cards

  • Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
  • Update ELI5 README.md #2848 (@odellus)

General improvements and bug fixes

  • Update release instructions #2740 (@albertvillanova)
  • Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
  • Allow PyArrow from source #2769 (@patrickvonplaten)
  • fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
  • Fix typo in test_dataset_common #2790 (@nateraw)
  • Fix type hint for data_files #2793 (@albertvillanova)
  • Bump tqdm version #2814 (@mariosasko)
  • Use packaging to handle versions #2777 (@albertvillanova)
  • Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
  • Rename The Pile subsets #2817 (@lhoestq)
  • Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
  • Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
  • Fix extraction protocol inference from urls with params #2843 (@lhoestq)
  • Fix caching when moving script #2854 (@lhoestq)
  • Fix windows CI CondaError #2855 (@lhoestq)
  • fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
  • Update column_names showed as :func: in exploring.rst #2851 (@ClementRomac)
  • Fix s3fs version in CI #2858 (@lhoestq)
  • Fix three typos in two files for documentation #2870 (@leny-mi)
  • Move checks from _map_single to map #2660 (@mariosasko)
  • fix regex to accept negative timezone #2847 (@jadermcs)
  • Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
  • Fix null sequence encoding #2900 (@lhoestq)
Jul 30, 2021

Datasets Changes

  • New: Add Russian SuperGLUE #2668 (@slowwavesleep)
  • New: Add Disfl-QA #2473 (@bhavitvyamalik)
  • New: Add TimeDial #2476 (@bhavitvyamalik)
  • Fix: Enumerate all ner_tags values in WNUT 17 dataset #2713 (@albertvillanova)
  • Fix: Update WikiANN data URL #2710 (@albertvillanova)
  • Fix: Update PAN-X data URL in XTREME dataset #2715 (@albertvillanova)
  • Fix: C4 - en subset by modifying dataset_info with correct validation infos #2723 (@thomasw21)

General improvements and bug fixes

  • fix: 🐛 change string format to allow copy/paste to work in bash #2694 (@severo)
  • Update BibTeX entry #2706 (@albertvillanova)
  • Print absolute local paths in load_dataset error messages #2684 (@mariosasko)
  • Add support for disable_progress_bar on Windows #2696 (@mariosasko)
  • Ignore empty batch when writing #2698 (@pcuenca)
  • Fix shuffle on IterableDataset that disables batching in case any functions were mapped #2717 (@amankhandelia)
  • fix: 🐛 fix two typos #2720 (@severo)
  • Docs details #2690 (@severo)
  • Deal with the bad check in test_load.py #2721 (@mariosasko)
  • Pass use_auth_token to request_etags #2725 (@albertvillanova)
  • Typo fix tokenize_exemple #2726 (@shabie)
  • Fix IndexError while loading Arabic Billion Words dataset #2729 (@albertvillanova)
  • Add missing parquet known extension #2733 (@lhoestq)
Jul 22, 2021

The error message indicating which dataset config name to load was not displayed:

  • Fix pick default config name message #2704 (@lhoestq)

Docstrings:

  • Fix download_mode docstrings #2701 (@albertvillanova)
  • Fix minimum tqdm version and import on Colab #2697 (@nateraw)
  • Fix OSCAR Esperanto #2693 (@lhoestq)
Jul 21, 2021

Datasets Features

  • Support remote data files #2616 (@albertvillanova). This allows passing URLs of remote data files to any dataset loader:
    load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file...]})
    
    This works for all these dataset loaders:
    • text
    • csv
    • json
    • parquet
    • pandas
  • Streaming from remote text/json/csv/parquet/pandas files: When you pass URLs to a dataset loader, you can enable streaming mode with streaming=True. Main contributions:
    • Streaming for the Pandas loader #2636 (@lhoestq)
    • Streaming for the CSV loader #2635 (@lhoestq)
    • Streaming for the Json loader #2608 (@albertvillanova) #2638 (@lhoestq)
  • Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
  • Delete extracted files when loading dataset #2631 (@albertvillanova)
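Streaming mode yields examples on the fly instead of downloading and parsing whole files up front. A conceptual stdlib sketch of that row-at-a-time behavior over several CSV sources (in-memory buffers stand in for remote URLs here; `stream_csv_rows` is illustrative, not the library's loader):

```python
import csv
import io
import itertools

def stream_csv_rows(files):
    """Conceptual sketch of streaming mode: yield rows one at a time
    from several CSV sources without materializing them in memory."""
    for f in files:
        yield from csv.DictReader(f)

# Two in-memory "files" stand in for remote URLs.
a = io.StringIO("text,label\nhello,1\nworld,0\n")
b = io.StringIO("text,label\nfoo,1\n")

# Only the rows actually consumed are ever read.
first_two = list(itertools.islice(stream_csv_rows([a, b]), 2))
print(first_two)
# → [{'text': 'hello', 'label': '1'}, {'text': 'world', 'label': '0'}]
```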

Datasets Changes

  • Fix: C4 - fix expected files list #2682 (@lhoestq)
  • Fix: SQuAD - fix misalignment #2586 (@albertvillanova)
  • Fix: omp - fix DuplicatedKeysError #2603 (@albertvillanova)
  • Fix: wi_locness - potential DuplicatedKeysError #2609 (@albertvillanova)
  • Fix: LibriSpeech - potential DuplicatedKeysError #2672 (@albertvillanova)
  • Fix: SQuAD - potential DuplicatedKeysError #2673 (@albertvillanova)
  • Fix: Blog Authorship Corpus - fix split sizes and text encoding #2685 (@albertvillanova)

Dataset Tasks

  • Add speech processing tasks #2620 (@lewtun)
  • Update ASR tags #2633 (@lewtun)
  • Inject ASR template for lj_speech dataset #2634 (@albertvillanova)
  • Add ASR task for SUPERB #2619 (@lewtun)
  • add image-classification task template #2632 (@nateraw)

Metrics Changes

  • New: wiki_split #2623 (@bhadreshpsavani)
  • Update: accuracy,f1,precision,recall - Support multilabel metrics #2589 (@albertvillanova)
  • Fix: sacrebleu - fix parameter name #2674 (@albertvillanova)

General improvements and bug fixes

  • Fix BibTeX entry #2594 (@albertvillanova)
  • Fix test_is_small_dataset #2588 (@albertvillanova)
  • Remove import of transformers #2602 (@albertvillanova)
  • Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) #2605 (@lhoestq)
  • Fix filter with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
  • Remove redundant prepare_module #2597 (@albertvillanova)
  • Create ExtractManager #2295 (@albertvillanova)
  • Return Python float instead of numpy.float64 in sklearn metrics #2612 (@lewtun)
  • Use ndarray.item instead of ndarray.tolist #2613 (@lewtun)
  • Convert numpy scalar to python float in Pearsonr output #2614 (@lhoestq)
  • Fix missing EOL issue in to_json for old versions of pandas #2617 (@lhoestq)
  • Use correct logger in metrics.py #2626 (@mariosasko)
  • Minor fix tests with Windows paths #2627 (@albertvillanova)
  • Use ETag of remote data files #2628 (@albertvillanova)
  • More consistent naming #2611 (@mariosasko)
  • Refactor patching to specific submodule #2639 (@albertvillanova)
  • Fix docstrings #2640 (@albertvillanova)
  • Fix anchor in README #2647 (@mariosasko)
  • Fix logging docstring #2652 (@mariosasko)
  • Allow dataset config kwargs to be None #2659 (@lhoestq)
  • Use prefix to allow exceed Windows MAX_PATH #2621 (@albertvillanova)
  • Use tqdm from tqdm_utils #2667 (@mariosasko)
  • Increase json reader block_size automatically #2676 (@lhoestq)
  • Parallelize ETag requests #2675 (@lhoestq)
  • Fix bad config ids that name cache directories #2686 (@lhoestq)
  • Minor documentation fix #2687 (@slowwavesleep)

Dataset Cards

  • Add missing WikiANN language tags #2610 (@albertvillanova)
  • feat: 🎸 add paperswithcode id for qasper dataset #2680 (@severo)

Docs

  • Update processing.rst with other export formats #2599 (@TevenLeScao)
Jul 5, 2021

Datasets Changes

  • New: C4 #2575 #2592 (@lhoestq)
  • New: mC4 #2576 (@lhoestq)
  • New: MasakhaNER #2465 (@dadelani)
  • New: Eduge #2492 (@enod)
  • Update: xor_tydi_qa - update version #2455 (@cccntu)
  • Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
  • Update: udpos - change features structure #2466 (@jerryIsHere)
  • Update: WebNLG - update checksums #2558 (@lhoestq)
  • Fix: climate fever - adjusting indexing for the labels. #2464 (@drugilsberg)
  • Fix: proto_qa - fix download link #2463 (@mariosasko)
  • Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
  • Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
  • Fix: code_search_net - fix keys #2555 (@lhoestq)
  • Fix: discofuse - fix link cc #2541 (@VictorSanh)
  • Fix: fever - fix keys #2557 (@lhoestq)

Datasets Features

  • Dataset Streaming #2375 #2582 (@lhoestq)
    • Fast download and process your data on-the-fly when iterating over your dataset
    • Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
  • JAX integration #2502 (@lhoestq)
  • Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
  • Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
  • Set configurable downloaded datasets path #2488 (@albertvillanova)
  • Set configurable extracted datasets path #2487 (@albertvillanova)
  • Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
  • Add interleave_datasets for map-style datasets #2568 (@lhoestq)
  • Add load_dataset_builder #2500 (@mariosasko)
  • Support Zstandard compressed files #2578 (@albertvillanova)
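interleave_datasets alternates examples from its input datasets. A stdlib sketch of the round-robin idea, stopping at the first exhausted source (one simple strategy; the `interleave` function here is illustrative, not the library's implementation):

```python
from itertools import cycle

def interleave(*iterables):
    """Round-robin sketch of interleaving: alternate examples from each
    source, stopping when the first source runs out."""
    iterators = [iter(it) for it in iterables]
    for it in cycle(iterators):
        try:
            yield next(it)
        except StopIteration:
            return  # first exhausted source ends the stream

print(list(interleave([1, 3, 5], [2, 4, 6])))  # → [1, 2, 3, 4, 5, 6]
```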

Task templates

  • Add task templates for tydiqa and xquad #2518 (@lewtun)
  • Insert text classification template for Emotion dataset #2521 (@lewtun)
  • Add summarization template #2529 (@lewtun)
  • Add task template for automatic speech recognition #2533 (@lewtun)
  • Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
  • Inject templates for ASR datasets #2565 (@lewtun)

General improvements and bug fixes

  • Allow to use tqdm>=4.50.0 #2482 (@lhoestq)
  • Use gc.collect only when needed to avoid slow downs #2483 (@lhoestq)
  • Allow latest pyarrow version #2490 (@albertvillanova)
  • Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
  • Add Zenodo metadata file with license #2501 (@albertvillanova)
  • add tensorflow-macos support #2493 (@slayerjain)
  • Keep original features order #2453 (@albertvillanova)
  • Add course banner #2506 (@sgugger)
  • Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
  • Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
  • Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
  • Improve performance of pandas arrow extractor #2519 (@albertvillanova)
  • Fix fingerprint when moving cache dir #2509 (@lhoestq)
  • Replace bad n>1M size tag #2527 (@lhoestq)
  • Fix dev version #2531 (@lhoestq)
  • Sync with transformers disabling NOTSET #2534 (@albertvillanova)
  • Fix logging levels #2544 (@albertvillanova)
  • Add support for Split.ALL #2259 (@mariosasko)
  • Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
  • Make numpy arrow extractor faster #2505 (@lhoestq)
  • fix Dataset.map when num_procs > num rows #2566 (@connor-mccarthy)
  • Add ASR task and new languages to resources #2567 (@lewtun)
  • Filter expected warning log from transformers #2571 (@albertvillanova)
  • Fix BibTeX entry #2579 (@albertvillanova)
  • Fix Counter import #2580 (@albertvillanova)
  • Add aiohttp to tests extras require #2587 (@albertvillanova)
  • Add language tags #2590 (@lewtun)
  • Support pandas 1.3.0 read_csv #2593 (@lhoestq)

Dataset cards

  • Updated Dataset Description #2420 (@binny-mathew)
  • Update DatasetMetadata and ReadMe #2436 (@gchhablani)
  • CRD3 dataset card #2515 (@wilsonyhlee)
  • Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
  • wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)

Docs

  • no s at load_datasets #2479 (@julien-c)
  • Fix docs custom stable version #2477 (@albertvillanova)
  • Improve Features docs #2535 (@albertvillanova)
  • Update README.md #2414 (@cryoff)
  • Fix FileSystems documentation #2551 (@connor-mccarthy)
  • Minor fix in loading metrics docs #2562 (@albertvillanova)
  • Minor fix docs format for bertscore #2570 (@albertvillanova)
  • Add streaming in load a dataset docs #2574 (@lhoestq)
Jun 8, 2021

Datasets Changes

  • New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
  • New: KLUE benchmark #2416 (@jungwhank)
  • New: HendrycksTest #2370 (@andyzoujm)
  • Update: xor_tydi_qa - update url to v1.1 #2449 (@cccntu)
  • Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
  • Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs for #2445 (@lewtun)
  • Fix: flores - fix download link #2448 (@mariosasko)

Datasets Features

  • Add desc parameter in map for DatasetDict object #2423 (@bhavitvyamalik)
  • Support sliced list arrays in cast #2461 (@lhoestq)
    • Dataset.cast can now change the feature types of Sequence fields
  • Revert default in-memory for small datasets #2460 (@albertvillanova). Breaking:
    • datasets smaller than IN_MEMORY_MAX_SIZE (previously 250MB) used to be loaded in memory
    • this threshold is now zero: by default, datasets are loaded from disk with memory mapping and are not copied into memory
    • users can still set keep_in_memory=True when loading a dataset to load it in memory
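Memory mapping is what makes the zero-copy default cheap: the OS pages the file in on demand rather than reading it all into process memory. A stdlib illustration with mmap (the file layout here is arbitrary, just for demonstration):

```python
import mmap
import os
import tempfile

# Write a 1 MB file, then map it instead of reading it into memory.
path = os.path.join(tempfile.mkdtemp(), "table.bin")
with open(path, "wb") as f:
    f.write(b"x" * 1_000_000)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slicing touches only the pages needed; no full copy is made.
        size, head = len(mm), mm[:2]

print(size, head)  # → 1000000 b'xx'
```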

Datasets Cards

  • adds license information for DailyDialog. #2419 (@aditya2211)
  • add english language tags for ~100 datasets #2442 (@VictorSanh)
  • Add copyright info to MLSUM dataset #2427 (@PhilipMay)
  • Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
  • Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)

General improvements and bug fixes

  • Add DOI badge to README #2411 (@albertvillanova)
  • Make datasets PEP-561 compliant #2417 (@SBrandeis)
  • Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
  • Fix CI six installation on linux #2432 (@lhoestq)
  • Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
  • Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
  • doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
  • add utf-8 while reading README #2418 (@bhavitvyamalik)
  • Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
  • Rename config and environment variable for in memory max size #2454 (@albertvillanova)
  • Add version-specific BibTeX #2430 (@albertvillanova)
  • Fix cross-reference typos in documentation #2456 (@albertvillanova)
  • Better error message when using the wrong load_from_disk #2437 (@lhoestq)

Experimental and work in progress: Format a dataset for specific tasks

  • Update text classification template labels in DatasetInfo post_init #2392 (@lewtun)
  • Insert task templates for text classification #2389 (@lewtun)
  • Rename QuestionAnswering template to QuestionAnsweringExtractive #2429 (@lewtun)
  • Insert Extractive QA templates for SQuAD-like datasets #2435 (@lewtun)
May 27, 2021

Dataset Changes

  • New: NLU evaluation data #2238 (@dkajtoch)
  • New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
  • New: Bbaw egyptian #2290 (@phiwi)
  • New: GooAQ #2260 (@bhavitvyamalik)
  • New: SubjQA #2302 (@lewtun)
  • New: Ascent KB #2341, #2349 (@phongnt570)
  • New: HLGD #2325 (@tingofurro)
  • New: Qasper #2346 (@cceyda)
  • New: ConvQuestions benchmark #2372 (@PhilippChr)
  • Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
  • Update: multi_woz_v22 - update checksum #2281 (@lhoestq)
  • Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
  • Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
  • Update: GEM - update the DART file checksums #2334 (@yjernite)
  • Update: web_science - fixed download link #2338 (@bhavitvyamalik)
  • Update: SNLI, MNLI- README updated for SNLI, MNLI #2364 (@bhavitvyamalik)
  • Update: conll2003 - correct labels #2369 (@philschmid)
  • Update: offenseval_dravidian - update citations #2385 (@adeepH)
  • Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
  • Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
  • Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
  • Fix: indic_glue - Fix number of classes in indic_glue sna.bn dataset #2397 (@albertvillanova)
  • Fix: head_qa - Fix keys #2408 (@lhoestq)

Dataset Features

  • Implement Dataset add_item #1870 (@albertvillanova)
  • Implement Dataset add_column #2145 (@albertvillanova)
  • Implement Dataset to JSON #2248, #2352 (@albertvillanova)
  • Add rename_columns method #2312 (@SBrandeis)
  • add desc to tqdm in Dataset.map() #2374 (@bhavitvyamalik)
  • Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)
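An env-var cap like this is typically read once with a fallback default. A hedged sketch of how such an override could be consumed (the variable name is from the changelog; the parsing logic is illustrative, not the library's exact code):

```python
import os

def max_in_memory_dataset_size(default=0):
    """Illustrative: read an in-memory size cap (in bytes) from the
    environment, falling back to a default when unset."""
    raw = os.environ.get("HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES")
    return int(raw) if raw is not None else default

os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = "250000000"
print(max_in_memory_dataset_size())  # → 250000000
```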

Metric Changes

  • New: CUAD metrics #2273 (@bhavitvyamalik)
  • New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
  • Update: CER - Docs, CER above 1 #2342 (@borisdayma)

General improvements and bug fixes

  • Update black #2265 (@lhoestq)
  • Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
  • Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
  • Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
  • Fix query table with iterable #2269 (@lhoestq)
  • Perform minor refactoring: use config #2253 (@albertvillanova)
  • Update format, fingerprint and indices after add_item #2254 (@lhoestq)
  • Always update metadata in arrow schema #2274 (@lhoestq)
  • Make tests run faster #2266 (@lhoestq)
  • Fix metadata validation with config names #2286 (@lhoestq)
  • Fixed typo seperate->separate #2292 (@laksh9950)
  • Allow collaborators to self-assign issues #2289 (@albertvillanova)
  • Mapping in the distributed setting #2298 (@TevenLeScao)
  • Fix conda release #2309 (@lhoestq)
  • Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
  • Set default name in init_dynamic_modules #2320 (@albertvillanova)
  • Fix duplicate keys #2333 (@lhoestq)
  • Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
  • Metadata validation #2107 (@theo-m)
  • Add Validation For README #2121 (@gchhablani)
  • Fix overflow issue in interpolation search #2336 (@mariosasko)
  • Datasets cli improvements #2315 (@mariosasko)
  • Add key type and duplicates verification with hashing #2245 (@NikhilBartwal)
  • More consistent copy logic #2340 (@mariosasko)
  • Update README validation rules #2353 (@gchhablani)
  • normalized TOCs and titles in data cards #2355 (@yjernite)
  • simplify faiss index save #2351 (@Guitaricet)
  • Allow "other-X" in licenses #2368 (@gchhablani)
  • Improve ReadInstruction logic and update docs #2261 (@mariosasko)
  • Disallow duplicate keys in yaml tags #2379 (@lhoestq)
  • maintain YAML structure reading from README #2380 (@bhavitvyamalik)
  • add dataset card title #2381 (@bhavitvyamalik)
  • Add tests for dataset cards #2348 (@gchhablani)
  • Improve example in rounding docs #2383 (@mariosasko)
  • Paperswithcode dataset mapping #2404 (@julien-c)
  • Free datasets with cache file in temp dir on exit #2403 (@mariosasko)

Experimental and work in progress: Format a dataset for specific tasks

  • Task formatting for text classification & question answering #2255 (@SBrandeis)
  • Add check for task templates on dataset load #2390 (@lewtun)
  • Add args description to DatasetInfo #2384 (@lewtun)
  • Improve task api code quality #2376 (@mariosasko)
Apr 30, 2021

Fix memory issue: don't copy record batches in memory during a table deepcopy #2291 (@lhoestq). This affected methods like concatenate_datasets, multiprocessed map and load_from_disk.

Breaking change:

  • When using Dataset.map with the input_columns parameter, the resulting dataset only has the columns from input_columns plus the columns added by the map function. The other columns are discarded.
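A minimal sketch of the new semantics on a plain dict-of-columns (`map_with_input_columns` is a hypothetical stand-in for Dataset.map, just to show which columns survive):

```python
# Sketch of the new Dataset.map(input_columns=...) behavior: only the
# input columns plus the columns returned by the function are kept.
def map_with_input_columns(columns, fn, input_columns):
    kept = {k: columns[k] for k in input_columns}
    n = len(next(iter(kept.values())))
    new_cols = {}
    for i in range(n):
        out = fn({k: kept[k][i] for k in input_columns})
        for k, v in out.items():
            new_cols.setdefault(k, []).append(v)
    return {**kept, **new_cols}  # other original columns are discarded

data = {"text": ["ab", "cde"], "label": [0, 1], "id": [7, 8]}
result = map_with_input_columns(data, lambda ex: {"length": len(ex["text"])}, ["text"])
print(sorted(result))    # → ['length', 'text']  ("label" and "id" are dropped)
print(result["length"])  # → [2, 3]
```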