1.13.0 — Datasets — releases.sh

Dataset changes

New: CaSiNo #2867 (@kushalchawla)
New: Mostly Basic Python Problems #2893 (@lvwerra)
New: OpenAI's HumanEval #2897 (@lvwerra)
New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
New: SEDE #2942 (@Hazoom)
New: Jigsaw Unintended Bias #2935 (@Iwontbecreative)
New: AMI #2853 (@cahya-wirawan)
New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
New: KanHope #2985 (@adeepH)
New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
New: SwedMedNER #2940 (@bwang482)
New: SberQuAD #3039 (@Alenush)
New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
New: Greek Legal Code #2966 (@christospi)
New: Story Cloze Test #3067 (@zaidalyafeai)
Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
Update: TriviaQA - add web and wiki config #2949 (@shirte)
Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
Update: Biosses - fix column names #3054 (@bwang482)
Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)

Metric changes

Update: meteor - update from nltk update #2946 (@lhoestq)
Update: accuracy, f1, glue, indic-glue, pearsonr, precision, recall, super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
Fix: meteor - fix metric for nltk version >= 3.6.4 #3056 (@albertvillanova)

Dataset features

Use with TensorFlow:
- Adding to_tf_dataset method #2731 #2931 #2951 #2974 (@Rocketknight1)
Better support for ZIP files:
- Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
- Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
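To illustrate what globbing inside ZIP archives means, here is a minimal standard-library sketch. It is not the library's implementation. It builds two small in-memory ZIP archives, selects the members whose names match a glob pattern, and reads them as CSV. The helper names `make_zip` and `read_csv_rows_from_zips` and the file names are hypothetical.

```python
import csv
import fnmatch
import io
import zipfile


def make_zip(files):
    """Build an in-memory ZIP archive from a {member_name: text} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, text in files.items():
            archive.writestr(name, text)
    buffer.seek(0)
    return buffer


def read_csv_rows_from_zips(archives, pattern="*.csv"):
    """Return rows (as dicts) from every archive member matching the glob pattern."""
    rows = []
    for source in archives:
        with zipfile.ZipFile(source) as archive:
            for name in sorted(archive.namelist()):
                if not fnmatch.fnmatch(name, pattern):
                    continue  # skip members that are not data files
                with archive.open(name) as raw:
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    rows.extend(csv.DictReader(text))
    return rows


first = make_zip({"train_0.csv": "text,label\nhello,1\n", "README.txt": "not data"})
second = make_zip({"train_1.csv": "text,label\nworld,0\n"})
rows = read_csv_rows_from_zips([first, second])
```

Here the non-CSV member is ignored and the rows from both archives are collected into one list.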
Streaming improvements:
- Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
- Add remove_columns to IterableDataset #3030 (@cccntu)
- All the above ZIP features also work in streaming mode
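Conceptually, removing columns from a streamed dataset happens lazily, one example at a time, as the stream is consumed. The generator below is a plain-Python sketch of that idea. `drop_columns` and `stream` are hypothetical stand-ins, not the `IterableDataset.remove_columns` implementation.

```python
from typing import Dict, Iterable, Iterator, List


def drop_columns(examples: Iterable[Dict], column_names: List[str]) -> Iterator[Dict]:
    """Lazily yield each example without the given columns."""
    to_drop = set(column_names)
    for example in examples:
        yield {key: value for key, value in example.items() if key not in to_drop}


def stream():
    """Stand-in for a streamed dataset: examples are produced on demand."""
    yield {"id": 0, "text": "hello", "meta": "a"}
    yield {"id": 1, "text": "world", "meta": "b"}


# Nothing is processed until the generator is consumed.
cleaned = list(drop_columns(stream(), ["meta"]))
```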
New utilities:
- Add get_dataset_split_names() to get a dataset config's split names #2906 (@severo)
Replace script_version with revision #2933 (@albertvillanova)
- The script_version parameter in load_dataset is now deprecated, in favor of revision
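A renamed parameter is commonly kept for a while as a deprecated alias that warns and forwards its value to the new name. The sketch below shows that general pattern with a hypothetical `load_resource` function. The sentinel default, warning category, and message are assumptions for illustration, not the exact behavior of load_dataset.

```python
import warnings


def load_resource(path, revision=None, script_version="deprecated"):
    """Hypothetical loader that still accepts the old keyword but prefers `revision`."""
    if script_version != "deprecated":
        # The old keyword was passed explicitly: warn and forward its value.
        warnings.warn(
            "'script_version' is deprecated, use 'revision' instead.",
            FutureWarning,
            stacklevel=2,
        )
        revision = script_version
    return {"path": path, "revision": revision}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = load_resource("my_dataset", script_version="1.0.0")

new_style = load_resource("my_dataset", revision="1.0.0")
```

Both calls resolve to the same revision. Only the old-style call emits a warning.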
Experimental - Create Audio feature type #2324 (@albertvillanova):
- It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed

Dataset cards

Add arXiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)

Documentation

Add tutorial for no-code dataset upload #2925 (@stevhliu)

General improvements and bug fixes

Fix filter leaking #3019 (@lhoestq)
- calling filter several times in a row was not returning the right results in 1.12.0 and 1.12.1
Update BibTeX entry #2928 (@albertvillanova)
Fix exception chaining #2911 (@albertvillanova)
Add regression test for null Sequence #2929 (@albertvillanova)
Don't use old, incompatible cache for the new filter #2947 (@lhoestq)
Fix fn kwargs in filter #2950 (@lhoestq)
Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
Fix missing conda deps #2952 (@lhoestq)
Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
Support pandas 1.3 new read_csv parameters #2960 (@SBrandeis)
Fix CI doc build #2961 (@albertvillanova)
Run tests in parallel #2954 (@albertvillanova)
Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
Take namespace into account in caching #2938 (@lhoestq)
Make Dataset.map accept list of np.array #2990 (@albertvillanova)
Fix loading compressed CSV without streaming #2994 (@albertvillanova)
Fix json loader when conversion not implemented #3000 (@lhoestq)
Remove all query parameters when extracting protocol #2996 (@albertvillanova)
Correct a typo #3007 (@Yann21)
Fix Windows test suite #3025 (@albertvillanova)
Remove unused parameter in xdirname #3017 (@albertvillanova)
Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
Fix typo #3023 (@qqaatw)
Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
Use cache folder for lockfile #2887 (@Dref360)
Fix streaming: catch Timeout error #3050 (@borisdayma)
Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
Fix task reloading from cache #3059 (@lhoestq)
Fix test command after refac #3065 (@lhoestq)
Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
Update summary on PyPI beyond NLP #3062 (@thomwolf)
Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
feat: increase streaming retry config #3068 (@borisdayma)
Fix pathlib patches for streaming #3072 (@lhoestq)
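
The NaN check in #2936 rests on a floating-point rule: NaN never compares equal to itself. An equality-based comparison of two float sequences therefore fails whenever a NaN is present, even if the sequences are otherwise identical. A minimal plain-Python illustration follows. The helper `same_values` is hypothetical and is not the library's code.

```python
import math


def same_values(left, right):
    """Compare two sequences element-wise, treating NaN as equal to NaN."""
    if len(left) != len(right):
        return False
    for a, b in zip(left, right):
        both_nan = (
            isinstance(a, float) and isinstance(b, float)
            and math.isnan(a) and math.isnan(b)
        )
        if not both_nan and a != b:
            return False
    return True


nan = float("nan")
values = [1.0, nan, 3.0]
copy = [1.0, float("nan"), 3.0]  # a separate NaN object

naive_equal = values == copy                 # False, because nan != nan
nan_aware_equal = same_values(values, copy)  # True
```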

Breaking changes:

Due to the big refactoring in #2986, the prepare_module function no longer supports the return_resolved_file_path and return_associated_base_path parameters. As an alternative, you may use dataset_module_factory.