Hugging Face Datasets: Release Notes
Dec 21, 2021

Dataset Changes

Dataset Features

Dataset cards

Dataset Tasks

Metric Changes

Docs

Additional improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/1.16.1...1.17.0

Nov 26, 2021

Bug fixes

Datasets Changes

Datasets Features

Dataset Cards

Metrics Changes

Documentation

Additional improvements and bug fixes

Citation

Deprecations

Full Changelog: https://github.com/huggingface/datasets/compare/1.15.1...1.16.0

Nov 2, 2021

Dependencies

Dataset Changes

Dataset Features

Dataset Cards

Metrics Changes

General improvements and bug fixes

Oct 19, 2021

Dataset changes

  • Update: LexGLUE and MultiEURLEX README - update dataset cards #3075 (@iliaschalkidis)
  • Update: SUPERB - use Audio features #3101 (@anton-l)
  • Fix: Blog Authorship Corpus - fix URLs #3106 (@albertvillanova)

Dataset features

  • Add iter_archive #3066 (@lhoestq)

General improvements and bug fixes

  • Replace FSTimeoutError with parent TimeoutError #3100 (@albertvillanova)
  • Fix project description in PyPI #3103 (@albertvillanova)
  • Align tqdm control with cache control #3031 (@mariosasko)
  • Add paper BibTeX citation #3107 (@albertvillanova)
Oct 15, 2021

Dataset changes

  • Update: Adapt all audio datasets #3081 (@patrickvonplaten)

Bug fixes

  • Update BibTeX entry #3090 (@albertvillanova)
  • Use template column_mapping to transmit_format instead of template features #3088 (@mariosasko)
  • Fix Audio feature mp3 resampling #3096 (@albertvillanova)
Oct 14, 2021

Bug fixes

  • Fix error related to huggingface_hub timeout parameter #3082 (@albertvillanova)
  • Remove _resampler from Audio fields #3086 (@albertvillanova)

Bug fixes

  • Fix loading a metric with internal import #3077 (@albertvillanova)
Oct 13, 2021

Dataset changes

  • New: CaSiNo #2867 (@kushalchawla)
  • New: Mostly Basic Python Problems #2893 (@lvwerra)
  • New: OpenAI's HumanEval #2897 (@lvwerra)
  • New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
  • New: SEDE #2942 (@Hazoom)
  • New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
  • New: AMI #2853 (@cahya-wirawan)
  • New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
  • New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
  • New: KanHope #2985 (@adeepH)
  • New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
  • New: SwedMedNER #2940 (@bwang482)
  • New: SberQuAD #3039 (@Alenush)
  • New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
  • New: Greek Legal Code #2966 (@christospi)
  • New: Story Cloze Test #3067 (@zaidalyafeai)
  • Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
  • Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
  • Update: TriviaQA - add web and wiki config #2949 (@shirte)
  • Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
  • Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
  • Update: Biosses - fix column names #3054 (@bwang482)
  • Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
  • Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
  • Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
  • Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
  • Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
  • Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)

Metric changes

  • Update: meteor - update from nltk update #2946 (@lhoestq)
  • Update: accuracy, f1, glue, indic-glue, pearsonr, precision, recall, super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
  • Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
  • Fix meteor metric for version >= 3.6.4 #3056 (@albertvillanova)

Dataset features

  • Use with TensorFlow:
    • Adding to_tf_dataset method #2731 #2931 #2951 #2974 (@Rocketknight1)
  • Better support for ZIP files:
    • Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
    • Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
  • Streaming improvements:
    • Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
    • Add remove_columns to IterableDataset #3030 (@cccntu)
    • All the above ZIP features also work in streaming mode
  • New utilities:
    • Add get_dataset_split_names() to get a dataset config's split names #2906 (@severo)
  • Replace script_version with revision #2933 (@albertvillanova)
    • The script_version parameter in load_dataset is now deprecated, in favor of revision
  • Experimental - Create Audio feature type #2324 (@albertvillanova):
    • It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
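The Audio feature's decode-on-access behavior can be illustrated with a stdlib-only sketch. The `LazyAudio` class below is hypothetical (not the library's implementation): it keeps the raw bytes and decodes samples only the first time they are accessed.

```python
import array
import io
import wave

class LazyAudio:
    """Hypothetical sketch: keep raw WAV bytes, decode samples only on access."""
    def __init__(self, wav_bytes):
        self._raw = wav_bytes
        self._samples = None  # nothing decoded yet

    @property
    def samples(self):
        if self._samples is None:  # decode on first access only
            with wave.open(io.BytesIO(self._raw), "rb") as w:
                frames = w.readframes(w.getnframes())
            self._samples = array.array("h", frames)  # 16-bit PCM samples
        return self._samples

# Build a tiny 16-bit mono WAV in memory for the demonstration.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)
    w.setframerate(16000)
    w.writeframes(array.array("h", [0, 1000, -1000, 0]).tobytes())

audio = LazyAudio(buf.getvalue())
print(list(audio.samples))  # decoded only here → [0, 1000, -1000, 0]
```

The point of the pattern is that constructing the example is cheap; the decoding cost is paid only for examples that are actually read.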

Dataset cards

  • Add arxiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)

Documentation

  • Add tutorial for no-code dataset upload #2925 (@stevhliu)

General improvements and bug fixes

  • Fix filter leaking #3019 (@lhoestq)
    • Calling filter several times in a row did not return the right results in 1.12.0 and 1.12.1
  • Update BibTeX entry #2928 (@albertvillanova)
  • Fix exception chaining #2911 (@albertvillanova)
  • Add regression test for null Sequence #2929 (@albertvillanova)
  • Don't use old, incompatible cache for the new filter #2947 (@lhoestq)
  • Fix fn kwargs in filter #2950 (@lhoestq)
  • Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
  • Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
  • Fix missing conda deps #2952 (@lhoestq)
  • Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
  • Support pandas 1.3 new read_csv parameters #2960 (@SBrandeis)
  • Fix CI doc build #2961 (@albertvillanova)
  • Run tests in parallel #2954 (@albertvillanova)
  • Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
  • Take namespace into account in caching #2938 (@lhoestq)
  • Make Dataset.map accept list of np.array #2990 (@albertvillanova)
  • Fix loading compressed CSV without streaming #2994 (@albertvillanova)
  • Fix json loader when conversion not implemented #3000 (@lhoestq)
  • Remove all query parameters when extracting protocol #2996 (@albertvillanova)
  • Correct a typo #3007 (@Yann21)
  • Fix Windows test suite #3025 (@albertvillanova)
  • Remove unused parameter in xdirname #3017 (@albertvillanova)
  • Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
  • Fix typo #3023 (@qqaatw)
  • Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
  • Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
  • Use cache folder for lockfile #2887 (@Dref360)
  • Fix streaming: catch Timeout error #3050 (@borisdayma)
  • Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
  • Fix task reloading from cache #3059 (@lhoestq)
  • Fix test command after refac #3065 (@lhoestq)
  • Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
  • Update summary on PyPi beyond NLP #3062 (@thomwolf)
  • Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
  • feat: increase streaming retry config #3068 (@borisdayma)
  • Fix pathlib patches for streaming #3072 (@lhoestq)

Breaking changes:

  • Due to the big refactoring in #2986, the prepare_module function no longer supports the return_resolved_file_path and return_associated_base_path parameters. As an alternative, use dataset_module_factory instead.
Sep 15, 2021

Bug fixes

  • Fix fsspec AbstractFileSystem access #2915 (@pierre-godard)
  • Fix unwanted tqdm bar when accessing examples #2920 (@lhoestq)
  • Fix conversion of multidim arrays in list to arrow #2922 (@lhoestq):
    • this fixes the ArrowInvalid: Can only convert 1-dimensional array values errors
Sep 13, 2021

New documentation

  • New documentation structure #2718 (@stevhliu):
    • New: Tutorials
    • New: How-to guides
    • New: Conceptual guides
    • Update: Reference

See the new documentation here!

Datasets changes

  • New: VIVOS dataset for Vietnamese ASR #2780 (@binh234)
  • New: The Pile books3 #2801 (@richarddwang)
  • New: The Pile stack exchange #2803 (@richarddwang)
  • New: The Pile openwebtext2 #2802 (@richarddwang)
  • New: Food-101 #2804 (@nateraw)
  • New: Beans #2809 (@nateraw)
  • New: cedr #2796 (@naumov-al)
  • New: cats_vs_dogs #2807 (@nateraw)
  • New: MultiEURLEX #2865 (@iliaschalkidis)
  • New: BIOSSES #2881 (@bwang482)
  • Update: TTC4900 - add download URL #2732 (@yavuzKomecoglu)
  • Update: Wikihow - Generate metadata JSON for wikihow dataset #2748 (@albertvillanova)
  • Update: lm1b - Generate metadata JSON #2752 (@albertvillanova)
  • Update: reclor - Generate metadata JSON #2753 (@albertvillanova)
  • Update: telugu_books - Generate metadata JSON #2754 (@albertvillanova)
  • Update: SUPERB - Add SD task #2661 (@albertvillanova)
  • Update: SUPERB - Add KS task #2783 (@anton-l)
  • Update: GooAQ - add train/val/test splits #2792 (@bhavitvyamalik)
  • Update: Openwebtext - update size #2857 (@lhoestq)
  • Update: timit_asr - make the dataset streamable #2835 (@lhoestq)
  • Fix: journalists_questions - fix key by recreating metadata JSON #2744 (@albertvillanova)
  • Fix: turkish_movie_sentiment - fix metadata JSON #2755 (@albertvillanova)
  • Fix: ubuntu_dialogs_corpus - fix metadata JSON #2756 (@albertvillanova)
  • Fix: CNN/DailyMail - typo #2791 (@omaralsayed)
  • Fix: linnaeus - fix url #2852 (@lhoestq)
  • Fix ToTTo - fix data URL #2864 (@albertvillanova)
  • Fix: wikicorpus - fix keys #2844 (@lhoestq)
  • Fix: COUNTER - fix bad file name #2894 (@albertvillanova)
  • Fix: DocRED - fix data URLs and metadata #2883 (@albertvillanova)

Datasets features

  • Load Dataset from the Hub (NO DATASET SCRIPT) #2662 (@lhoestq)
  • Preserve dtype for numpy/torch/tf/jax arrays #2361 (@bhavitvyamalik)
  • add multi-proc in to_json #2747 (@bhavitvyamalik)
  • Optimize Dataset.filter to only compute the indices to keep #2836 (@lhoestq)
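The filter optimization avoids copying the kept rows by computing only the indices to keep. A stdlib-only sketch of that idea on a plain dict-of-columns (`filter_indices` is a hypothetical helper, not the library's code):

```python
# Sketch of the idea behind the filter optimization: compute which row
# indices pass the predicate instead of materializing a new table.
data = {"text": ["good", "bad", "great", "poor"], "label": [1, 0, 1, 0]}

def filter_indices(columns, predicate):
    n = len(next(iter(columns.values())))
    rows = ({k: v[i] for k, v in columns.items()} for i in range(n))
    return [i for i, row in enumerate(rows) if predicate(row)]

keep = filter_indices(data, lambda row: row["label"] == 1)
print(keep)                             # → [0, 2]
print([data["text"][i] for i in keep])  # → ['good', 'great']
```

Keeping an indices mapping instead of copied data is what makes filter cheap: the underlying storage is untouched and rows are resolved through the mapping on access.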

Dataset streaming - better support for compression:

  • Fix streaming zip files #2798 (@albertvillanova)
  • Support streaming tar files #2800 (@albertvillanova)
  • Support streaming compressed files (gzip, bz2, lz4, xz, zst) #2786 (@albertvillanova)
  • Fix streaming zip files from canonical datasets #2805 (@albertvillanova)
  • Add url prefix convention for many compression formats #2822 (@lhoestq)
  • Support streaming datasets that use pathlib #2874 (@albertvillanova)
  • Extend support for streaming datasets that use pathlib.Path stem/suffix #2880 (@albertvillanova)
  • Extend support for streaming datasets that use pathlib.Path.glob #2876 (@albertvillanova)
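The URL prefix convention chains a compression protocol in front of the underlying URL, fsspec-style, e.g. `zip://train.csv::https://host/data.zip`. An illustrative parser for that chained form (assuming the `::`-separated layout; this is not the library's code):

```python
def split_chained_url(url):
    """Illustrative parser for fsspec-style chained URLs such as
    'zip://train.csv::https://host/data.zip' (not the library's actual code)."""
    parsed = []
    for link in url.split("::"):
        protocol, _, path = link.partition("://")
        # A link without '://' is treated as a plain local path.
        parsed.append((protocol if path else "file", path or link))
    return parsed

print(split_chained_url("zip://train.csv::https://host/data.zip"))
# → [('zip', 'train.csv'), ('https', 'host/data.zip')]
```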

Metrics changes

  • Update: BERTScore - Add support for fast tokenizer #2770 (@mariosasko)
  • Fix: Sacrebleu - Fix sacrebleu tokenizers #2739 #2778 #2779 (@albertvillanova)

Dataset cards

  • Updated dataset description of DaNE #2789 (@KennethEnevoldsen)
  • Update ELI5 README.md #2848 (@odellus)

General improvements and bug fixes

  • Update release instructions #2740 (@albertvillanova)
  • Raise ManualDownloadError when loading a dataset that requires previous manual download #2758 (@albertvillanova)
  • Allow PyArrow from source #2769 (@patrickvonplaten)
  • fix typo (ShuffingConfig -> ShufflingConfig) #2766 (@daleevans)
  • Fix typo in test_dataset_common #2790 (@nateraw)
  • Fix type hint for data_files #2793 (@albertvillanova)
  • Bump tqdm version #2814 (@mariosasko)
  • Use packaging to handle versions #2777 (@albertvillanova)
  • Tiny typo fixes of "fo" -> "of" #2815 (@aronszanto)
  • Rename The Pile subsets #2817 (@lhoestq)
  • Fix IndexError by ignoring empty RecordBatch #2834 (@lhoestq)
  • Fix defaults in cache_dir docstring in load.py #2824 (@mariosasko)
  • Fix extraction protocol inference from urls with params #2843 (@lhoestq)
  • Fix caching when moving script #2854 (@lhoestq)
  • Fix windows CI CondaError #2855 (@lhoestq)
  • fix: 🐛 remove URL's query string only if it's ?dl=1 #2856 (@severo)
  • Update column_names showed as :func: in exploring.rst #2851 (@ClementRomac)
  • Fix s3fs version in CI #2858 (@lhoestq)
  • Fix three typos in two files for documentation #2870 (@leny-mi)
  • Move checks from _map_single to map #2660 (@mariosasko)
  • fix regex to accept negative timezone #2847 (@jadermcs)
  • Prevent .map from using multiprocessing when loading from cache #2774 (@thomasw21)
  • Fix null sequence encoding #2900 (@lhoestq)
Jul 30, 2021

Datasets Changes

  • New: Add Russian SuperGLUE #2668 (@slowwavesleep)
  • New: Add Disfl-QA #2473 (@bhavitvyamalik)
  • New: Add TimeDial #2476 (@bhavitvyamalik)
  • Fix: Enumerate all ner_tags values in WNUT 17 dataset #2713 (@albertvillanova)
  • Fix: Update WikiANN data URL #2710 (@albertvillanova)
  • Fix: Update PAN-X data URL in XTREME dataset #2715 (@albertvillanova)
  • Fix: C4 - en subset by modifying dataset_info with correct validation infos #2723 (@thomasw21)

General improvements and bug fixes

  • fix: 🐛 change string format to allow copy/paste to work in bash #2694 (@severo)
  • Update BibTeX entry #2706 (@albertvillanova)
  • Print absolute local paths in load_dataset error messages #2684 (@mariosasko)
  • Add support for disable_progress_bar on Windows #2696 (@mariosasko)
  • Ignore empty batch when writing #2698 (@pcuenca)
  • Fix shuffle on IterableDataset that disables batching in case any functions were mapped #2717 (@amankhandelia)
  • fix: 🐛 fix two typos #2720 (@severo)
  • Docs details #2690 (@severo)
  • Deal with the bad check in test_load.py #2721 (@mariosasko)
  • Pass use_auth_token to request_etags #2725 (@albertvillanova)
  • Typo fix tokenize_exemple #2726 (@shabie)
  • Fix IndexError while loading Arabic Billion Words dataset #2729 (@albertvillanova)
  • Add missing parquet known extension #2733 (@lhoestq)
Jul 22, 2021

The error message indicating which dataset config name to load was not displayed:

  • Fix pick default config name message #2704 (@lhoestq)

Docstrings:

  • Fix download_mode docstrings #2701 (@albertvillanova)
  • Fix minimum tqdm version and import on Colab #2697 (@nateraw)
  • Fix OSCAR Esperanto #2693 (@lhoestq)
Jul 21, 2021

Datasets Features

  • Support remote data files #2616 (@albertvillanova). This allows passing URLs of remote data files to any dataset loader:
    load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file...]})
    
    This works for all these dataset loaders:
    • text
    • csv
    • json
    • parquet
    • pandas
  • Streaming from remote text/json/csv/parquet/pandas files: When you pass URLs to a dataset loader, you can enable streaming mode with streaming=True. Main contributions:
    • Streaming for the Pandas loader #2636 (@lhoestq)
    • Streaming for the CSV loader #2635 (@lhoestq)
    • Streaming for the Json loader #2608 (@albertvillanova) #2638 (@lhoestq)
  • Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
  • Delete extracted files when loading dataset #2631 (@albertvillanova)
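Streaming mode yields examples on the fly instead of downloading and parsing whole files up front. A conceptual stdlib sketch of that row-at-a-time behavior over several CSV sources (in-memory buffers stand in for remote URLs here; `stream_csv_rows` is illustrative, not the library's loader):

```python
import csv
import io
import itertools

def stream_csv_rows(files):
    """Conceptual sketch of streaming mode: yield rows one at a time
    from several CSV sources without materializing them in memory."""
    for f in files:
        yield from csv.DictReader(f)

# Two in-memory "files" stand in for remote URLs.
a = io.StringIO("text,label\nhello,1\nworld,0\n")
b = io.StringIO("text,label\nfoo,1\n")

# Only the rows actually consumed are ever read.
first_two = list(itertools.islice(stream_csv_rows([a, b]), 2))
print(first_two)
# → [{'text': 'hello', 'label': '1'}, {'text': 'world', 'label': '0'}]
```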

Datasets Changes

  • Fix: C4 - fix expected files list #2682 (@lhoestq)
  • Fix: SQuAD - fix misalignment #2586 (@albertvillanova)
  • Fix: omp - fix DuplicatedKeysError #2603 (@albertvillanova)
  • Fix: wi_locness - potential DuplicatedKeysError #2609 (@albertvillanova)
  • Fix: LibriSpeech - potential DuplicatedKeysError #2672 (@albertvillanova)
  • Fix: SQuAD - potential DuplicatedKeysError #2673 (@albertvillanova)
  • Fix: Blog Authorship Corpus - fix split sizes and text encoding #2685 (@albertvillanova)

Dataset Tasks

  • Add speech processing tasks #2620 (@lewtun)
  • Update ASR tags #2633 (@lewtun)
  • Inject ASR template for lj_speech dataset #2634 (@albertvillanova)
  • Add ASR task for SUPERB #2619 (@lewtun)
  • add image-classification task template #2632 (@nateraw)

Metrics Changes

  • New: wiki_split #2623 (@bhadreshpsavani)
  • Update: accuracy,f1,precision,recall - Support multilabel metrics #2589 (@albertvillanova)
  • Fix: sacrebleu - fix parameter name #2674 (@albertvillanova)

General improvements and bug fixes

  • Fix BibTeX entry #2594 (@albertvillanova)
  • Fix test_is_small_dataset #2588 (@albertvillanova)
  • Remove import of transformers #2602 (@albertvillanova)
  • Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) #2605 (@lhoestq)
  • Fix filter with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
  • Remove redundant prepare_module #2597 (@albertvillanova)
  • Create ExtractManager #2295 (@albertvillanova)
  • Return Python float instead of numpy.float64 in sklearn metrics #2612 (@lewtun)
  • Use ndarray.item instead of ndarray.tolist #2613 (@lewtun)
  • Convert numpy scalar to python float in Pearsonr output #2614 (@lhoestq)
  • Fix missing EOL issue in to_json for old versions of pandas #2617 (@lhoestq)
  • Use correct logger in metrics.py #2626 (@mariosasko)
  • Minor fix tests with Windows paths #2627 (@albertvillanova)
  • Use ETag of remote data files #2628 (@albertvillanova)
  • More consistent naming #2611 (@mariosasko)
  • Refactor patching to specific submodule #2639 (@albertvillanova)
  • Fix docstrings #2640 (@albertvillanova)
  • Fix anchor in README #2647 (@mariosasko)
  • Fix logging docstring #2652 (@mariosasko)
  • Allow dataset config kwargs to be None #2659 (@lhoestq)
  • Use prefix to allow exceed Windows MAX_PATH #2621 (@albertvillanova)
  • Use tqdm from tqdm_utils #2667 (@mariosasko)
  • Increase json reader block_size automatically #2676 (@lhoestq)
  • Parallelize ETag requests #2675 (@lhoestq)
  • Fix bad config ids that name cache directories #2686 (@lhoestq)
  • Minor documentation fix #2687 (@slowwavesleep)

Dataset Cards

  • Add missing WikiANN language tags #2610 (@albertvillanova)
  • feat: 🎸 add paperswithcode id for qasper dataset #2680 (@severo)

Docs

  • Update processing.rst with other export formats #2599 (@TevenLeScao)
Jul 5, 2021

Datasets Changes

  • New: C4 #2575 #2592 (@lhoestq)
  • New: mC4 #2576 (@lhoestq)
  • New: MasakhaNER #2465 (@dadelani)
  • New: Eduge #2492 (@enod)
  • Update: xor_tydi_qa - update version #2455 (@cccntu)
  • Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
  • Update: udpos - change features structure #2466 (@jerryIsHere)
  • Update: WebNLG - update checksums #2558 (@lhoestq)
  • Fix: climate fever - adjusting indexing for the labels. #2464 (@drugilsberg)
  • Fix: proto_qa - fix download link #2463 (@mariosasko)
  • Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
  • Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
  • Fix: code_search_net - fix keys #2555 (@lhoestq)
  • Fix: discofuse - fix link cc #2541 (@VictorSanh)
  • Fix: fever - fix keys #2557 (@lhoestq)

Datasets Features

  • Dataset Streaming #2375 #2582 (@lhoestq)
    • Fast download and process your data on-the-fly when iterating over your dataset
    • Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
  • JAX integration #2502 (@lhoestq)
  • Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
  • Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
  • Set configurable downloaded datasets path #2488 (@albertvillanova)
  • Set configurable extracted datasets path #2487 (@albertvillanova)
  • Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
  • Add interleave_datasets for map-style datasets #2568 (@lhoestq)
  • Add load_dataset_builder #2500 (@mariosasko)
  • Support Zstandard compressed files #2578 (@albertvillanova)
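interleave_datasets alternates examples from its input datasets. A stdlib sketch of the round-robin idea, stopping at the first exhausted source (one simple strategy; the `interleave` function here is illustrative, not the library's implementation):

```python
from itertools import cycle

def interleave(*iterables):
    """Round-robin sketch of interleaving: alternate examples from each
    source, stopping when the first source runs out."""
    iterators = [iter(it) for it in iterables]
    for it in cycle(iterators):
        try:
            yield next(it)
        except StopIteration:
            return  # first exhausted source ends the stream

print(list(interleave([1, 3, 5], [2, 4, 6])))  # → [1, 2, 3, 4, 5, 6]
```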

Task templates

  • Add task templates for tydiqa and xquad #2518 (@lewtun)
  • Insert text classification template for Emotion dataset #2521 (@lewtun)
  • Add summarization template #2529 (@lewtun)
  • Add task template for automatic speech recognition #2533 (@lewtun)
  • Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
  • Inject templates for ASR datasets #2565 (@lewtun)

General improvements and bug fixes

  • Allow to use tqdm>=4.50.0 #2482 (@lhoestq)
  • Use gc.collect only when needed to avoid slow downs #2483 (@lhoestq)
  • Allow latest pyarrow version #2490 (@albertvillanova)
  • Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
  • Add Zenodo metadata file with license #2501 (@albertvillanova)
  • add tensorflow-macos support #2493 (@slayerjain)
  • Keep original features order #2453 (@albertvillanova)
  • Add course banner #2506 (@sgugger)
  • Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
  • Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
  • Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
  • Improve performance of pandas arrow extractor #2519 (@albertvillanova)
  • Fix fingerprint when moving cache dir #2509 (@lhoestq)
  • Replace bad n>1M size tag #2527 (@lhoestq)
  • Fix dev version #2531 (@lhoestq)
  • Sync with transformers disabling NOTSET #2534 (@albertvillanova)
  • Fix logging levels #2544 (@albertvillanova)
  • Add support for Split.ALL #2259 (@mariosasko)
  • Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
  • Make numpy arrow extractor faster #2505 (@lhoestq)
  • fix Dataset.map when num_procs > num rows #2566 (@connor-mccarthy)
  • Add ASR task and new languages to resources #2567 (@lewtun)
  • Filter expected warning log from transformers #2571 (@albertvillanova)
  • Fix BibTeX entry #2579 (@albertvillanova)
  • Fix Counter import #2580 (@albertvillanova)
  • Add aiohttp to tests extras require #2587 (@albertvillanova)
  • Add language tags #2590 (@lewtun)
  • Support pandas 1.3.0 read_csv #2593 (@lhoestq)

Dataset cards

  • Updated Dataset Description #2420 (@binny-mathew)
  • Update DatasetMetadata and ReadMe #2436 (@gchhablani)
  • CRD3 dataset card #2515 (@wilsonyhlee)
  • Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
  • wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)

Docs

  • no s at load_datasets #2479 (@julien-c)
  • Fix docs custom stable version #2477 (@albertvillanova)
  • Improve Features docs #2535 (@albertvillanova)
  • Update README.md #2414 (@cryoff)
  • Fix FileSystems documentation #2551 (@connor-mccarthy)
  • Minor fix in loading metrics docs #2562 (@albertvillanova)
  • Minor fix docs format for bertscore #2570 (@albertvillanova)
  • Add streaming in load a dataset docs #2574 (@lhoestq)
Jun 8, 2021

Datasets Changes

  • New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
  • New: KLUE benchmark #2416 (@jungwhank)
  • New: HendrycksTest #2370 (@andyzoujm)
  • Update: xor_tydi_qa - update url to v1.1 #2449 (@cccntu)
  • Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
  • Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs for #2445 (@lewtun)
  • Fix: flores - fix download link #2448 (@mariosasko)

Datasets Features

  • Add desc parameter in map for DatasetDict object #2423 (@bhavitvyamalik)
  • Support sliced list arrays in cast #2461 (@lhoestq)
    • Dataset.cast can now change the feature types of Sequence fields
  • Revert default in-memory for small datasets #2460 (@albertvillanova). Breaking:
    • datasets smaller than IN_MEMORY_MAX_SIZE (previously 250MB) used to be loaded in memory
    • this threshold is now zero: by default, datasets are loaded from disk with memory mapping and are not copied into memory
    • users can still set keep_in_memory=True when loading a dataset to load it in memory
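Memory mapping is what makes the zero-copy default cheap: the OS pages the file in on demand rather than reading it all into process memory. A stdlib illustration with mmap (the file layout here is arbitrary, just for demonstration):

```python
import mmap
import os
import tempfile

# Write a 1 MB file, then map it instead of reading it into memory.
path = os.path.join(tempfile.mkdtemp(), "table.bin")
with open(path, "wb") as f:
    f.write(b"x" * 1_000_000)

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        # Slicing touches only the pages needed; no full copy is made.
        size, head = len(mm), mm[:2]

print(size, head)  # → 1000000 b'xx'
```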

Datasets Cards

  • adds license information for DailyDialog. #2419 (@aditya2211)
  • add english language tags for ~100 datasets #2442 (@VictorSanh)
  • Add copyright info to MLSUM dataset #2427 (@PhilipMay)
  • Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
  • Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)

General improvements and bug fixes

  • Add DOI badge to README #2411 (@albertvillanova)
  • Make datasets PEP-561 compliant #2417 (@SBrandeis)
  • Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
  • Fix CI six installation on linux #2432 (@lhoestq)
  • Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
  • Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
  • doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
  • add utf-8 while reading README #2418 (@bhavitvyamalik)
  • Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
  • Rename config and environment variable for in memory max size #2454 (@albertvillanova)
  • Add version-specific BibTeX #2430 (@albertvillanova)
  • Fix cross-reference typos in documentation #2456 (@albertvillanova)
  • Better error message when using the wrong load_from_disk #2437 (@lhoestq)

Experimental and work in progress: Format a dataset for specific tasks

  • Update text classification template labels in DatasetInfo post_init #2392 (@lewtun)
  • Insert task templates for text classification #2389 (@lewtun)
  • Rename QuestionAnswering template to QuestionAnsweringExtractive #2429 (@lewtun)
  • Insert Extractive QA templates for SQuAD-like datasets #2435 (@lewtun)
May 27, 2021

Dataset Changes

  • New: NLU evaluation data #2238 (@dkajtoch)
  • New: Add SLR32, SLR52, SLR53 to OpenSLR #2241, #2311 (@cahya-wirawan)
  • New: Bbaw egyptian #2290 (@phiwi)
  • New: GooAQ #2260 (@bhavitvyamalik)
  • New: SubjQA #2302 (@lewtun)
  • New: Ascent KB #2341, #2349 (@phongnt570)
  • New: HLGD #2325 (@tingofurro)
  • New: Qasper #2346 (@cceyda)
  • New: ConvQuestions benchmark #2372 (@PhilippChr)
  • Update: Wikihow - Clarify how to load wikihow #2240 (@albertvillanova)
  • Update: multi_woz_v22 - update checksum #2281 (@lhoestq)
  • Update: OSCAR - Set encoding in OSCAR dataset #2321 (@albertvillanova)
  • Update: XTREME - Enable auto-download for PAN-X / Wikiann domain in XTREME #2326 (@lewtun)
  • Update: GEM - update the DART file checksums #2334 (@yjernite)
  • Update: web_science - fixed download link #2338 (@bhavitvyamalik)
  • Update: SNLI, MNLI- README updated for SNLI, MNLI #2364 (@bhavitvyamalik)
  • Update: conll2003 - correct labels #2369 (@philschmid)
  • Update: offenseval_dravidian - update citations #2385 (@adeepH)
  • Update: ai2_arc - Add dataset tags #2405 (@OyvindTafjord)
  • Fix: newsph_nli - test data added, dataset_infos updated #2263 (@bhavitvyamalik)
  • Fix: hyperpartisan news detection - Remove getchildren #2367 (@ghomasHudson)
  • Fix: indic_glue - Fix number of classes in indic_glue sna.bn dataset #2397 (@albertvillanova)
  • Fix: head_qa - Fix keys #2408 (@lhoestq)

Dataset Features

  • Implement Dataset add_item #1870 (@albertvillanova)
  • Implement Dataset add_column #2145 (@albertvillanova)
  • Implement Dataset to JSON #2248, #2352 (@albertvillanova)
  • Add rename_columns method #2312 (@SBrandeis)
  • add desc to tqdm in Dataset.map() #2374 (@bhavitvyamalik)
  • Add env variable HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2399, #2409 (@albertvillanova)
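An env-var cap like this is typically read once with a fallback default. A hedged sketch of how such an override could be consumed (the variable name is from the changelog; the parsing logic is illustrative, not the library's exact code):

```python
import os

def max_in_memory_dataset_size(default=0):
    """Illustrative: read an in-memory size cap (in bytes) from the
    environment, falling back to a default when unset."""
    raw = os.environ.get("HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES")
    return int(raw) if raw is not None else default

os.environ["HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES"] = "250000000"
print(max_in_memory_dataset_size())  # → 250000000
```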

Metric Changes

  • New: CUAD metrics #2273 (@bhavitvyamalik)
  • New: Matthews/Pearson/Spearman correlation metrics #2328 (@lhoestq)
  • Update: CER - Docs, CER above 1 #2342 (@borisdayma)

General improvements and bug fixes

  • Update black #2265 (@lhoestq)
  • Fix incorrect update_metadata_with_features calls in ArrowDataset #2258 (@mariosasko)
  • Faster map w/ input_columns & faster slicing w/ Iterable keys #2246 (@norabelrose)
  • Don't use pyarrow 4.0.0 since it segfaults when casting a sliced ListArray of integers #2268 (@lhoestq)
  • Fix query table with iterable #2269 (@lhoestq)
  • Perform minor refactoring: use config #2253 (@albertvillanova)
  • Update format, fingerprint and indices after add_item #2254 (@lhoestq)
  • Always update metadata in arrow schema #2274 (@lhoestq)
  • Make tests run faster #2266 (@lhoestq)
  • Fix metadata validation with config names #2286 (@lhoestq)
  • Fixed typo seperate->separate #2292 (@laksh9950)
  • Allow collaborators to self-assign issues #2289 (@albertvillanova)
  • Mapping in the distributed setting #2298 (@TevenLeScao)
  • Fix conda release #2309 (@lhoestq)
  • Fix incorrect version specification for the pyarrow package #2317 (@cemilcengiz)
  • Set default name in init_dynamic_modules #2320 (@albertvillanova)
  • Fix duplicate keys #2333 (@lhoestq)
  • Add note about indices mapping in save_to_disk docstring #2332 (@lhoestq)
  • Metadata validation #2107 (@theo-m)
  • Add Validation For README #2121 (@gchhablani)
  • Fix overflow issue in interpolation search #2336 (@mariosasko)
  • Datasets cli improvements #2315 (@mariosasko)
  • Add key type and duplicates verification with hashing #2245 (@NikhilBartwal)
  • More consistent copy logic #2340 (@mariosasko)
  • Update README validation rules #2353 (@gchhablani)
  • normalized TOCs and titles in data cards #2355 (@yjernite)
  • simplify faiss index save #2351 (@Guitaricet)
  • Allow "other-X" in licenses #2368 (@gchhablani)
  • Improve ReadInstruction logic and update docs #2261 (@mariosasko)
  • Disallow duplicate keys in yaml tags #2379 (@lhoestq)
  • maintain YAML structure reading from README #2380 (@bhavitvyamalik)
  • add dataset card title #2381 (@bhavitvyamalik)
  • Add tests for dataset cards #2348 (@gchhablani)
  • Improve example in rounding docs #2383 (@mariosasko)
  • Paperswithcode dataset mapping #2404 (@julien-c)
  • Free datasets with cache file in temp dir on exit #2403 (@mariosasko)

Experimental and work in progress: Format a dataset for specific tasks

  • Task formatting for text classification & question answering #2255 (@SBrandeis)
  • Add check for task templates on dataset load #2390 (@lewtun)
  • Add args description to DatasetInfo #2384 (@lewtun)
  • Improve task api code quality #2376 (@mariosasko)
Apr 30, 2021

Fix memory issue: don't copy record batches in memory during a table deepcopy #2291 (@lhoestq). This affected methods like concatenate_datasets, multiprocessed map and load_from_disk.

Breaking change:

  • When using Dataset.map with the input_columns parameter, the resulting dataset only has the columns from input_columns plus the columns added by the map function. The other columns are discarded.
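A minimal sketch of the new semantics on a plain dict-of-columns (`map_with_input_columns` is a hypothetical stand-in for Dataset.map, just to show which columns survive):

```python
# Sketch of the new Dataset.map(input_columns=...) behavior: only the
# input columns plus the columns returned by the function are kept.
def map_with_input_columns(columns, fn, input_columns):
    kept = {k: columns[k] for k in input_columns}
    n = len(next(iter(kept.values())))
    new_cols = {}
    for i in range(n):
        out = fn({k: kept[k][i] for k in input_columns})
        for k, v in out.items():
            new_cols.setdefault(k, []).append(v)
    return {**kept, **new_cols}  # other original columns are discarded

data = {"text": ["ab", "cde"], "label": [0, 1], "id": [7, 8]}
result = map_with_input_columns(data, lambda ex: {"length": len(ex["text"])}, ["text"])
print(sorted(result))    # → ['length', 'text']  ("label" and "id" are dropped)
print(result["length"])  # → [2, 3]
```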