1.8.0

Datasets Changes

  • New: Microsoft CodeXGlue Datasets #2357 (@madlag @ncoop57)
  • New: KLUE benchmark #2416 (@jungwhank)
  • New: HendrycksTest #2370 (@andyzoujm)
  • Update: xor_tydi_qa - update URL to v1.1 #2449 (@cccntu)
  • Fix: adversarial_qa - DuplicatedKeysError #2433 (@mariosasko)
  • Fix: bn_hate_speech and covid_tweets_japanese - fix broken URLs #2445 (@lewtun)
  • Fix: flores - fix download link #2448 (@mariosasko)

Datasets Features

  • Add desc parameter in map for DatasetDict object #2423 (@bhavitvyamalik)
  • Support sliced list arrays in cast #2461 (@lhoestq)
    • Dataset.cast can now change the feature types of Sequence fields
  • Breaking: Revert default in-memory for small datasets #2460 (@albertvillanova)
    • IN_MEMORY_MAX_SIZE previously defaulted to 250MB
    • it now defaults to zero: datasets are loaded from disk with memory mapping and are not copied in memory
    • users can still set keep_in_memory=True when loading a dataset to load it in memory
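
The breaking change in #2460 can be pictured with a small pure-Python model of the loading decision. This is only a sketch under assumptions, not the library's code: the function should_copy_in_memory, its parameters, and the two constant names are hypothetical, and 250MB is taken here as 250 * 2**20 bytes. Only the 250MB and zero defaults and the keep_in_memory option come from the notes above.

```python
from typing import Optional

# Hypothetical constants for illustration; only the values come from the release notes.
OLD_IN_MEMORY_MAX_SIZE = 250 * 2**20  # previous default (250MB, assumed to mean 250 * 2**20 bytes)
NEW_IN_MEMORY_MAX_SIZE = 0            # new default: zero


def should_copy_in_memory(
    dataset_size: int,
    max_size: int,
    keep_in_memory: Optional[bool] = None,
) -> bool:
    """Return True if a dataset would be copied in memory, False if memory-mapped from disk.

    Hypothetical helper that models the behavior described in the notes.
    """
    # An explicit keep_in_memory choice takes priority over the size rule.
    if keep_in_memory is not None:
        return keep_in_memory
    # A max size of zero disables the size-based rule, so nothing is copied by default.
    return max_size > 0 and dataset_size < max_size


small_dataset = 10 * 2**20  # a 10MB dataset

print(should_copy_in_memory(small_dataset, OLD_IN_MEMORY_MAX_SIZE))  # old default
print(should_copy_in_memory(small_dataset, NEW_IN_MEMORY_MAX_SIZE))  # new default
print(should_copy_in_memory(small_dataset, NEW_IN_MEMORY_MAX_SIZE, keep_in_memory=True))
```

In this model, a small dataset that was copied in memory under the old default is memory-mapped under the new one unless keep_in_memory=True is passed explicitly.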

Datasets Cards

  • Add license information for DailyDialog #2419 (@aditya2211)
  • Add English language tags for ~100 datasets #2442 (@VictorSanh)
  • Add copyright info to MLSUM dataset #2427 (@PhilipMay)
  • Add copyright info for wiki_lingua dataset #2428 (@PhilipMay)
  • Mention that there are no answers in adversarial_qa test set #2451 (@lhoestq)

General improvements and bug fixes

  • Add DOI badge to README #2411 (@albertvillanova)
  • Make datasets PEP-561 compliant #2417 (@SBrandeis)
  • Fix save_to_disk nested features order in dataset_info.json #2422 (@lhoestq)
  • Fix CI six installation on linux #2432 (@lhoestq)
  • Fix Docstring Mistake: dataset vs. metric #2425 (@PhilipMay)
  • Fix NQ features loading: reorder fields of features to match nested fields order in arrow data #2438 (@lhoestq)
  • doc: fix typo HF_MAX_IN_MEMORY_DATASET_SIZE_IN_BYTES #2421 (@borisdayma)
  • Add UTF-8 encoding when reading README #2418 (@bhavitvyamalik)
  • Better error message when trying to access elements of a DatasetDict without specifying the split #2439 (@lhoestq)
  • Rename config and environment variable for in memory max size #2454 (@albertvillanova)
  • Add version-specific BibTeX #2430 (@albertvillanova)
  • Fix cross-reference typos in documentation #2456 (@albertvillanova)
  • Better error message when using the wrong load_from_disk #2437 (@lhoestq)

Experimental and work in progress: Format a dataset for specific tasks

  • Update text classification template labels in DatasetInfo post_init #2392 (@lewtun)
  • Insert task templates for text classification #2389 (@lewtun)
  • Rename QuestionAnswering template to QuestionAnsweringExtractive #2429 (@lewtun)
  • Insert Extractive QA templates for SQuAD-like datasets #2435 (@lewtun)

Fetched April 7, 2026