1.9.0

$ npx -y @buildinternet/releases show rel_wTRgvz_IHlZfTwp_J7q-P

Datasets Changes

  • New: C4 #2575 #2592 (@lhoestq)
  • New: mC4 #2576 (@lhoestq)
  • New: MasakhaNER #2465 (@dadelani)
  • New: Eduge #2492 (@enod)
  • Update: xor_tydi_qa - update version #2455 (@cccntu)
  • Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
  • Update: udpos - change features structure #2466 (@jerryIsHere)
  • Update: WebNLG - update checksums #2558 (@lhoestq)
  • Fix: climate_fever - adjust indexing for the labels #2464 (@drugilsberg)
  • Fix: proto_qa - fix download link #2463 (@mariosasko)
  • Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
  • Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
  • Fix: code_search_net - fix keys #2555 (@lhoestq)
  • Fix: discofuse - fix link cc #2541 (@VictorSanh)
  • Fix: fever - fix keys #2557 (@lhoestq)

Datasets Features

  • Dataset Streaming #2375 #2582 (@lhoestq)
    • Download and process your data quickly, on the fly, while iterating over your dataset
    • Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
  • JAX integration #2502 (@lhoestq)
  • Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
  • Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
  • Set configurable downloaded datasets path #2488 (@albertvillanova)
  • Set configurable extracted datasets path #2487 (@albertvillanova)
  • Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
  • Add interleave_datasets for map-style datasets #2568 (@lhoestq)
  • Add load_dataset_builder #2500 (@mariosasko)
  • Support Zstandard compressed files #2578 (@albertvillanova)

Task templates

  • Add task templates for tydiqa and xquad #2518 (@lewtun)
  • Insert text classification template for Emotion dataset #2521 (@lewtun)
  • Add summarization template #2529 (@lewtun)
  • Add task template for automatic speech recognition #2533 (@lewtun)
  • Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
  • Inject templates for ASR datasets #2565 (@lewtun)

General improvements and bug fixes

  • Allow using tqdm>=4.50.0 #2482 (@lhoestq)
  • Use gc.collect only when needed to avoid slowdowns #2483 (@lhoestq)
  • Allow latest pyarrow version #2490 (@albertvillanova)
  • Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
  • Add Zenodo metadata file with license #2501 (@albertvillanova)
  • Add tensorflow-macos support #2493 (@slayerjain)
  • Keep original features order #2453 (@albertvillanova)
  • Add course banner #2506 (@sgugger)
  • Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
  • Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
  • Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
  • Improve performance of pandas arrow extractor #2519 (@albertvillanova)
  • Fix fingerprint when moving cache dir #2509 (@lhoestq)
  • Replace bad n>1M size tag #2527 (@lhoestq)
  • Fix dev version #2531 (@lhoestq)
  • Sync with transformers disabling NOTSET #2534 (@albertvillanova)
  • Fix logging levels #2544 (@albertvillanova)
  • Add support for Split.ALL #2259 (@mariosasko)
  • Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
  • Make numpy arrow extractor faster #2505 (@lhoestq)
  • Fix Dataset.map when num_proc > number of rows #2566 (@connor-mccarthy)
  • Add ASR task and new languages to resources #2567 (@lewtun)
  • Filter expected warning log from transformers #2571 (@albertvillanova)
  • Fix BibTeX entry #2579 (@albertvillanova)
  • Fix Counter import #2580 (@albertvillanova)
  • Add aiohttp to tests extras require #2587 (@albertvillanova)
  • Add language tags #2590 (@lewtun)
  • Support pandas 1.3.0 read_csv #2593 (@lhoestq)

Dataset cards

  • Updated Dataset Description #2420 (@binny-mathew)
  • Update DatasetMetadata and ReadMe #2436 (@gchhablani)
  • CRD3 dataset card #2515 (@wilsonyhlee)
  • Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
  • wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)

Docs

  • Fix typo: load_dataset, not load_datasets #2479 (@julien-c)
  • Fix docs custom stable version #2477 (@albertvillanova)
  • Improve Features docs #2535 (@albertvillanova)
  • Update README.md #2414 (@cryoff)
  • Fix FileSystems documentation #2551 (@connor-mccarthy)
  • Minor fix in loading metrics docs #2562 (@albertvillanova)
  • Minor fix docs format for bertscore #2570 (@albertvillanova)
  • Add streaming in load a dataset docs #2574 (@lhoestq)

Fetched April 7, 2026