Datasets 1.9.0

Datasets Changes

New: C4 #2575 #2592 (@lhoestq)
New: mC4 #2576 (@lhoestq)
New: MasakhaNER #2465 (@dadelani)
New: Eduge #2492 (@enod)
Update: xor_tydi_qa - update version #2455 (@cccntu)
Update: kilt-TriviaQA - original answers #2410 (@PaulLerner)
Update: udpos - change features structure #2466 (@jerryIsHere)
Update: WebNLG - update checksums #2558 (@lhoestq)
Fix: climate fever - adjust indexing for the labels #2464 (@drugilsberg)
Fix: proto_qa - fix download link #2463 (@mariosasko)
Fix: ProductReviews - fix label parsing #2530 (@yavuzKomecoglu)
Fix: DROP - fix DuplicatedKeysError #2545 (@albertvillanova)
Fix: code_search_net - fix keys #2555 (@lhoestq)
Fix: discofuse - fix link cc #2541 (@VictorSanh)
Fix: fever - fix keys #2557 (@lhoestq)

Datasets Features

Dataset Streaming #2375 #2582 (@lhoestq)
- Download and process your data on the fly while iterating over your dataset, without waiting for a full download first
- Works with huge datasets like OSCAR, C4, mC4 and hundreds of other datasets
JAX integration #2502 (@lhoestq)
Add Parquet loader + from_parquet and to_parquet #2537 (@lhoestq)
Implement ClassLabel encoding in JSON loader #2468 (@albertvillanova)
Set configurable downloaded datasets path #2488 (@albertvillanova)
Set configurable extracted datasets path #2487 (@albertvillanova)
Add align_labels_with_mapping function #2457 (@lewtun) #2510 (@lhoestq)
Add interleave_datasets for map-style datasets #2568 (@lhoestq)
Add load_dataset_builder #2500 (@mariosasko)
Support Zstandard compressed files #2578 (@albertvillanova)

Task templates

Add task templates for tydiqa and xquad #2518 (@lewtun)
Insert text classification template for Emotion dataset #2521 (@lewtun)
Add summarization template #2529 (@lewtun)
Add task template for automatic speech recognition #2533 (@lewtun)
Remove task templates if required features are removed during Dataset.map #2540 (@lewtun)
Inject templates for ASR datasets #2565 (@lewtun)

General improvements and bug fixes

Allow using tqdm>=4.50.0 #2482 (@lhoestq)
Use gc.collect only when needed to avoid slowdowns #2483 (@lhoestq)
Allow latest pyarrow version #2490 (@albertvillanova)
Use default cast for sliced list arrays if pyarrow >= 4 #2497 (@albertvillanova)
Add Zenodo metadata file with license #2501 (@albertvillanova)
Add tensorflow-macos support #2493 (@slayerjain)
Keep original features order #2453 (@albertvillanova)
Add course banner #2506 (@sgugger)
Rearrange JSON field names to match passed features schema field names #2507 (@albertvillanova)
Fix typo in MatthewsCorrelation class name #2517 (@albertvillanova)
Use scikit-learn package rather than sklearn in setup.py #2525 (@lesteve)
Improve performance of pandas arrow extractor #2519 (@albertvillanova)
Fix fingerprint when moving cache dir #2509 (@lhoestq)
Replace bad n>1M size tag #2527 (@lhoestq)
Fix dev version #2531 (@lhoestq)
Sync with transformers disabling NOTSET #2534 (@albertvillanova)
Fix logging levels #2544 (@albertvillanova)
Add support for Split.ALL #2259 (@mariosasko)
Raise FileNotFoundError in WindowsFileLock #2524 (@mariosasko)
Make numpy arrow extractor faster #2505 (@lhoestq)
Fix Dataset.map when num_proc > number of rows #2566 (@connor-mccarthy)
Add ASR task and new languages to resources #2567 (@lewtun)
Filter expected warning log from transformers #2571 (@albertvillanova)
Fix BibTeX entry #2579 (@albertvillanova)
Fix Counter import #2580 (@albertvillanova)
Add aiohttp to tests extras require #2587 (@albertvillanova)
Add language tags #2590 (@lewtun)
Support pandas 1.3.0 read_csv #2593 (@lhoestq)

Dataset cards

Updated Dataset Description #2420 (@binny-mathew)
Update DatasetMetadata and ReadMe #2436 (@gchhablani)
CRD3 dataset card #2515 (@wilsonyhlee)
Add license to the Cambridge English Write & Improve + LOCNESS dataset card #2546 (@lhoestq)
wi_locness: reference latest leaderboard on codalab #2584 (@aseifert)

Docs

no s at load_datasets #2479 (@julien-c)
Fix docs custom stable version #2477 (@albertvillanova)
Improve Features docs #2535 (@albertvillanova)
Update README.md #2414 (@cryoff)
Fix FileSystems documentation #2551 (@connor-mccarthy)
Minor fix in loading metrics docs #2562 (@albertvillanova)
Minor fix docs format for bertscore #2570 (@albertvillanova)
Add streaming in load a dataset docs #2574 (@lhoestq)