1.10.0

Datasets Features

  • Support remote data files #2616 (@albertvillanova). This lets you pass URLs of remote data files to a dataset loader:
    load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file...]})
    
    This works for all these dataset loaders:
    • text
    • csv
    • json
    • parquet
    • pandas
  • Streaming from remote text/json/csv/parquet/pandas files: When you pass URLs to a dataset loader, you can enable streaming mode with streaming=True. Main contributions:
    • Streaming for the Pandas loader #2636 (@lhoestq)
    • Streaming for the CSV loader #2635 (@lhoestq)
    • Streaming for the JSON loader #2608 (@albertvillanova) #2638 (@lhoestq)
  • Faster search_batch for ElasticsearchIndex due to threading #2581 (@mwrzalik)
  • Delete extracted files when loading dataset #2631 (@albertvillanova)
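
The remote data files and streaming features above can be combined as in the following minimal sketch. The URLs and the helper names build_data_files and load_remote_csv are hypothetical and exist only for illustration; the only library call used is load_dataset with the data_files and streaming arguments described in these notes.

```python
from typing import Dict, List


def build_data_files(train_urls: List[str], test_urls: List[str]) -> Dict[str, List[str]]:
    """Map split names to lists of remote file URLs, dropping empty splits."""
    splits = {"train": list(train_urls), "test": list(test_urls)}
    return {name: urls for name, urls in splits.items() if urls}


def load_remote_csv(data_files: Dict[str, List[str]], streaming: bool = False):
    """Pass remote CSV URLs to the csv loader, optionally in streaming mode."""
    # Imported lazily so build_data_files works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("csv", data_files=data_files, streaming=streaming)


# Hypothetical URLs, for illustration only.
data_files = build_data_files(
    train_urls=["https://example.com/train_0.csv", "https://example.com/train_1.csv"],
    test_urls=[],
)
print(data_files)
```

Calling load_remote_csv(data_files, streaming=True) would then request a streamed dataset instead of a fully downloaded one; it needs network access and URLs that actually serve CSV files.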

Datasets Changes

  • Fix: C4 - fix expected files list #2682 (@lhoestq)
  • Fix: SQuAD - fix misalignment #2586 (@albertvillanova)
  • Fix: omp - fix DuplicatedKeysError #2603 (@albertvillanova)
  • Fix: wi_locness - potential DuplicatedKeysError #2609 (@albertvillanova)
  • Fix: LibriSpeech - potential DuplicatedKeysError #2672 (@albertvillanova)
  • Fix: SQuAD - potential DuplicatedKeysError #2673 (@albertvillanova)
  • Fix: Blog Authorship Corpus - fix split sizes and text encoding #2685 (@albertvillanova)

Dataset Tasks

  • Add speech processing tasks #2620 (@lewtun)
  • Update ASR tags #2633 (@lewtun)
  • Inject ASR template for lj_speech dataset #2634 (@albertvillanova)
  • Add ASR task for SUPERB #2619 (@lewtun)
  • Add image-classification task template #2632 (@nateraw)

Metrics Changes

  • New: wiki_split #2623 (@bhadreshpsavani)
  • Update: accuracy,f1,precision,recall - Support multilabel metrics #2589 (@albertvillanova)
  • Fix: sacrebleu - fix parameter name #2674 (@albertvillanova)
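
To illustrate what "multilabel" means for accuracy, f1, precision and recall: each example carries a list of 0/1 indicators, one per label, rather than a single class id. The sketch below computes a micro-averaged F1 over such inputs in plain Python. It is not the library's implementation, and the function name micro_f1 is hypothetical.

```python
from typing import List


def micro_f1(references: List[List[int]], predictions: List[List[int]]) -> float:
    """Micro-averaged F1 over all (example, label) pairs of 0/1 indicators."""
    tp = fp = fn = 0
    for ref_row, pred_row in zip(references, predictions):
        for ref, pred in zip(ref_row, pred_row):
            if pred == 1 and ref == 1:
                tp += 1  # label predicted and present
            elif pred == 1 and ref == 0:
                fp += 1  # label predicted but absent
            elif pred == 0 and ref == 1:
                fn += 1  # label present but missed
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Two examples, three possible labels each.
references = [[1, 0, 1], [0, 1, 0]]
predictions = [[1, 0, 0], [0, 1, 1]]
score = micro_f1(references, predictions)
print(round(score, 4))
```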

General improvements and bug fixes

  • Fix BibTeX entry #2594 (@albertvillanova)
  • Fix test_is_small_dataset #2588 (@albertvillanova)
  • Remove import of transformers #2602 (@albertvillanova)
  • Make any ClientError trigger retry in streaming mode (e.g. ClientOSError) #2605 (@lhoestq)
  • Fix filter with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
  • Remove redundant prepare_module #2597 (@albertvillanova)
  • Create ExtractManager #2295 (@albertvillanova)
  • Return Python float instead of numpy.float64 in sklearn metrics #2612 (@lewtun)
  • Use ndarray.item instead of ndarray.tolist #2613 (@lewtun)
  • Convert numpy scalar to python float in Pearsonr output #2614 (@lhoestq)
  • Fix missing EOL issue in to_json for old versions of pandas #2617 (@lhoestq)
  • Use correct logger in metrics.py #2626 (@mariosasko)
  • Minor fix tests with Windows paths #2627 (@albertvillanova)
  • Use ETag of remote data files #2628 (@albertvillanova)
  • More consistent naming #2611 (@mariosasko)
  • Refactor patching to specific submodule #2639 (@albertvillanova)
  • Fix docstrings #2640 (@albertvillanova)
  • Fix anchor in README #2647 (@mariosasko)
  • Fix logging docstring #2652 (@mariosasko)
  • Allow dataset config kwargs to be None #2659 (@lhoestq)
  • Use prefix to allow exceeding Windows MAX_PATH #2621 (@albertvillanova)
  • Use tqdm from tqdm_utils #2667 (@mariosasko)
  • Increase json reader block_size automatically #2676 (@lhoestq)
  • Parallelize ETag requests #2675 (@lhoestq)
  • Fix bad config ids that name cache directories #2686 (@lhoestq)
  • Minor documentation fix #2687 (@slowwavesleep)
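
Several entries above (#2612, #2613, #2614) concern returning built-in Python floats instead of NumPy scalars. The following sketch shows the underlying distinction using only NumPy; it assumes NumPy is installed and does not reproduce the library's metric code.

```python
import numpy as np

# A reduction over an array yields a NumPy scalar, not a built-in float.
numpy_score = np.array([0.25, 0.75]).mean()

# ndarray.item() / scalar.item() converts to the matching built-in Python type.
python_score = numpy_score.item()

print(type(numpy_score).__name__)
print(type(python_score).__name__)
```

The values are equal; only the type differs, which matters for code that checks types strictly or serializes results.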

Dataset Cards

  • Add missing WikiANN language tags #2610 (@albertvillanova)
  • feat: 🎸 add paperswithcode id for qasper dataset #2680 (@severo)

Docs

  • Update processing.rst with other export formats #2599 (@TevenLeScao)

Fetched April 7, 2026