1.13.0

Dataset changes

  • New: CaSiNo #2867 (@kushalchawla)
  • New: Mostly Basic Python Problems #2893 (@lvwerra)
  • New: OpenAI's HumanEval #2897 (@lvwerra)
  • New: SemEval-2018 Task 1: Affect in Tweets #2745 (@maxpel)
  • New: SEDE #2942 (@Hazoom)
  • New: Jigsaw unintended Bias #2935 (@Iwontbecreative)
  • New: AMI #2853 (@cahya-wirawan)
  • New: Math Aptitude Test of Heuristics #2982 #3014 (@hacobe, @albertvillanova)
  • New: SwissJudgmentPrediction #2983 (@JoelNiklaus)
  • New: KanHope #2985 (@adeepH)
  • New: CommonLanguage #2989 #3006 #3003 (@anton-l, @albertvillanova, @jimregan)
  • New: SwedMedNER #2940 (@bwang482)
  • New: SberQuAD #3039 (@Alenush)
  • New: LexGLUE: A Benchmark Dataset for Legal Language Understanding in English #3004 (@iliaschalkidis)
  • New: Greek Legal Code #2966 (@christospi)
  • New: Story Cloze Test #3067 (@zaidalyafeai)
  • Update: SUPERB - add IC, SI, ER tasks #2884 #3009 (@anton-l, @albertvillanova)
  • Update: MENYO-20k - repo has moved, updating URL #2939 (@cdleong)
  • Update: TriviaQA - add web and wiki config #2949 (@shirte)
  • Update: nq_open - Use standard open-domain validation split #3029 (@craffel)
  • Update: MeDAL - Add further description and update download URL #3022 (@xhlulu)
  • Update: Biosses - fix column names #3054 (@bwang482)
  • Fix: scitldr - fix minor URL format #2948 (@albertvillanova)
  • Fix: masakhaner - update JSON metadata #2973 (@albertvillanova)
  • Fix: TriviaQA - fix unfiltered subset #2995 (@lhoestq)
  • Fix: TriviaQA - set writer batch size #2999 (@lhoestq)
  • Fix: LJ Speech - fix Windows paths #3016 (@albertvillanova)
  • Fix: MedDialog - update metadata JSON #3046 (@albertvillanova)

Metric changes

  • Update: meteor - update from nltk update #2946 (@lhoestq)
  • Update: accuracy, f1, glue, indic-glue, pearsonr, precision, recall, super_glue - Replace item with float in metrics #3012 #3001 (@albertvillanova, @mariosasko)
  • Fix: f1/precision/recall metrics with None average #3008 #2992 (@albertvillanova)
  • Fix: meteor - fix metric for nltk version >= 3.6.4 #3056 (@albertvillanova)
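
The f1/precision/recall fix above concerns average=None, where the result is one score per class rather than a single number, so it cannot be converted to one float. The sketch below illustrates that difference in plain Python. It is not the metric's implementation; the function name f1_scores and its parameters are hypothetical and exist only for this illustration.

```python
from typing import List, Optional, Sequence, Union


def f1_scores(
    references: Sequence[int],
    predictions: Sequence[int],
    average: Optional[str] = "macro",
) -> Union[float, List[float]]:
    """Illustrative F1: a list of per-class scores when average is None,
    otherwise a single macro-averaged float."""
    labels = sorted(set(references) | set(predictions))
    per_class = []
    for label in labels:
        pairs = list(zip(references, predictions))
        tp = sum(1 for r, p in pairs if r == label and p == label)
        fp = sum(1 for r, p in pairs if r != label and p == label)
        fn = sum(1 for r, p in pairs if r == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0.0 when the class never occurs.
        per_class.append(2 * tp / denom if denom else 0.0)

    if average is None:
        # One score per class: there is no single float to return here.
        return per_class
    if average == "macro":
        return float(sum(per_class) / len(per_class))
    raise ValueError(f"unsupported average: {average!r}")
```

Calling f1_scores(refs, preds, average=None) returns a list with one entry per label, while the default returns a single float, which is why the two cases need different handling when results are post-processed.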

Dataset features

  • Use with TensorFlow:
    • Adding to_tf_dataset method #2731 #2931 #2951 #2974 (@Rocketknight1)
  • Better support for ZIP files:
    • Support loading dataset from multiple zipped CSV data files #3021 (@albertvillanova)
    • Load private data files + use glob on ZIP archives for json/csv/etc. module inference #3041 (@lhoestq)
  • Streaming improvements:
    • Extend support for streaming datasets that use glob.glob #3015 (@albertvillanova)
    • Add remove_columns to IterableDataset #3030 (@cccntu)
    • All the above ZIP features also work in streaming mode
  • New utilities:
    • Add get_dataset_split_names() to get a dataset config's split names #2906 (@severo)
  • Replace script_version with revision #2933 (@albertvillanova)
    • The script_version parameter in load_dataset is now deprecated, in favor of revision
  • Experimental - Create Audio feature type #2324 (@albertvillanova):
    • It automatically decodes audio data (mp3, wav, flac, etc.) when examples are accessed
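
The ZIP items above mention loading several zipped CSV files and using glob patterns inside an archive to infer the file format. The following is a minimal standard-library sketch of that idea, not the library's code: the names EXTENSION_TO_MODULE, infer_module_from_zip and read_zipped_csvs are hypothetical, and the extension-to-loader mapping is an assumption made for the example.

```python
import csv
import io
import zipfile
from collections import Counter
from fnmatch import fnmatch
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to a loader name (assumption).
EXTENSION_TO_MODULE = {".csv": "csv", ".json": "json", ".jsonl": "json", ".txt": "text"}


def infer_module_from_zip(zip_path):
    """Return the loader name of the most common known extension in the archive,
    or None if no member has a known extension."""
    with zipfile.ZipFile(zip_path) as archive:
        counts = Counter(
            EXTENSION_TO_MODULE[PurePosixPath(name).suffix]
            for name in archive.namelist()
            if PurePosixPath(name).suffix in EXTENSION_TO_MODULE
        )
    return counts.most_common(1)[0][0] if counts else None


def read_zipped_csvs(zip_path, pattern="*.csv"):
    """Read every archive member matching the glob pattern and concatenate the rows."""
    rows = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in sorted(archive.namelist()):
            if fnmatch(name, pattern):
                with archive.open(name) as raw:
                    # Members are opened as bytes; wrap them for the csv module.
                    text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                    rows.extend(csv.DictReader(text))
    return rows
```

Given an archive containing part1.csv and part2.csv with the same header, read_zipped_csvs returns the rows of both files as one list of dictionaries, in member-name order.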

Dataset cards

  • Add arXiv paper in swiss_judgment_prediction dataset card #3026 (@JoelNiklaus)

Documentation

  • Add tutorial for no-code dataset upload #2925 (@stevhliu)

General improvements and bug fixes

  • Fix filter leaking #3019 (@lhoestq)
    • Calling filter several times in a row returned incorrect results in 1.12.0 and 1.12.1
  • Update BibTeX entry #2928 (@albertvillanova)
  • Fix exception chaining #2911 (@albertvillanova)
  • Add regression test for null Sequence #2929 (@albertvillanova)
  • Don't use old, incompatible cache for the new filter #2947 (@lhoestq)
  • Fix fn kwargs in filter #2950 (@lhoestq)
  • Use pyarrow.Table.replace_schema_metadata instead of pyarrow.Table.cast #2895 (@arsarabi)
  • Check that array is not Float as nan != nan #2936 (@Iwontbecreative)
  • Fix missing conda deps #2952 (@lhoestq)
  • Update legacy Python image for CI tests in Linux #2955 (@albertvillanova)
  • Support pandas 1.3 new read_csv parameters #2960 (@SBrandeis)
  • Fix CI doc build #2961 (@albertvillanova)
  • Run tests in parallel #2954 (@albertvillanova)
  • Ignore dummy folder and dataset_infos.json #2975 (@Ishan-Kumar2)
  • Take namespace into account in caching #2938 (@lhoestq)
  • Make Dataset.map accept list of np.array #2990 (@albertvillanova)
  • Fix loading compressed CSV without streaming #2994 (@albertvillanova)
  • Fix json loader when conversion not implemented #3000 (@lhoestq)
  • Remove all query parameters when extracting protocol #2996 (@albertvillanova)
  • Correct a typo #3007 (@Yann21)
  • Fix Windows test suite #3025 (@albertvillanova)
  • Remove unused parameter in xdirname #3017 (@albertvillanova)
  • Properly install ruamel-yaml for windows CI #3028 (@lhoestq)
  • Fix typo #3023 (@qqaatw)
  • Actual "proper" install of ruamel.yaml in the windows CI #3033 (@lhoestq)
  • Use cache folder for lockfile #2887 (@Dref360)
  • Fix streaming: catch Timeout error #3050 (@borisdayma)
  • Refac module factory + avoid etag requests for hub datasets #2986 (@lhoestq)
  • Fix task reloading from cache #3059 (@lhoestq)
  • Fix test command after refac #3065 (@lhoestq)
  • Fix Windows CI with FileNotFoundError when setting up s3_base fixture #3070 (@albertvillanova)
  • Update summary on PyPI beyond NLP #3062 (@thomwolf)
  • Remove a reference to the open Arrow file when deleting a TF dataset created with to_tf_dataset #3002 (@mariosasko)
  • feat: increase streaming retry config #3068 (@borisdayma)
  • Fix pathlib patches for streaming #3072 (@lhoestq)

Breaking changes

  • Due to the large refactoring in #2986, the prepare_module function no longer supports the return_resolved_file_path and return_associated_base_path parameters. Use dataset_module_factory instead.

Fetched April 7, 2026