Hugging Face / Datasets

Releases: 8 · Avg: 3/mo · Versions tracked: v4.5.0 → v4.8.3
Mar 17, 2025
Mar 14, 2025

Dataset Features

  • Faster folder based builder + parquet support + allow repeated media + use torchvision by @lhoestq in https://github.com/huggingface/datasets/pull/7424

    • /!\ Breaking change: we replaced decord with torchvision to read videos, since decord is no longer maintained and isn't available for recent Python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version:
    from datasets import load_dataset, Video
    
    dataset = load_dataset("path/to/video/folder", split="train")
    dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
    
    • faster streaming for image/audio/video folder from Hugging Face
    • support for metadata.parquet in addition to metadata.csv or metadata.jsonl for the metadata of the image/audio/video files
  • Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450

    • even faster streaming for image/audio/video folder from Hugging Face if you enable multithreading to decode image/audio/video data:
    dataset = dataset.decode(num_threads=num_threads)
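The speedup comes from decoding media files in a thread pool instead of one at a time. A minimal standard-library sketch of that pattern (the `decode_file` function and file names below are hypothetical stand-ins for real image/audio/video decoding, which releases the GIL during I/O so threads overlap):

```python
from concurrent.futures import ThreadPoolExecutor

def decode_file(path):
    # Hypothetical stand-in for decoding one image/audio/video file.
    return f"decoded:{path}"

def decode_batch(paths, num_threads=4):
    # Decode files concurrently; pool.map preserves input order,
    # the way .decode() keeps examples in dataset order.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(decode_file, paths))

print(decode_batch(["a.jpg", "b.jpg", "c.jpg"]))
# → ['decoded:a.jpg', 'decoded:b.jpg', 'decoded:c.jpg']
```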
    
  • Add with_split to DatasetDict.map by @jp1924 in https://github.com/huggingface/datasets/pull/7368

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.3.2...3.4.0

Feb 20, 2025

Bug fixes

Other general improvements

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.3.1...3.3.2

Feb 17, 2025
Feb 14, 2025

Dataset Features

  • Support async functions in map() by @lhoestq in https://github.com/huggingface/datasets/pull/7384

    • Especially useful to download content like images or call inference APIs
    prompt = "Answer the following question: {question}. You should think step by step."
    async def ask_llm(example):
        return await query_model(prompt.format(question=example["question"]))
    ds = ds.map(ask_llm)
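The benefit of async functions in `map()` is that many pending coroutines can run concurrently instead of each example being awaited one by one. A standard-library sketch of that pattern, assuming a hypothetical `query_model` stand-in for a real inference API call:

```python
import asyncio

async def query_model(prompt):
    # Hypothetical stand-in for an async inference API call.
    await asyncio.sleep(0)  # yield control, as real network I/O would
    return f"answer to: {prompt}"

async def ask_llm(example):
    return await query_model(example["question"])

async def map_async(examples, fn):
    # Launch one coroutine per example and await them all concurrently.
    return await asyncio.gather(*(fn(ex) for ex in examples))

examples = [{"question": "1+1?"}, {"question": "2+2?"}]
results = asyncio.run(map_async(examples, ask_llm))
print(results)
# → ['answer to: 1+1?', 'answer to: 2+2?']
```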
    
  • Add repeat method to datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7198

    ds = ds.repeat(10)
    
  • Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in https://github.com/huggingface/datasets/pull/7370

    • Add support for "pandas" and "polars" formats in IterableDatasets
    • This enables optimized data processing using pandas or polars functions with zero-copy, e.g.
    import polars as pl
    from datasets import load_dataset

    ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
    ds = ds.with_format("polars")
    expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
    ds = ds.map(lambda df: df.with_columns(expr), batched=True)
    
  • Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7207

    • IterableDatasets with "numpy" format are now much faster

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.2.0...3.3.0

Dec 10, 2024

Dataset Features

  • Faster parquet streaming + filters with predicate pushdown by @lhoestq in https://github.com/huggingface/datasets/pull/7309
    • Up to +100% streaming speed
    • Fast filtering via predicate pushdown (skip files/row groups based on predicate instead of downloading the full data), e.g.
      from datasets import load_dataset
      filters = [('date', '>=', '2023')]
      ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
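Predicate pushdown works because the reader can compare each Parquet row group's min/max column statistics against the filter and skip groups that cannot possibly match, before downloading any row data. A simplified pure-Python sketch of that decision for `(column, op, value)` filter tuples (the statistics dicts are hypothetical):

```python
def may_match(stats, filters):
    # Keep a row group only if every (column, op, value) filter could
    # be satisfied given the group's (min, max) column statistics.
    for column, op, value in filters:
        lo, hi = stats[column]
        if op == ">=" and hi < value:
            return False
        if op == "<=" and lo > value:
            return False
        if op == "==" and not (lo <= value <= hi):
            return False
    return True

row_groups = [
    {"date": ("2020", "2022")},  # everything before 2023 -> skipped
    {"date": ("2022", "2024")},  # may contain matches -> downloaded
]
filters = [("date", ">=", "2023")]
kept = [g for g in row_groups if may_match(g, filters)]
print(kept)  # only the second group survives
```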
      

Other improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.1.0...3.2.0

Oct 31, 2024

Dataset Features

  • Video support by @lhoestq in https://github.com/huggingface/datasets/pull/7230
    >>> from datasets import Dataset, Video, load_dataset
    >>> ds = Dataset.from_dict({"video":["path/to/Screen Recording.mov"]}).cast_column("video", Video())
    >>> # or from the hub
    >>> ds = load_dataset("username/dataset_name", split="train")
    >>> ds[0]["video"]
    <decord.video_reader.VideoReader at 0x105525c70>
    
  • Add IterableDataset.shard() by @lhoestq in https://github.com/huggingface/datasets/pull/7252
    >>> from datasets import load_dataset
    >>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
    >>> full_ds.num_shards
    2360
    >>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
    >>> ds.num_shards
    1
    >>> ds = full_ds.shard(num_shards=8, index=0)
    >>> ds.num_shards
    295
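The numbers above follow from contiguous assignment: the streamed dataset is made of 2360 file-level shards, and `shard(num_shards=8, index=0)` keeps whole files together, so shard 0 gets the first 2360 / 8 = 295 files. A small sketch of that arithmetic (the file names are hypothetical):

```python
def shard_files(files, num_shards, index):
    # Assign whole files to shards in contiguous blocks, spreading any
    # remainder one extra file at a time over the first shards.
    div, mod = divmod(len(files), num_shards)
    start = index * div + min(index, mod)
    end = start + div + (1 if index < mod else 0)
    return files[start:end]

files = [f"data-{i:05d}.tar" for i in range(2360)]
print(len(shard_files(files, num_shards=8, index=0)))     # 295
print(len(shard_files(files, num_shards=2360, index=0)))  # 1
```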
    
  • Basic XML support by @lhoestq in https://github.com/huggingface/datasets/pull/7250

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.0.2...3.1.0

Oct 22, 2024

Main bug fixes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.0.1...3.0.2

Sep 26, 2024

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.0.0...3.0.1

Sep 11, 2024

Dataset Features

  • Use Polars functions in .map()
    • Allow Polars as valid output type by @psmyth94 in https://github.com/huggingface/datasets/pull/6762

    • Example:

      >>> import polars as pl
      >>> from datasets import load_dataset
      >>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
      >>> cols = [pl.col("content").str.len_bytes().alias("length")]
      >>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
      >>> ds_with_length[:5]
      shape: (5, 5)
      ┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
      │ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
      │ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
      │ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
      ╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
      │ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
      │ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
      │ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
      │ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
      │ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
      └─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
      
  • Support NumPy 2

Cache Changes

  • Use huggingface_hub cache by @lhoestq in https://github.com/huggingface/datasets/pull/7105
    • use the huggingface_hub cache for files downloaded from HF, by default at ~/.cache/huggingface/hub
    • cached datasets (Arrow files) will still be reloaded from the datasets cache, by default at ~/.cache/huggingface/datasets

Breaking changes

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.21.0...3.0.0

Aug 14, 2024

Features

  • Support pyarrow large_list by @albertvillanova in https://github.com/huggingface/datasets/pull/7019
    • Support Polars round trip:
      import polars as pl
      from datasets import Dataset
      
      df1 = pl.from_dict({"col_1": [[1, 2], [3, 4]]})
      df2 = Dataset.from_polars(df1).to_polars()
      assert df1.equals(df2)
      

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.20.0...2.21.0

Jun 13, 2024

Important

Datasets features

  • [Resumable IterableDataset] Add IterableDataset state_dict by @lhoestq in https://github.com/huggingface/datasets/pull/6658
    • checkpoint and resume an iterable dataset (e.g. when streaming):

      >>> from datasets import Dataset
      >>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
      >>> for idx, example in enumerate(iterable_dataset):
      ...     print(example)
      ...     if idx == 2:
      ...         state_dict = iterable_dataset.state_dict()
      ...         print("checkpoint")
      ...         break
      >>> iterable_dataset.load_state_dict(state_dict)
      >>> print("restart from checkpoint")
      >>> for example in iterable_dataset:
      ...     print(example)
      

      Returns:

      {'a': 0}
      {'a': 1}
      {'a': 2}
      checkpoint
      restart from checkpoint
      {'a': 3}
      {'a': 4}
      {'a': 5}
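The mechanism can be pictured as an iterator that records which shard it is reading and how far into it, so a saved state lets a fresh pass skip the already-seen examples. A minimal standard-library sketch under that assumption (the shard layout and state keys are hypothetical, not the library's actual internals):

```python
class ResumableIterable:
    # Sketch of a resumable iterable: state is (shard index, offset in shard).
    def __init__(self, shards):
        self.shards = shards
        self._state = {"shard": 0, "offset": 0}

    def state_dict(self):
        return dict(self._state)

    def load_state_dict(self, state):
        self._state = dict(state)

    def __iter__(self):
        start = self._state
        for s in range(start["shard"], len(self.shards)):
            shard = self.shards[s]
            first = start["offset"] if s == start["shard"] else 0
            for i in range(first, len(shard)):
                # Advance the saved position *before* yielding, so a
                # checkpoint taken after consuming this example resumes
                # at the next one.
                if i + 1 < len(shard):
                    self._state = {"shard": s, "offset": i + 1}
                else:
                    self._state = {"shard": s + 1, "offset": 0}
                yield shard[i]

ds = ResumableIterable([[0, 1], [2, 3], [4, 5]])
seen = []
for idx, example in enumerate(ds):
    seen.append(example)
    if idx == 2:
        checkpoint = ds.state_dict()
        break
ds.load_state_dict(checkpoint)
seen += list(ds)
print(seen)  # [0, 1, 2, 3, 4, 5]
```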
      

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.20.0

Jun 3, 2024

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.19.1...2.19.2

May 6, 2024

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.19.1

Apr 19, 2024

Dataset Features

  • Add Polars compatibility by @psmyth94 in https://github.com/huggingface/datasets/pull/6531
    • convert to a Polars dataframe using .to_polars();
      import polars as pl
      from datasets import load_dataset
      ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
      ds.to_polars() \
          .group_by("topic") \
          .agg(pl.len(), pl.first()) \
          .sort("len", descending=True)
      
    • Use Polars formatting to return Polars objects when accessing a dataset:
      ds = ds.with_format("polars")
      ds[:10].group_by("kind").len()
      
  • Add fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in https://github.com/huggingface/datasets/pull/6096
    • Save on HF in any file format:
      ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
      ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
      ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
      
  • Add mode parameter to Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/6735
    • Set images to be read in a certain mode like "RGB"
      dataset = dataset.cast_column("image", Image(mode="RGB"))
      
  • Add CLI function to convert script-dataset to Parquet by @albertvillanova in https://github.com/huggingface/datasets/pull/6795
    • run command to open a PR in script-based dataset to convert it to Parquet:
      datasets-cli convert_to_parquet <dataset_id>
      
  • Add Dataset.take and Dataset.skip by @lhoestq in https://github.com/huggingface/datasets/pull/6813
    • same as IterableDataset.take and IterableDataset.skip
      ds = ds.take(10)  # take only the first 10 examples
      

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.18.0...2.19.0

Mar 1, 2024

Dataset features

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.17.1...2.18.0

Feb 19, 2024

Bug Fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.17.0...2.17.1

Feb 9, 2024

Dataset Features

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.16.1...2.17.0

Dec 30, 2023

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.16.0...2.16.1

Dec 22, 2023

Security features

  • Add trust_remote_code argument by @lhoestq in https://github.com/huggingface/datasets/pull/6429
    • Some Hugging Face datasets contain custom code which must be executed to correctly load the dataset. The code can be inspected in the repository content at https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in the future by passing trust_remote_code=True.
    • Passing trust_remote_code=True will become mandatory to load these datasets in the next major release of datasets.
    • With the environment variable HF_DATASETS_TRUST_REMOTE_CODE=0 you can already disable custom code by default, without waiting for the next release of datasets.
  • Use parquet export if possible by @lhoestq in https://github.com/huggingface/datasets/pull/6448
    • This allows loading most old datasets based on custom code by downloading the Parquet export provided by Hugging Face
    • You can see a dataset's Parquet export at https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet

Features

  • Webdataset dataset builder by @lhoestq in https://github.com/huggingface/datasets/pull/6391
  • Implement get dataset default config name by @albertvillanova in https://github.com/huggingface/datasets/pull/6511
  • Lazy data files resolution and offline cache reload by @lhoestq in https://github.com/huggingface/datasets/pull/6493
    • This speeds up the load_dataset step that lists the data files of big repositories (up to 100x faster) but requires huggingface_hub 0.20 or newer
    • Fix load_dataset that used to reload data from cache even if the dataset was updated on Hugging Face
    • Reload a dataset from your cache even if you don't have internet connection
    • New cache directory scheme for no-script datasets: ~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha
    • Backward compatibility: cached datasets from datasets 2.15 (using the old scheme) are still reloaded from the cache

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.15.0...2.16.0

Latest: 4.8.4 · Tracking since: Apr 30, 2021 · Last fetched: Apr 19, 2026