Hugging Face / Datasets
Mar 23, 2026 (4.8.4)

What's Changed

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.3...4.8.4

Mar 19, 2026 (4.8.3)

What's Changed

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.2...4.8.3

Mar 17, 2026 (4.8.2 and 4.8.1)

What's Changed

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.1...4.8.2

What's Changed

Full Changelog: https://github.com/huggingface/datasets/compare/4.8.0...4.8.1

Mar 16, 2026 (4.8.0)

Dataset Features

  • Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in https://github.com/huggingface/datasets/pull/8064

    from datasets import load_dataset
    # load raw data from a Storage Bucket on HF
    ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
    # or manually, using hf:// paths
    ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])
    # process, filter
    ds = ds.map(...).filter(...)
    # publish the AI-ready dataset
    ds.push_to_hub("username/my-dataset-ready-for-training")
    

    This also fixes a segfault in multiprocessed push_to_hub on macOS (it now uses spawn instead of fork), and bumps the dill and multiprocess versions to support Python 3.14.

  • Datasets streaming, IterableDataset, and packaged-module improvements and fixes by @Michael-RDev in https://github.com/huggingface/datasets/pull/8068

    • added max_shard_size to IterableDataset.push_to_hub (but this requires iterating over the dataset twice to know its full size - improvements are welcome); see the usage sketch after this list
    • more arrow-native iterable operations for IterableDataset
    • better support of glob patterns in archives, e.g. zip://*.jsonl::hf://datasets/username/dataset-name/data.zip
    • fixes for to_pandas, videofolder, load_dataset_builder kwargs
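
    A usage sketch for these additions (a hedged sketch; the repo names and the "500MB" value are placeholders):

    from datasets import load_dataset

    # glob pattern inside an archive hosted on Hugging Face
    ds = load_dataset(
        "json",
        data_files=["zip://*.jsonl::hf://datasets/username/dataset-name/data.zip"],
        split="train",
        streaming=True,
    )

    # cap the shard size when pushing an IterableDataset (iterates twice)
    ds.push_to_hub("username/my-dataset", max_shard_size="500MB")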

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.7.0...4.8.0

Mar 9, 2026 (4.7.0)

Datasets Features

  • Add Json() type by @lhoestq in https://github.com/huggingface/datasets/pull/8027
    • JSON Lines files that contain arbitrary JSON objects, like tool-calling datasets, are now supported. When a field or subfield contains mixed types (e.g. a mix of str/int/float/dict/list, or dictionaries with arbitrary keys), the Json() type is used to store data that would otherwise not be supported in Arrow/Parquet
    • Use the Json() type in Features() for any dataset; it is supported by any function that accepts features=, like load_dataset(), .map(), .cast(), .from_dict(), .from_list()
    • Use on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map()

Examples:

You can use on_mixed_types="use_json" or specify features= with a Json() type:

>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
  ...
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, "foo", {"subfield": "bar"}]

This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:

>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]]  # missing fields are filled with None

>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]]  # OK

Another example with tool-calling data and the on_mixed_types="use_json" argument (useful to avoid specifying features= manually):

>>> messages = [
...     {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
...     {"role": "assistant", "tool_calls": [
...         {"type": "function", "function": {
...             "name": "control_light",
...             "arguments": {"room": "living room", "state": "on"}
...         }},
...         {"type": "function", "function": {
...             "name": "play_music",
...             "arguments": {"playlist": "electronic"}  # mixed-type here since keys ["playlist"] and ["room", "state"] are different
...         }}]
...     },
...     {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
...     {"role": "tool", "name": "play_music", "content": "The music is now playing."},
...     {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{'room': 'living room', 'state': 'on'}

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.6.1...4.7.0

Feb 27, 2026 (4.6.1)
Feb 25, 2026 (4.6.0)

Dataset Features

  • Support Image, Video and Audio types in Lance datasets

    >>> from datasets import load_dataset
    >>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
    >>> ds.features
    {'video_blob': Video(),
     'video_path': Value('string'),
     'caption': Value('string'),
     'aesthetic_score': Value('float64'),
     'motion_score': Value('float64'),
     'temporal_consistency_score': Value('float64'),
     'camera_motion': Value('string'),
     'frame': Value('int64'),
     'fps': Value('float64'),
     'seconds': Value('float64'),
     'embedding': List(Value('float32'), length=1024)}
    
  • Push to hub now supports Video types

    >>> from datasets import Dataset, Video
    >>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
    >>> ds = ds.cast_column("video", Video())
    >>> ds.push_to_hub("username/my-video-dataset")
    
  • Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7976

    • this enables cross-format Xet deduplication for image/audio/video, e.g. deduplicating videos across Lance, WebDataset, Parquet files and plain video files, making downloads and uploads to Hugging Face faster
    • E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload will be much faster since videos don't need to be reuploaded. Under the hood, the Xet storage reuses the binary chunks from the videos in Lance format for the videos in Parquet format
    • See more info here: https://huggingface.co/docs/hub/en/xet/deduplication
  • Add IterableDataset.reshard() by @lhoestq in https://github.com/huggingface/datasets/pull/7992

    Reshard the dataset if possible, i.e. split the current shards further into more shards. This increases the number of shards, so the resulting dataset has num_shards >= previous_num_shards (equality happens if no shard can be split further).

    The resharding mechanism depends on the dataset file format:

    • Parquet: shard per row group instead of per file
    • Other: not implemented yet (contributions are welcome!)
    >>> from datasets import load_dataset
    >>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
    >>> ds
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 4
    })
    >>> ds.reshard()
    IterableDataset({
        features: ['label', 'title', 'content'],
        num_shards: 3600
    })
    

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.5.0...4.6.0

Jan 14, 2026 (4.5.0)

Dataset Features

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.4.2...4.5.0

Dec 19, 2025 (4.4.2)

Bug fixes

Minor additions

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.4.1...4.4.2

Nov 5, 2025 (4.4.1)

Bug fixes and improvements

Full Changelog: https://github.com/huggingface/datasets/compare/4.4.0...4.4.1

Nov 4, 2025 (4.4.0)

Dataset Features
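
The Audio() type now takes a num_channels argument to control the channel layout of decoded samples: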

# samples have shape (num_channels, num_samples)
ds = ds.cast_column("audio", Audio())  # default, use all channels
ds = ds.cast_column("audio", Audio(num_channels=2))  # use stereo
ds = ds.cast_column("audio", Audio(num_channels=1))  # use mono

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.3.0...4.4.0

Oct 23, 2025 (4.3.0)

Dataset Features

Enable large scale distributed dataset streaming:

These improvements require huggingface_hub>=1.1.0 to take full effect
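
As an illustration of distributed streaming, shards can be split across training nodes with split_dataset_by_node (a minimal sketch; the repo name, rank, and world size are placeholders):

from datasets import load_dataset
from datasets.distributed import split_dataset_by_node

# each node streams only its own subset of shards
ds = load_dataset("username/my-large-dataset", split="train", streaming=True)
ds = split_dataset_by_node(ds, rank=0, world_size=8)  # this process is node 0 of 8
for example in ds:
    ...  # training loop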

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.2.0...4.3.0

Oct 9, 2025 (4.2.0)

Dataset Features

  • Sample without replacement option when interleaving datasets by @radulescupetru in https://github.com/huggingface/datasets/pull/7786

    from datasets import interleave_datasets

    ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
    
  • Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in https://github.com/huggingface/datasets/pull/7806

    ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
    
  • Add parquet scan options and docs by @lhoestq in https://github.com/huggingface/datasets/pull/7801

    • docs to select columns and filter data efficiently
    ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
    ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
    
    • new argument to control buffering and caching when streaming
    fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20))
    ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
    

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.1.1...4.2.0

Sep 18, 2025 (4.1.1)

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.1.0...4.1.1

Sep 15, 2025 (4.1.0)

Dataset Features

  • feat: use content defined chunking by @kszucs in https://github.com/huggingface/datasets/pull/7589

    • Parquet datasets are now Optimized Parquet!

    • internally uses use_content_defined_chunking=True when writing Parquet files

    • this enables fast deduped uploads to Hugging Face!

    # Now faster thanks to content defined chunking
    ds.push_to_hub("username/dataset_name")
    
    • this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). Parquet content defined chunking sets Parquet page boundaries based on the content of the data, so duplicate data is easy to detect.
    • with this change, the new default row group size for Parquet is set to 100MB
    • write_page_index=True is also used to enable fast random access for the Dataset Viewer and tools that need it
  • Concurrent push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7708

  • Concurrent IterableDataset push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7710

  • HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7690

    • load HDF5 datasets in one line of code
    ds = load_dataset("username/dataset-with-hdf5-files")
    
    • each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows (see the sketch below)
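
    A minimal sketch of that mapping, assuming h5py and that the packaged loader is named "hdf5":

    import h5py
    import numpy as np
    from datasets import load_dataset

    # write a small HDF5 file with a top-level field and a nested group
    with h5py.File("data.h5", "w") as f:
        f.create_dataset("label", data=np.arange(4))
        f.create_group("meta").create_dataset("score", data=np.linspace(0.0, 1.0, 4))

    # each field becomes a column; the first dimension is the row axis
    ds = load_dataset("hdf5", data_files="data.h5", split="train")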

Other improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/4.0.0...4.1.0

Jul 9, 2025 (4.0.0)

New Features

  • Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595

    # Build streaming data pipelines in a few lines of code!
    from datasets import load_dataset
    
    ds = load_dataset(..., streaming=True)
    ds = ds.map(...).filter(...)
    ds.push_to_hub(...)
    
  • Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606

    # Faster push to Hub ! Available for both Dataset and IterableDataset
    ds.push_to_hub(..., num_proc=8)
    
  • New Column object

    # Syntax:
    ds["column_name"]  # datasets.Column([...]) or datasets.IterableColumn(...)
    
    # Iterate on a column:
    for text in ds["text"]:
        ...
    
    # Load one cell without bringing the full column in memory
    first_text = ds["text"][0]  # equivalent to ds[0]["text"]
    
  • Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616

    • Enables streaming only the ranges you need!
    # Don't download full audios/videos when it's not necessary
    # Now with torchcodec it only streams the required ranges/frames:
    from datasets import load_dataset
    
    ds = load_dataset(..., streaming=True)
    for example in ds:
        video = example["video"]
        frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames
    
    • Requires torch>=2.7.0 and FFmpeg >= 4
    • Not available on Windows yet, but it is coming soon - in the meantime please use datasets<4.0
    • Load audio data with AudioDecoder:
    audio = dataset[0]["audio"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
    samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)
    samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])
    samples.sample_rate  # 16000
    
    # old syntax is still supported
    array, sr = audio["array"], audio["sampling_rate"]
    
    • Load video data with VideoDecoder:
    video = dataset[0]["video"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
    first_frame = video.get_frame_at(0)
    first_frame.data.shape  # (3, 240, 320)
    first_frame.pts_seconds  # 0.0
    frames = video.get_frames_in_range(0, 6, 1)
    frames.data.shape  # torch.Size([5, 3, 240, 320])
    

Breaking changes

  • Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592

    • trust_remote_code is no longer supported
  • Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616

    • torchcodec replaces soundfile for audio decoding
    • torchcodec replaces decord for video decoding
  • Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634

    • Introduction of the List type
    from datasets import Features, List, Value
    
    features = Features({
        "texts": List(Value("string")),
        "four_paragraphs": List(Value("string"), length=4)
    })
    
    • Sequence was a legacy type from TensorFlow Datasets that converted lists of dicts into dicts of lists. It is no longer a type; it is now a utility that returns a List or a dict depending on the subfeature
    from datasets import Sequence
    
    Sequence(Value("string"))  # List(Value("string"))
    Sequence({"texts": Value("string")})  # {"texts": List(Value("string"))}
    

Other improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0

May 7, 2025 (3.6.0)

Dataset Features

Other improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.5.1...3.6.0

Apr 28, 2025 (3.5.1)

Bug fixes

Other improvements

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.5.0...3.5.1

Mar 27, 2025 (3.5.0)

Datasets Features

>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder"  # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/3.4.1...3.5.0
