Datasets

$ npx -y @buildinternet/releases show datasets
Releases: 8 · Avg: 3/mo · Versions: v4.5.0 → v4.8.3
Nov 16, 2023

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.7...2.15.0

Nov 15, 2023

Bug Fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.6...2.14.7

Oct 24, 2023

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.5...2.14.6

Bug fixes

Other improvements

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5

Sep 6, 2023

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.13.2

Aug 8, 2023

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.3...2.14.4

Aug 3, 2023

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.2...2.14.3

Jul 31, 2023

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.1...2.14.2

Jul 27, 2023

Bug fixes

Other improvements

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.0...2.14.1

Jul 24, 2023

Important: caching

  • Datasets downloaded and cached using datasets>=2.14.0 may not be reloadable from the cache with older versions of datasets, and would therefore be re-downloaded.
  • Datasets that were already cached are still supported.
  • This affects datasets on Hugging Face without dataset scripts, e.g. those made of pure parquet, csv, jsonl, etc. files.
  • This is because the default configuration name for those datasets was changed (from "username--dataset_name" to "default") in https://github.com/huggingface/datasets/pull/5331.
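
To illustrate why the rename leads to cache misses on older versions, here is a minimal sketch. The helper names and the toy cache keyed by (repository, config name) are assumptions made for illustration only; they do not reflect how the library lays out its cache internally.

```python
def legacy_default_config_name(repo_id: str) -> str:
    """Hypothetical helper: pre-2.14.0 default config name, e.g. 'username--dataset_name'."""
    return repo_id.replace("/", "--")


def new_default_config_name(repo_id: str) -> str:
    """Hypothetical helper: default config name from 2.14.0 onwards."""
    return "default"


repo_id = "username/dataset_name"
old_name = legacy_default_config_name(repo_id)
new_name = new_default_config_name(repo_id)

# A toy cache keyed by (repo_id, config_name); assumed for illustration only.
cache = {(repo_id, new_name): "cached by datasets>=2.14.0"}

# An older version would look the entry up under the old name and miss,
# which is why the dataset would be downloaded again.
found_by_old_version = (repo_id, old_name) in cache
print(old_name, new_name, found_by_old_version)
```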

Dataset Configuration

  • Support for multiple configs via metadata yaml info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331

    • Configure your dataset using YAML at the top of your dataset card (docs here)
    • Choose which file goes into which split
      ---
      configs:
      - config_name: default
        data_files:
        - split: train
          path: data.csv
        - split: test
          path: holdout.csv
      ---
    • Define multiple dataset configurations
      ---
      configs:
      - config_name: main_data
        data_files: main_data.csv
      - config_name: additional_data
        data_files: additional_data.csv
      ---
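
As a rough illustration of how such metadata maps configs to splits and files, the sketch below writes two of the configs above as the Python structure a YAML parser would produce and resolves them by name. The helper split_to_files is hypothetical, and treating a bare path as the "train" split is an assumption of this sketch.

```python
# Two of the configs above, as the structure a YAML parser would produce.
configs = [
    {
        "config_name": "default",
        "data_files": [
            {"split": "train", "path": "data.csv"},
            {"split": "test", "path": "holdout.csv"},
        ],
    },
    {"config_name": "additional_data", "data_files": "additional_data.csv"},
]


def split_to_files(configs, config_name):
    """Hypothetical helper: map each split of one config to its file paths."""
    for config in configs:
        if config["config_name"] == config_name:
            data_files = config["data_files"]
            if isinstance(data_files, str):
                # Assumption: a bare path is treated as the "train" split.
                return {"train": [data_files]}
            return {entry["split"]: [entry["path"]] for entry in data_files}
    raise KeyError(f"unknown config: {config_name}")


print(split_to_files(configs, "default"))
print(split_to_files(configs, "additional_data"))
```

With the library itself, a config is selected by passing its name as the second argument to load_dataset, for example load_dataset("username/dataset_name", "main_data"), where the repository name is a placeholder.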

Dataset Features

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.14.0

Jun 22, 2023

General improvements and bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.0...2.13.1

Jun 14, 2023

Dataset Features

  • Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770

    • Stream the data from your Spark DataFrame directly to your training pipeline
    from datasets import IterableDataset
    from torch.utils.data import DataLoader
    
    # df is an existing Spark DataFrame, assumed to be defined elsewhere
    ids = IterableDataset.from_spark(df)
    ids = ids.map(...).filter(...).with_format("torch")
    for batch in DataLoader(ids, batch_size=16, num_workers=4):
        ...
    
  • IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:

    from datasets import load_dataset
    
    ids = load_dataset("c4", "en", split="train", streaming=True)
    ids = ids.map(...).with_format("torch")  # to get PyTorch tensors - also works with tf, np, jax etc.
    
  • Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893

    from datasets import IterableDataset
    
    ids = IterableDataset.from_file("path/to/data.arrow")
    
  • Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944

    from datasets import load_dataset
    
    ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
    

Experimental

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...2.13.0

Apr 28, 2023

Datasets Features

  • Add Dataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5701

    • Get a Dataset from a Spark DataFrame (docs):
    >>> from datasets import Dataset
    >>> ds = Dataset.from_spark(df)
    
  • Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in https://github.com/huggingface/datasets/pull/5689

    • Stream data from Wikipedia:
    >>> from datasets import load_dataset
    >>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
    >>> next(iter(ds["train"]))
    {'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...'}
    
  • Implement sharding on merged iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5735

    • Use interleaved datasets in a distributed setup or with a DataLoader
    >>> from datasets import load_dataset, interleave_datasets
    >>> from torch.utils.data import DataLoader
    >>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
    >>> c4 = load_dataset("c4", "en", split="train", streaming=True)
    >>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
    >>> dataloader = DataLoader(merged, num_workers=4)
    
  • Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in https://github.com/huggingface/datasets/pull/5751

    • Return a list of lists instead of a list of NumPy arrays when converting the variable-shaped ArrayND to Python
    • Improve the NumPy conversion by returning a numeric NumPy array when the offsets are equal or a NumPy object array when they aren't
    • Allow converting the variable-shaped ArrayND to Pandas
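
Returning to the interleave_datasets entry above, the roles of probabilities and stopping_strategy can be illustrated with a small pure-Python sketch. It is a toy re-implementation under simplifying assumptions (in-memory, non-empty, re-iterable sources; exhaustion detected only when a draw runs past the end of a source), not the library's algorithm, and the function name interleave is hypothetical.

```python
import random


def interleave(sources, probabilities, seed, stopping_strategy="first_exhausted"):
    """Toy sampler for illustration; sources must be non-empty and re-iterable (e.g. lists)."""
    rng = random.Random(seed)
    iterators = [iter(source) for source in sources]
    exhausted = [False] * len(sources)
    while True:
        # Pick the next source according to the sampling probabilities.
        i = rng.choices(range(len(sources)), weights=probabilities)[0]
        try:
            item = next(iterators[i])
        except StopIteration:
            exhausted[i] = True
            if stopping_strategy == "first_exhausted" or all(exhausted):
                return
            # "all_exhausted": restart the finished source (oversampling it)
            # until every source has been read through at least once.
            iterators[i] = iter(sources[i])
            item = next(iterators[i])
        yield item


wiki = ["w0", "w1"]
c4 = ["c0", "c1", "c2"]
merged = list(interleave([wiki, c4], [0.1, 0.9], seed=42, stopping_strategy="all_exhausted"))
first = list(interleave([wiki, c4], [0.1, 0.9], seed=42))
print(len(merged), len(first))
```

With "all_exhausted", every example from every source appears at least once and the more frequently sampled source is repeated; with "first_exhausted", iteration stops as soon as any one source runs out.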

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.11.0...2.12.0

Mar 29, 2023

Important

  • Use soundfile for mp3 decoding instead of torchaudio by @polinaeterna in https://github.com/huggingface/datasets/pull/5573
    • This removes the dependency on PyTorch for decoding audio files
    • This was made possible by soundfile 0.12, which bundles recent libsndfile binaries with MP3 support
  • Deprecated batch_size on Dataset.to_dict()

Datasets Features

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.11.0

Feb 28, 2023

What's Changed

Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.10.1

Feb 22, 2023

Important

  • Avoid saving sparse ChunkedArrays in pyarrow tables by @marioga in https://github.com/huggingface/datasets/pull/5542
    • Big improvements on the speed of .flatten_indices() (x2) + save/load_from_disk (x100) on selected/shuffled datasets
  • Skip dataset verifications by default by @mariosasko in https://github.com/huggingface/datasets/pull/5303
    • Introduces multiple verification_mode values that you can pass to load_dataset()
    • the new default verification steps are much faster (no need to compute expensive checksums)
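
The difference in cost between the modes can be sketched as follows. The mode names "no_checks", "basic_checks" and "all_checks" are the string values the library is understood to accept, but the list is not reproduced in these notes, so treat them as an assumption. The verify function is a hypothetical stand-in, not the library's code.

```python
import hashlib


def verify(mode, num_examples, expected_num_examples, data=b"", expected_sha256=None):
    """Hypothetical checker mirroring three verification modes."""
    if mode == "no_checks":
        return True
    # Cheap check, performed in both "basic_checks" and "all_checks".
    if num_examples != expected_num_examples:
        return False
    if mode == "all_checks":
        # Expensive check: hash every downloaded byte and compare.
        return hashlib.sha256(data).hexdigest() == expected_sha256
    return True


data = b"a,b\n1,2\n"
good_checksum = hashlib.sha256(data).hexdigest()

print(verify("basic_checks", 1, 1, data, "stale-checksum"))
print(verify("all_checks", 1, 1, data, "stale-checksum"))
print(verify("all_checks", 1, 1, data, good_checksum))
```

Only the "all_checks" path reads and hashes the file contents, which is the step the faster default skips.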

Datasets features

Documentation

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.9.0...2.10.0

Jan 26, 2023

Datasets Features

Documentation

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.8.0...2.9.0

Dec 19, 2022

Important

  • Removed YAML integer keys from class_label metadata by @albertvillanova in https://github.com/huggingface/datasets/pull/5277
    • From now on, datasets pushed to the Hub that use ClassLabel will use a new YAML model to store the feature types
    • The new model uses strings instead of integers for the ids in the label name mapping (e.g. 0 -> "0"). This is due to Hub limitations. In a few months the Hub may stop allowing users to push the old YAML model.
    • Old versions of datasets cannot reload datasets pushed with this new model, so we encourage everyone to update.
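
The change of key type can be pictured with a short sketch. The two helper functions and the example label names are hypothetical and only illustrate the integer-to-string mapping described above; they are not the library's serialization code.

```python
def stringify_label_ids(names_by_id):
    """Hypothetical helper: old model (integer ids) -> new model (string ids)."""
    return {str(label_id): name for label_id, name in names_by_id.items()}


def restore_label_ids(names_by_str_id):
    """Hypothetical helper: read the new model back into integer ids."""
    return {int(label_id): name for label_id, name in names_by_str_id.items()}


old_model = {0: "negative", 1: "positive"}
new_model = stringify_label_ids(old_model)

print(new_model)  # {'0': 'negative', '1': 'positive'}
print(restore_label_ids(new_model) == old_model)  # True
```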

Datasets Features

Docs

General improvements and bug fixes

New Contributors

Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.8.0

Nov 22, 2022

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.6.2

Bug fixes

Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.7.1

Latest: 4.8.4
Tracking since: Apr 30, 2021
Last fetched: Apr 19, 2026