Full Changelog: https://github.com/huggingface/datasets/compare/3.4.0...3.4.1
Faster folder based builder + parquet support + allow repeated media + use torchvideo by @lhoestq in https://github.com/huggingface/datasets/pull/7424
Replaced decord with torchvision to read videos, since decord is not maintained anymore and isn't available for recent python versions; see the video dataset loading documentation for more details. The Video type is still marked as experimental in this version.

```python
from datasets import load_dataset, Video

dataset = load_dataset("path/to/video/folder", split="train")
dataset[0]["video"]  # <torchvision.io.video_reader.VideoReader at 0x1652284c0>
```
Support `metadata.parquet` in addition to `metadata.csv` or `metadata.jsonl` for the metadata of the image/audio/video files.
Add IterableDataset.decode with multithreading by @lhoestq in https://github.com/huggingface/datasets/pull/7450

```python
dataset = dataset.decode(num_threads=num_threads)
```
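The idea behind `num_threads` can be pictured with a plain thread pool. This is a hedged sketch of the concept, not the library's implementation; `decode_one` is a hypothetical stand-in for decoding a single image/audio/video file:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for decoding one media file; here it just squares
# a number so the example is self-contained and runnable.
def decode_one(encoded):
    return encoded * encoded

def decode_batch(batch, num_threads=4):
    # Decode a batch with a thread pool: the same idea num_threads enables,
    # overlapping I/O-bound decoding work across threads.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(decode_one, batch))

print(decode_batch([1, 2, 3]))  # [1, 4, 9]
```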
Add with_split to DatasetDict.map by @jp1924 in https://github.com/huggingface/datasets/pull/7368
`string_to_dict` to return None if there is no match instead of raising ValueError by @ringohoffman in https://github.com/huggingface/datasets/pull/7435
`ds.set_epoch(new_epoch)` by @lhoestq in https://github.com/huggingface/datasets/pull/7451
Full Changelog: https://github.com/huggingface/datasets/compare/3.3.2...3.4.0
Full Changelog: https://github.com/huggingface/datasets/compare/3.3.1...3.3.2
Full Changelog: https://github.com/huggingface/datasets/compare/3.3.0...3.3.1
Support async functions in map() by @lhoestq in https://github.com/huggingface/datasets/pull/7384
```python
prompt = "Answer the following question: {question}. You should think step by step."

async def ask_llm(example):
    return await query_model(prompt.format(question=example["question"]))

ds = ds.map(ask_llm)
```
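`query_model` above is left undefined, so here is a self-contained sketch of why async functions help: the awaits overlap instead of running one after another. The `query_model` below is a hypothetical stand-in that just sleeps:

```python
import asyncio

# Hypothetical async "model call": sleeps briefly and echoes the question.
async def query_model(prompt):
    await asyncio.sleep(0.01)
    return f"answer to: {prompt}"

async def main():
    questions = ["q1", "q2", "q3"]
    # map() runs async functions concurrently; asyncio.gather shows the same
    # effect: the three awaits run overlapped rather than serially.
    return await asyncio.gather(*(query_model(q) for q in questions))

answers = asyncio.run(main())
print(answers[0])  # answer to: q1
```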
Add repeat method to datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7198
```python
ds = ds.repeat(10)
```
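A minimal sketch of what `repeat(10)` means, using a plain list as a stand-in dataset (an illustration, not the library's implementation):

```python
# repeat() concatenates the dataset with itself num_times times,
# preserving example order within each pass.
def repeat(examples, num_times):
    return examples * num_times

ds = [{"a": 0}, {"a": 1}]
print(len(repeat(ds, 10)))  # 20
```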
Support faster processing using pandas or polars functions in IterableDataset.map() by @lhoestq in https://github.com/huggingface/datasets/pull/7370
```python
import polars as pl
from datasets import load_dataset

ds = load_dataset("ServiceNow-AI/R1-Distill-SFT", "v0", split="train", streaming=True)
ds = ds.with_format("polars")
expr = pl.col("solution").str.extract("boxed\\{(.*)\\}").alias("value_solution")
ds = ds.map(lambda df: df.with_columns(expr), batched=True)
```
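For readers unfamiliar with the polars expression above, here is a pure-Python equivalent of the `boxed\{(.*)\}` extraction using `re` (an illustrative sketch, not part of the release):

```python
import re

# Pull the final answer out of a LaTeX-style \boxed{...} marker in a
# solution string, mirroring the polars str.extract expression.
def extract_boxed(solution):
    match = re.search(r"boxed\{(.*)\}", solution)
    return match.group(1) if match else None

print(extract_boxed(r"... so the answer is \boxed{42}"))  # 42
```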
Apply formatting after iter_arrow to speed up format -> map, filter for iterable datasets by @alex-hh in https://github.com/huggingface/datasets/pull/7207
Full Changelog: https://github.com/huggingface/datasets/compare/3.2.0...3.3.0
```python
from datasets import load_dataset

filters = [('date', '>=', '2023')]
ds = load_dataset("HuggingFaceFW/fineweb-2", "fra_Latn", streaming=True, filters=filters)
```
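The `filters` argument takes parquet-style predicate tuples. A hedged pure-Python sketch of what a tuple like `('date', '>=', '2023')` means when applied to rows (in practice the predicate is pushed down to the parquet reader, so non-matching row groups can be skipped entirely):

```python
import operator

# Map filter operator strings to Python comparison functions.
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       "<=": operator.le, ">": operator.gt, ">=": operator.ge}

def matches(row, filters):
    # A row passes when every (column, op, value) predicate holds.
    return all(OPS[op](row[col], value) for col, op, value in filters)

rows = [{"date": "2022-12-31"}, {"date": "2023-05-01"}]
kept = [r for r in rows if matches(r, [("date", ">=", "2023")])]
print(len(kept))  # 1
```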
`ClassLabel` by @sergiopaniego in https://github.com/huggingface/datasets/pull/7293
Full Changelog: https://github.com/huggingface/datasets/compare/3.1.0...3.2.0
```python
>>> from datasets import Dataset, Video, load_dataset
>>> ds = Dataset.from_dict({"video": ["path/to/Screen Recording.mov"]}).cast_column("video", Video())
>>> # or from the hub
>>> ds = load_dataset("username/dataset_name", split="train")
>>> ds[0]["video"]
<decord.video_reader.VideoReader at 0x105525c70>
```
```python
>>> from datasets import load_dataset
>>> full_ds = load_dataset("amphion/Emilia-Dataset", split="train", streaming=True)
>>> full_ds.num_shards
2360
>>> ds = full_ds.shard(num_shards=full_ds.num_shards, index=0)
>>> ds.num_shards
1
>>> ds = full_ds.shard(num_shards=8, index=0)
>>> ds.num_shards
295
```
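The shard counts above follow from simple arithmetic: 2360 underlying files split into 8 contiguous groups gives 295 files per shard. A sketch of that assignment (an illustration, not the library's exact code):

```python
# Split num_files into num_shards contiguous groups; when the division is
# uneven, the first `mod` shards get one extra file.
def shard_sizes(num_files, num_shards):
    div, mod = divmod(num_files, num_shards)
    return [div + (1 if i < mod else 0) for i in range(num_shards)]

sizes = shard_sizes(2360, 8)
print(sizes[0])  # 295
```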
Full Changelog: https://github.com/huggingface/datasets/compare/3.0.2...3.1.0
Full Changelog: https://github.com/huggingface/datasets/compare/3.0.1...3.0.2
Full Changelog: https://github.com/huggingface/datasets/compare/3.0.0...3.0.1
Allow Polars as valid output type in `.map()` by @psmyth94 in https://github.com/huggingface/datasets/pull/6762
Example:
```python
>>> import polars as pl
>>> from datasets import load_dataset
>>> ds = load_dataset("lhoestq/CudyPokemonAdventures", split="train").with_format("polars")
>>> cols = [pl.col("content").str.len_bytes().alias("length")]
>>> ds_with_length = ds.map(lambda df: df.with_columns(cols), batched=True)
>>> ds_with_length[:5]
shape: (5, 5)
┌─────┬───────────────────────────────────┬───────────────────────────────────┬───────────────────────┬────────┐
│ idx ┆ title                             ┆ content                           ┆ labels                ┆ length │
│ --- ┆ ---                               ┆ ---                               ┆ ---                   ┆ ---    │
│ i64 ┆ str                               ┆ str                               ┆ str                   ┆ u32    │
╞═════╪═══════════════════════════════════╪═══════════════════════════════════╪═══════════════════════╪════════╡
│ 0   ┆ The Joyful Adventure of Bulbasau… ┆ Bulbasaur embarked on a sunny qu… ┆ joyful_adventure      ┆ 180    │
│ 1   ┆ Pikachu's Quest for Peace         ┆ Pikachu, with his cheeky persona… ┆ peaceful_narrative    ┆ 138    │
│ 2   ┆ The Tender Tale of Squirtle       ┆ Squirtle took everyone on a memo… ┆ gentle_adventure      ┆ 135    │
│ 3   ┆ Charizard's Heartwarming Tale     ┆ Charizard found joy in helping o… ┆ heartwarming_story    ┆ 112    │
│ 4   ┆ Jolteon's Sparkling Journey       ┆ Jolteon, with his zest for life,… ┆ celebratory_narrative ┆ 111    │
└─────┴───────────────────────────────────┴───────────────────────────────────┴───────────────────────┴────────┘
```
huggingface_hub cache by @lhoestq in https://github.com/huggingface/datasets/pull/7105
`huggingface_hub` cache for files downloaded from HF, by default at `~/.cache/huggingface/hub`
`datasets` cache, by default at `~/.cache/huggingface/datasets`
Removed: `use_auth_token`, `fs` or `ignore_verifications`
Removed: `load_metric`, please use the `evaluate` library instead
Removed: the `task` argument in `load_dataset()`, the `.prepare_for_task()` method and the `datasets.tasks` module
`cache_dir` from `cache_file_name` by @ringohoffman in https://github.com/huggingface/datasets/pull/7096
Full Changelog: https://github.com/huggingface/datasets/compare/2.21.0...3.0.0
```python
import polars as pl
from datasets import Dataset

df1 = pl.from_dict({"col_1": [[1, 2], [3, 4]]})
df2 = Dataset.from_polars(df1).to_polars()
assert df1.equals(df2)
```
`HF_HUB_OFFLINE` instead of `HF_DATASETS_OFFLINE` by @Wauplin in https://github.com/huggingface/datasets/pull/6968
Full Changelog: https://github.com/huggingface/datasets/compare/2.20.0...2.21.0
`trust_remote_code=True` by @lhoestq in https://github.com/huggingface/datasets/pull/6954
Datasets with a loading script now require `trust_remote_code=True` to be used.
You can now checkpoint and resume an iterable dataset (e.g. when streaming):
```python
>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
>>> for idx, example in enumerate(iterable_dataset):
...     print(example)
...     if idx == 2:
...         state_dict = iterable_dataset.state_dict()
...         print("checkpoint")
...         break
>>> iterable_dataset.load_state_dict(state_dict)
>>> print("restart from checkpoint")
>>> for example in iterable_dataset:
...     print(example)
```
Returns:

```
{'a': 0}
{'a': 1}
{'a': 2}
checkpoint
restart from checkpoint
{'a': 3}
{'a': 4}
{'a': 5}
```
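The mechanics can be sketched in plain Python: the state is essentially how far the stream has been consumed, and resuming skips ahead. This is an illustrative toy, not the library's implementation:

```python
# Toy resumable stream: the checkpoint is just the consumed position.
class ResumableStream:
    def __init__(self, data):
        self.data = data
        self.position = 0

    def __iter__(self):
        while self.position < len(self.data):
            example = self.data[self.position]
            self.position += 1
            yield example

    def state_dict(self):
        return {"position": self.position}

    def load_state_dict(self, state):
        self.position = state["position"]

stream = ResumableStream([{"a": i} for i in range(6)])
it = iter(stream)
consumed = [next(it) for _ in range(3)]   # consume 3 examples
state = stream.state_dict()               # checkpoint

fresh = ResumableStream([{"a": i} for i in range(6)])
fresh.load_state_dict(state)              # restart from checkpoint
resumed = list(fresh)
print(resumed[0])  # {'a': 3}
```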
`.pth` support for torch tensors by @lhoestq in https://github.com/huggingface/datasets/pull/6920
`dataset_module_factory` by @Wauplin in https://github.com/huggingface/datasets/pull/6959
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.20.0
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.1...2.19.2
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.19.1
`.to_polars()`:

```python
import polars as pl
from datasets import load_dataset

ds = load_dataset("DIBT/10k_prompts_ranked", split="train")
ds.to_polars() \
    .group_by("topic") \
    .agg(pl.len(), pl.first()) \
    .sort("len", descending=True)

ds = ds.with_format("polars")
ds[:10].group_by("kind").len()
```
fsspec support for to_json, to_csv, and to_parquet by @alvarobartt in https://github.com/huggingface/datasets/pull/6096
```python
ds.to_json("hf://datasets/username/my_json_dataset/data.jsonl")
ds.to_csv("hf://datasets/username/my_csv_dataset/data.csv")
ds.to_parquet("hf://datasets/username/my_parquet_dataset/data.parquet")
```
mode parameter to Image feature by @mariosasko in https://github.com/huggingface/datasets/pull/6735
```python
dataset = dataset.cast_column("image", Image(mode="RGB"))
```
```shell
datasets-cli convert_to_parquet <dataset_id>
```
```python
ds = ds.take(10)  # take only the first 10 examples
```
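`take(n)` on a streaming dataset behaves like `itertools.islice`: it yields the first n examples without consuming the rest of the stream. A sketch of the semantics on a plain generator (illustrative only):

```python
from itertools import islice

# Yield at most n examples from an iterable, leaving the rest unread.
def take(iterable, n):
    return list(islice(iterable, n))

stream = ({"id": i} for i in range(1_000_000))
first_ten = take(stream, 10)
print(len(first_ten))  # 10
```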
`remove_columns`/`rename_columns` doc fixes by @mariosasko in https://github.com/huggingface/datasets/pull/6772
`uv` in CI by @mariosasko in https://github.com/huggingface/datasets/pull/6779
`_check_legacy_cache2` by @lhoestq in https://github.com/huggingface/datasets/pull/6792
`DatasetBuilder._split_generators` incomplete type annotation by @JonasLoos in https://github.com/huggingface/datasets/pull/6799
`CachedDatasetModuleFactory` and `Cache` by @izhx in https://github.com/huggingface/datasets/pull/6754
`os.path.relpath` in `resolve_patterns` by @mariosasko in https://github.com/huggingface/datasets/pull/6815
`Dataset.__getitem__` by @mariosasko in https://github.com/huggingface/datasets/pull/6817
Full Changelog: https://github.com/huggingface/datasets/compare/2.18.0...2.19.0
`num_workers` could lead to incorrect shards assignments to workers and cause errors
`xlistdir` by @mariosasko in https://github.com/huggingface/datasets/pull/6698
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.1...2.18.0
`arrow_writer.py` from #6636 by @bryant1410 in https://github.com/huggingface/datasets/pull/6664
Full Changelog: https://github.com/huggingface/datasets/compare/2.17.0...2.17.1
`drop_last_batch` in map after shuffling or sharding by @lhoestq in https://github.com/huggingface/datasets/pull/6575
`setup.cfg` to `pyproject.toml` by @mariosasko in https://github.com/huggingface/datasets/pull/6619
`tqdm` bars in non-interactive environments by @mariosasko in https://github.com/huggingface/datasets/pull/6627
`with_rank` param to `Dataset.filter` by @mariosasko in https://github.com/huggingface/datasets/pull/6608
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.1...2.17.0
`cache_dir` to `load_dataset`
`load_dataset("ted_talks_iwslt", language_pair=("ja", "en"), year="2015")`
Full Changelog: https://github.com/huggingface/datasets/compare/2.16.0...2.16.1
https://hf.co/datasets/<repo_id>. A warning is shown to let the user know about the custom code, and they can avoid this message in future by passing the argument `trust_remote_code=True`.
`trust_remote_code=True` will be mandatory to load these datasets from the next major release of datasets.
With `HF_DATASETS_TRUST_REMOTE_CODE=0` you can already disable custom code by default without waiting for the next release of datasets.
https://hf.co/datasets/<repo_id>/tree/refs%2Fconvert%2Fparquet
The `load_dataset` step that lists the data files of big repositories is faster (up to x100) but requires huggingface_hub 0.20 or newer.
Fix in `load_dataset` that used to reload data from cache even if the dataset was updated on Hugging Face.
`~/.cache/huggingface/datasets/username___dataset_name/config_name/version/commit_sha`
Datasets cached with datasets 2.15 (using the old scheme) are still reloaded from cache.
`_get_data_files_patterns` by @lhoestq in https://github.com/huggingface/datasets/pull/6343
`usedforsecurity=False` in hashlib methods (FIPS compliance) by @Wauplin in https://github.com/huggingface/datasets/pull/6414
`ruff` for formatting by @mariosasko in https://github.com/huggingface/datasets/pull/6434
`tqdm` wrapper by @mariosasko in https://github.com/huggingface/datasets/pull/6433
`Table.__getstate__` and `Table.__setstate__` by @LZHgrla in https://github.com/huggingface/datasets/pull/6444
`filelock` package for file locking by @mariosasko in https://github.com/huggingface/datasets/pull/6445
`**` by @mariosasko in https://github.com/huggingface/datasets/pull/6449
`dill` logic by @mariosasko in https://github.com/huggingface/datasets/pull/6454
`push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6461
`__repr__` by @lhoestq in https://github.com/huggingface/datasets/pull/6480
`torch.Generator` objects by @mariosasko in https://github.com/huggingface/datasets/pull/6502
`list_files_info` with `list_repo_tree` in `push_to_hub` by @mariosasko in https://github.com/huggingface/datasets/pull/6510
Full Changelog: https://github.com/huggingface/datasets/compare/2.15.0...2.16.0