Full Changelog: https://github.com/huggingface/datasets/compare/4.8.3...4.8.4
Full Changelog: https://github.com/huggingface/datasets/compare/4.8.2...4.8.3
Full Changelog: https://github.com/huggingface/datasets/compare/4.8.1...4.8.2
Full Changelog: https://github.com/huggingface/datasets/compare/4.8.0...4.8.1
Read (and write) from HF Storage Buckets: load raw data, process and save to Dataset Repos by @lhoestq in https://github.com/huggingface/datasets/pull/8064
from datasets import load_dataset
# load raw data from a Storage Bucket on HF
ds = load_dataset("buckets/username/data-bucket", data_files=["*.jsonl"])
# or manually, using hf:// paths
ds = load_dataset("json", data_files=["hf://buckets/username/data-bucket/*.jsonl"])
# process, filter
ds = ds.map(...).filter(...)
# publish the AI-ready dataset
ds.push_to_hub("username/my-dataset-ready-for-training")
This also fixes multiprocessed push_to_hub on macOS, which was causing segfaults (it now uses spawn instead of fork).
It also bumps the dill and multiprocess versions to support Python 3.14.
Datasets streaming iterable package improvements and fixes by @Michael-RDev in https://github.com/huggingface/datasets/pull/8068
Add max_shard_size to IterableDataset.push_to_hub (but this requires iterating twice over the full dataset to know its size - improvements are welcome)
Support paths inside archives like zip://*.jsonl::hf://datasets/username/dataset-name/data.zip
Full Changelog: https://github.com/huggingface/datasets/compare/4.7.0...4.8.0
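The zip://…::hf://… chained path means "read matching files inside an archive hosted on the Hub". As a network-free illustration of the inner step (globbing *.jsonl members out of a zip and parsing them), here is a stdlib-only sketch with made-up file names:

```python
import fnmatch
import io
import json
import zipfile

# Build a small zip archive in memory (stand-in for data.zip on the Hub)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.jsonl", json.dumps({"x": 1}) + "\n")
    zf.writestr("readme.txt", "not data")

# Inner step of zip://*.jsonl::…: glob members inside the archive, parse each line
with zipfile.ZipFile(buf) as zf:
    members = fnmatch.filter(zf.namelist(), "*.jsonl")
    rows = [json.loads(line)
            for name in members
            for line in zf.read(name).decode().splitlines()]
```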
Json() type by @lhoestq in https://github.com/huggingface/datasets/pull/8027
The Json() type is used to store data that would normally not be supported in Arrow/Parquet.
You can use the Json() type in Features() for any dataset; it is supported in any function that accepts features=, like load_dataset(), .map(), .cast(), .from_dict(), .from_list().
Pass on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map().
Examples:
You can use on_mixed_types="use_json" or specify features= with a Json() type:
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
...
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64
>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, "foo", {"subfield": "bar"}]
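Under the hood, an Arrow column must have a single type; the idea behind Json() is to serialize arbitrary values to JSON strings for storage and decode them back on access. A minimal stdlib sketch of that round trip (not the library's actual implementation):

```python
import json

# Mixed-type values that a single-typed Arrow column couldn't hold:
values = [0, "foo", {"subfield": "bar"}]

stored = [json.dumps(v) for v in values]    # one Arrow-friendly string column
restored = [json.loads(s) for s in stored]  # the original mixed types come back
```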
This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]] # missing fields are filled with None
>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]] # OK
Another example with tool-calling data and the on_mixed_types="use_json" argument (useful to avoid specifying features= manually):
>>> messages = [
... {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
... {"role": "assistant", "tool_calls": [
... {"type": "function", "function": {
... "name": "control_light",
... "arguments": {"room": "living room", "state": "on"}
... }},
... {"type": "function", "function": {
... "name": "play_music",
... "arguments": {"playlist": "electronic"} # mixed-type here since keys ["playlist"] and ["room", "state"] are different
... }}]
... },
... {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
... {"role": "tool", "name": "play_music", "content": "The music is now playing."},
... {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0][1]["tool_calls"][0]["function"]["arguments"]
{"room": "living room", "state": "on"}
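A toy illustration of the decision that on_mixed_types="use_json" automates: if a column's values don't all share one type, fall back to JSON storage (the function name and logic here are illustrative, not the library's):

```python
def infer_storage(values):
    # If the values don't all share a single Python type, a native Arrow
    # column can't hold them, so fall back to JSON storage.
    types = {type(v) for v in values}
    return "json" if len(types) > 1 else "native"
```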
Full Changelog: https://github.com/huggingface/datasets/compare/4.6.1...4.7.0
Full Changelog: https://github.com/huggingface/datasets/compare/4.6.0...4.6.1
Support Image, Video and Audio types in Lance datasets
>>> from datasets import load_dataset
>>> ds = load_dataset("lance-format/Openvid-1M", streaming=True, split="train")
>>> ds.features
{'video_blob': Video(),
'video_path': Value('string'),
'caption': Value('string'),
'aesthetic_score': Value('float64'),
'motion_score': Value('float64'),
'temporal_consistency_score': Value('float64'),
'camera_motion': Value('string'),
'frame': Value('int64'),
'fps': Value('float64'),
'seconds': Value('float64'),
'embedding': List(Value('float32'), length=1024)}
Push to hub now supports Video types
>>> from datasets import Dataset, Video
>>> ds = Dataset.from_dict({"video": ["path/to/video.mp4"]})
>>> ds = ds.cast_column("video", Video())
>>> ds.push_to_hub("username/my-video-dataset")
Write image/audio/video blobs as is in parquet (PLAIN) in push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7976
Add IterableDataset.reshard() by @lhoestq in https://github.com/huggingface/datasets/pull/7992
Reshard the dataset if possible, i.e. split the current shards further into more shards. This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards. Equality may happen if no shard can be split further.
The resharding mechanism depends on the dataset file format:
>>> from datasets import load_dataset
>>> ds = load_dataset("fancyzhx/amazon_polarity", split="train", streaming=True)
>>> ds
IterableDataset({
features: ['label', 'title', 'content'],
num_shards: 4
})
>>> ds.reshard()
IterableDataset({
features: ['label', 'title', 'content'],
num_shards: 3600
})
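A toy sketch of what resharding does: split existing shards (modeled here as row ranges) until a target count is reached, stopping once nothing can be split further, so num_shards may stay equal. Names and logic are illustrative, not the library's:

```python
def reshard(shards, target_num_shards):
    # shards: list of (start, stop) row ranges standing in for real shards
    shards = list(shards)
    while len(shards) < target_num_shards:
        # split the largest remaining shard in two
        i = max(range(len(shards)), key=lambda j: shards[j][1] - shards[j][0])
        start, stop = shards[i]
        if stop - start < 2:
            break  # no shard can be split further: num_shards stays equal
        mid = (start + stop) // 2
        shards[i:i + 1] = [(start, mid), (mid, stop)]
    return shards
```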
transformers v5 and huggingface_hub v1 by @hanouticelina in https://github.com/huggingface/datasets/pull/7989
Full Changelog: https://github.com/huggingface/datasets/compare/4.5.0...4.6.0
Add lance format support by @eddyxu in https://github.com/huggingface/datasets/pull/7913
from datasets import load_dataset
ds = load_dataset("lance-format/fineweb-edu", streaming=True)
for example in ds["train"]:
...
revision in load_dataset by @Scott-Simmons in https://github.com/huggingface/datasets/pull/7929
Full Changelog: https://github.com/huggingface/datasets/compare/4.4.2...4.5.0
Full Changelog: https://github.com/huggingface/datasets/compare/4.4.1...4.4.2
Full Changelog: https://github.com/huggingface/datasets/compare/4.4.0...4.4.1
Add nifti support by @CloseChoice in https://github.com/huggingface/datasets/pull/7815
from datasets import load_dataset, Dataset, Nifti
ds = load_dataset("username/my_nifti_dataset")
ds["train"][0] # {"nifti": <nibabel.nifti1.Nifti1Image>}
files = ["/path/to/scan_001.nii.gz", "/path/to/scan_002.nii.gz"]
ds = Dataset.from_dict({"nifti": files}).cast_column("nifti", Nifti())
ds[0] # {"nifti": <nibabel.nifti1.Nifti1Image>} (from_dict returns a Dataset, not a DatasetDict)
Add num channels to audio by @CloseChoice in https://github.com/huggingface/datasets/pull/7840
# samples have shape (num_channels, num_samples)
ds = ds.cast_column("audio", Audio()) # default, use all channels
ds = ds.cast_column("audio", Audio(num_channels=2)) # use stereo
ds = ds.cast_column("audio", Audio(num_channels=1)) # use mono
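As an illustration of what a channel-count conversion implies, here is a toy downmix of a stereo (2, num_samples) signal to mono by averaging channels (the library's actual conversion strategy may differ):

```python
def to_mono(channels):
    # channels: per-channel sample lists, shape (num_channels, num_samples)
    num_samples = len(channels[0])
    return [sum(ch[i] for ch in channels) / len(channels)
            for i in range(num_samples)]
```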
_batch_setitems() by @sghng in https://github.com/huggingface/datasets/pull/7817
Full Changelog: https://github.com/huggingface/datasets/compare/4.3.0...4.4.0
Enable large scale distributed dataset streaming:
These improvements require huggingface_hub>=1.1.0 to take full effect
from_generator by @simonreise in https://github.com/huggingface/datasets/pull/7533
Full Changelog: https://github.com/huggingface/datasets/compare/4.2.0...4.3.0
Sample without replacement option when interleaving datasets by @radulescupetru in https://github.com/huggingface/datasets/pull/7786
ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
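A toy sketch of the new stopping strategy's semantics: keep drawing from randomly chosen datasets, dropping each one once exhausted, so every example is seen exactly once (illustrative code, not the library's implementation):

```python
import random

def interleave_without_replacement(datasets, seed=0):
    # Draw from a randomly chosen dataset each step; once a dataset is
    # exhausted it is dropped, so every example appears exactly once.
    rng = random.Random(seed)
    iterators = [iter(d) for d in datasets]
    out = []
    while iterators:
        it = rng.choice(iterators)
        try:
            out.append(next(it))
        except StopIteration:
            iterators.remove(it)
    return out
```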
Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in https://github.com/huggingface/datasets/pull/7806
ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
Add parquet scan options and docs by @lhoestq in https://github.com/huggingface/datasets/pull/7801
ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20))
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
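The filters= argument uses pyarrow-style (column, op, value) predicates. A toy illustration of how one such tuple selects rows (the helper name is made up):

```python
import operator

# pyarrow-style predicate operators
OPS = {"==": operator.eq, "!=": operator.ne, "<": operator.lt,
       ">": operator.gt, "<=": operator.le, ">=": operator.ge}

def apply_filter(rows, predicate):
    col, op, value = predicate
    return [row for row in rows if OPS[op](row[col], value)]

rows = [{"col_0": 0, "col_1": "a"}, {"col_0": 1, "col_1": "b"}]
kept = apply_filter(rows, ("col_0", "==", 0))
```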
Full Changelog: https://github.com/huggingface/datasets/compare/4.1.1...4.2.0
Full Changelog: https://github.com/huggingface/datasets/compare/4.1.0...4.1.1
feat: use content defined chunking by @kszucs in https://github.com/huggingface/datasets/pull/7589
Parquet datasets are now Optimized Parquet!
<img width="462" height="103" alt="image" src="https://github.com/user-attachments/assets/43703a47-0964-421b-8f01-1a790305de79" />
datasets internally uses use_content_defined_chunking=True when writing Parquet files
This enables fast deduped uploads to Hugging Face!
# Now faster thanks to content defined chunking
ds.push_to_hub("username/dataset_name")
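Content-defined chunking picks chunk boundaries from the bytes themselves rather than at fixed offsets, so an insertion only disturbs nearby chunks and the rest still deduplicate. A toy sketch of the idea (the real Parquet writer's algorithm differs):

```python
def chunks(data, mask=0x0F, min_len=4):
    # Cut a chunk wherever the byte value satisfies a content-based condition;
    # boundaries depend only on local content, not on absolute offsets.
    out, start = [], 0
    for i, byte in enumerate(data):
        if byte & mask == mask and i + 1 - start >= min_len:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])
    return out
```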
write_page_index=True is also used to enable fast random access for the Dataset Viewer and tools that need it
Concurrent push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7708
Concurrent IterableDataset push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7710
HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7690
ds = load_dataset("username/dataset-with-hdf5-files")
train_test_split by @qgallouedec in https://github.com/huggingface/datasets/pull/7736
Full Changelog: https://github.com/huggingface/datasets/compare/4.0.0...4.1.0
Add IterableDataset.push_to_hub() by @lhoestq in https://github.com/huggingface/datasets/pull/7595
# Build streaming data pipelines in a few lines of code!
from datasets import load_dataset
ds = load_dataset(..., streaming=True)
ds = ds.map(...).filter(...)
ds.push_to_hub(...)
Add num_proc= to .push_to_hub() (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606
# Faster push to Hub! Available for both Dataset and IterableDataset
ds.push_to_hub(..., num_proc=8)
New Column object
# Syntax:
ds["column_name"] # datasets.Column([...]) or datasets.IterableColumn(...)
# Iterate on a column:
for text in ds["text"]:
...
# Load one cell without bringing the full column in memory
first_text = ds["text"][0] # equivalent to ds[0]["text"]
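A toy model of what the Column object gives you: indexing and iteration fetch cells on demand instead of materializing the whole column (class and names are illustrative, not the library's internals):

```python
class LazyColumn:
    def __init__(self, fetch_row, num_rows):
        self.fetch_row, self.num_rows = fetch_row, num_rows

    def __getitem__(self, i):
        return self.fetch_row(i)  # load a single cell on demand

    def __iter__(self):
        return (self.fetch_row(i) for i in range(self.num_rows))

col = LazyColumn(lambda i: f"text {i}", num_rows=3)
first = col[0]          # only one cell is fetched
all_texts = list(col)   # iteration fetches row by row
```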
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
# Don't download full audios/videos when it's not necessary
# Now with torchcodec it only streams the required ranges/frames:
from datasets import load_dataset
ds = load_dataset(..., streaming=True)
for example in ds:
video = example["video"]
frames = video.get_frames_in_range(start=0, stop=6, step=1) # only stream certain frames
Torchcodec requires torch>=2.7.0 and FFmpeg >= 4 (use datasets<4.0 to keep the previous decoding behavior).
AudioDecoder:
audio = dataset[0]["audio"] # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>
samples = audio.get_all_samples() # or use get_samples_played_in_range(...)
samples.data # tensor([[ 0.0000e+00, 0.0000e+00, 0.0000e+00, ..., 2.3447e-06, -1.9127e-04, -5.3330e-05]]
samples.sample_rate # 16000
# old syntax is still supported
array, sr = audio["array"], audio["sampling_rate"]
VideoDecoder:
video = dataset[0]["video"] # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>
first_frame = video.get_frame_at(0)
first_frame.data.shape # (3, 240, 320)
first_frame.pts_seconds # 0.0
frames = video.get_frames_in_range(0, 6, 1)
frames.data.shape # torch.Size([5, 3, 240, 320])
Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592
trust_remote_code is no longer supported
Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616
Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634
New List type:
from datasets import Features, List, Value
features = Features({
"texts": List(Value("string")),
"four_paragraphs": List(Value("string"), length=4)
})
Sequence was a legacy type from tensorflow datasets which converted lists of dicts to dicts of lists. It is no longer a type: it is now a utility that returns a List or a dict depending on the subfeature.
from datasets import Sequence
Sequence(Value("string")) # List(Value("string"))
Sequence({"texts": Value("string")}) # {"texts": List(Value("string"))}
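A toy sketch of the Sequence utility's dispatch rule described above: a dict subfeature distributes the list over each value, anything else is wrapped in a list type directly (tuples stand in for the real feature classes):

```python
def sequence(subfeature):
    # dict subfeature: list-of-dicts becomes dict-of-lists, so each value
    # gets wrapped in a list type; otherwise wrap the subfeature itself
    if isinstance(subfeature, dict):
        return {key: ("List", value) for key, value in subfeature.items()}
    return ("List", subfeature)
```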
Dataset.map to reuse cache files mapped with different num_proc by @ringohoffman in https://github.com/huggingface/datasets/pull/7434
RepeatExamplesIterable by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581
_dill.py to use co_linetable for Python 3.10+ in place of co_lnotab by @qgallouedec in https://github.com/huggingface/datasets/pull/7609
Full Changelog: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0
Remove aiohttp from direct dependencies by @akx in https://github.com/huggingface/datasets/pull/7294
Full Changelog: https://github.com/huggingface/datasets/compare/3.5.1...3.6.0
Fix TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'
Full Changelog: https://github.com/huggingface/datasets/compare/3.5.0...3.5.1
>>> from datasets import load_dataset, Pdf
>>> repo = "path/to/pdf/folder" # or username/dataset_name on Hugging Face
>>> dataset = load_dataset(repo, split="train")
>>> dataset[0]["pdf"]
<pdfplumber.pdf.PDF at 0x1075bc320>
>>> dataset[0]["pdf"].pages[0].extract_text()
...
Full Changelog: https://github.com/huggingface/datasets/compare/3.4.1...3.5.0