- dl_manager.iter_files when they are given as input by @mariosasko in https://github.com/huggingface/datasets/pull/6230
- audio.py by @mariosasko in https://github.com/huggingface/datasets/pull/6241
- apache_beam import in BeamBasedBuilder._save_info by @mariosasko in https://github.com/huggingface/datasets/pull/6265
- tensorflow maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6301
- jax maximum version by @mariosasko in https://github.com/huggingface/datasets/pull/6300
- push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/6269
- fsspec version to the datasets-cli env command output by @mariosasko in https://github.com/huggingface/datasets/pull/6356
- Dataset.map docstring by @bryant1410 in https://github.com/huggingface/datasets/pull/6373
- Image by @mariosasko in https://github.com/huggingface/datasets/pull/6379

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.7...2.15.0
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.6...2.14.7
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.5...2.14.6
- iter_files for hidden files by @mariosasko in https://github.com/huggingface/datasets/pull/6092
- columns by @mariosasko in https://github.com/huggingface/datasets/pull/6160
- datasets_info.json but no README by @clefourrier in https://github.com/huggingface/datasets/pull/6164
- revision argument by @qgallouedec in https://github.com/huggingface/datasets/pull/6191
- Dataset.export by @mariosasko in https://github.com/huggingface/datasets/pull/6081
- download_custom by @mariosasko in https://github.com/huggingface/datasets/pull/6093
- select_columns to guide by @unifyh in https://github.com/huggingface/datasets/pull/6119
- to_iterable_dataset by @stevhliu in https://github.com/huggingface/datasets/pull/6158
- image_load doc by @mariosasko in https://github.com/huggingface/datasets/pull/6181
- huggingface/documentation-images by @mariosasko in https://github.com/huggingface/datasets/pull/6177
- hf-internal-testing repos for hosting test dataset repos by @mariosasko in https://github.com/huggingface/datasets/pull/6180

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.4...2.14.5
Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.13.2
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.3...2.14.4
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.2...2.14.3
Full Changelog: https://github.com/huggingface/datasets/compare/2.14.1...2.14.2
- Overview.ipynb & detach Jupyter Notebooks from datasets repository by @alvarobartt in https://github.com/huggingface/datasets/pull/5902

Full Changelog: https://github.com/huggingface/datasets/compare/2.14.0...2.14.1
datasets>=2.14.0 may not be reloaded from cache using an older version of datasets (and are therefore re-downloaded).

Support for multiple configs via metadata yaml info by @polinaeterna in https://github.com/huggingface/datasets/pull/5331
---
configs:
- config_name: default
data_files:
- split: train
path: data.csv
- split: test
path: holdout.csv
---
---
configs:
- config_name: main_data
data_files: main_data.csv
- config_name: additional_data
data_files: additional_data.csv
---
push_to_hub() additional dataset configurations:
ds.push_to_hub("username/dataset_name", config_name="additional_data")
# reload later
ds = load_dataset("username/dataset_name", "additional_data")
Support returning dataframe in map transform by @mariosasko in https://github.com/huggingface/datasets/pull/5995
- errors param in favor of encoding_errors in text builder by @mariosasko in https://github.com/huggingface/datasets/pull/5974
- huggingface_hub's RepoCard API by @mariosasko in https://github.com/huggingface/datasets/pull/5949
- joblib to avoid joblibspark test failures by @mariosasko in https://github.com/huggingface/datasets/pull/6000
- column_names type check with type hint in sort by @mariosasko in https://github.com/huggingface/datasets/pull/6001
- use_auth_token in favor of token by @mariosasko in https://github.com/huggingface/datasets/pull/5996
- ClassLabel min max check for None values by @mariosasko in https://github.com/huggingface/datasets/pull/6023
- task_templates in IterableDataset when they are no longer valid by @mariosasko in https://github.com/huggingface/datasets/pull/6027
- HfFileSystem and deprecate S3FileSystem by @mariosasko in https://github.com/huggingface/datasets/pull/6052
- Dataset.from_list docstring by @mariosasko in https://github.com/huggingface/datasets/pull/6062
- features are specified by @mariosasko in https://github.com/huggingface/datasets/pull/6045

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.1...2.14.0
- list_datasets by @mariosasko in https://github.com/huggingface/datasets/pull/5964
- encoding and errors params to JSON loader by @mariosasko in https://github.com/huggingface/datasets/pull/5969

Full Changelog: https://github.com/huggingface/datasets/compare/2.13.0...2.13.1
Add IterableDataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5770
from datasets import IterableDataset
from torch.utils.data import DataLoader
ids = IterableDataset.from_spark(df)
ids = ids.map(...).filter(...).with_format("torch")
for batch in DataLoader(ids, batch_size=16, num_workers=4):
...
IterableDataset formatting for PyTorch, TensorFlow, Jax, NumPy and Arrow:
from datasets import load_dataset
ids = load_dataset("c4", "en", split="train", streaming=True)
ids = ids.map(...).with_format("torch") # to get PyTorch tensors - also works with tf, np, jax etc.
Add IterableDataset.from_file to load local dataset as iterable by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5893
from datasets import IterableDataset
ids = IterableDataset.from_file("path/to/data.arrow")
Arrow dataset builder to be able to load and stream Arrow datasets by @mariusz-jachimowicz-83 in https://github.com/huggingface/datasets/pull/5944
from datasets import load_dataset
ds = load_dataset("arrow", data_files={"train": "train.arrow", "test": "test.arrow"})
- stopping_strategy of shuffled interleaved dataset (random cycling case) by @mariosasko in https://github.com/huggingface/datasets/pull/5816
- BuilderConfig by @Laurent2916 in https://github.com/huggingface/datasets/pull/5824
- accelerate as metric's test dependency to fix CI error by @mariosasko in https://github.com/huggingface/datasets/pull/5848
- date_format param to the CSV reader by @mariosasko in https://github.com/huggingface/datasets/pull/5845
- fn_kwargs to map and filter of IterableDataset and IterableDatasetDict by @yuukicammy in https://github.com/huggingface/datasets/pull/5810
- FixedSizeListArray casting by @mariosasko in https://github.com/huggingface/datasets/pull/5897
- DatasetBuilder.as_dataset when file_format is not "arrow" by @mariosasko in https://github.com/huggingface/datasets/pull/5915
- flatten_indices to DatasetDict by @maximxlss in https://github.com/huggingface/datasets/pull/5907
- batch_size optional, and minor improvements in Dataset.to_tf_dataset by @alvarobartt in https://github.com/huggingface/datasets/pull/5883
- to_numpy when None values in the sequence by @qgallouedec in https://github.com/huggingface/datasets/pull/5933

Full Changelog: https://github.com/huggingface/datasets/compare/2.12.0...2.13.0
Add Dataset.from_spark by @maddiedawson in https://github.com/huggingface/datasets/pull/5701
>>> from datasets import Dataset
>>> ds = Dataset.from_spark(df)
Support streaming Beam datasets from HF GCS preprocessed data by @albertvillanova in https://github.com/huggingface/datasets/pull/5689
>>> from datasets import load_dataset
>>> ds = load_dataset("wikipedia", "20220301.de", streaming=True)
>>> next(iter(ds["train"]))
{'id': '1', 'url': 'https://de.wikipedia.org/wiki/Alan%20Smithee', 'title': 'Alan Smithee', 'text': 'Alan Smithee steht als Pseudonym für einen fiktiven Regisseur...}
Implement sharding on merged iterable datasets by @Hubert-Bonisseur in https://github.com/huggingface/datasets/pull/5735
>>> from datasets import load_dataset, interleave_datasets
>>> from torch.utils.data import DataLoader
>>> wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
>>> c4 = load_dataset("c4", "en", split="train", streaming=True)
>>> merged = interleave_datasets([wiki, c4], probabilities=[0.1, 0.9], seed=42, stopping_strategy="all_exhausted")
>>> dataloader = DataLoader(merged, num_workers=4)
Consistent ArrayND Python formatting + better NumPy/Pandas formatting by @mariosasko in https://github.com/huggingface/datasets/pull/5751
Full Changelog: https://github.com/huggingface/datasets/compare/2.11.0...2.12.0
- batch_size on Dataset.to_dict()
- download_and_prepare() a dataset
- load_dataset()
- from_dict by @mariosasko in https://github.com/huggingface/datasets/pull/5643
- ffmpeg system package installation on Colab by @polinaeterna in https://github.com/huggingface/datasets/pull/5558
- datasets.load_from_disk, DatasetDict.load_from_disk and Dataset.load_from_disk by @alvarobartt in https://github.com/huggingface/datasets/pull/5529
- huggingface_hub version to env cli command by @mariosasko in https://github.com/huggingface/datasets/pull/5578
- save_to_disk by @mariosasko in https://github.com/huggingface/datasets/pull/5588
- sort with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5587
- datasets-cli test by @lhoestq in https://github.com/huggingface/datasets/pull/5603
- verification_mode values by @polinaeterna in https://github.com/huggingface/datasets/pull/5607
- ruff by @polinaeterna in https://github.com/huggingface/datasets/pull/5636
- Features by @mariosasko in https://github.com/huggingface/datasets/pull/5646
- fsspec.open when using an HTTP proxy by @bryant1410 in https://github.com/huggingface/datasets/pull/5656

Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.11.0
- IndexError when doing ds.filter(...).sort(...) or ds.select(...).sort(...)

Full Changelog: https://github.com/huggingface/datasets/compare/2.10.0...2.10.1
- .flatten_indices() (x2) + save/load_from_disk (x100) on selected/shuffled datasets
- verification_mode you can pass to load_dataset()
- .map() in multiprocessing
- .to_iterable_dataset() to get an IterableDataset from a Dataset
- IterableDataset in the documentation about the differences between Dataset and IterableDataset
- select_columns() to return a dataset only containing the requested columns
- ds = ds.sort(['col_1', 'col_2'], reverse=[True, False])
- ds = ds.with_format("jax", device=device)
- nyu_depth_v2 dataset by @awsaf49 in https://github.com/huggingface/datasets/pull/5484
- load_from_cache_file arg from Dataset.shard() docstring by @polinaeterna in https://github.com/huggingface/datasets/pull/5493
- NumpyFormatter by @alvarobartt in https://github.com/huggingface/datasets/pull/5530
- load_from_cache_file type and logic by @HallerPatrick in https://github.com/huggingface/datasets/pull/5515
- ruff by @mariosasko in https://github.com/huggingface/datasets/pull/5519

Full Changelog: https://github.com/huggingface/datasets/compare/2.9.0...2.10.0
Parallel implementation of to_tf_dataset() by @Rocketknight1 in https://github.com/huggingface/datasets/pull/5377
num_workers= to .to_tf_dataset() to make your dataset faster with multiprocessing

Distributed support by @lhoestq in https://github.com/huggingface/datasets/pull/5369
Dataset and IterableDataset (e.g. in streaming mode):
import os
from datasets.distributed import split_dataset_by_node
rank = int(os.environ["RANK"])
world_size = int(os.environ["WORLD_SIZE"])
ds = split_dataset_by_node(ds, rank=rank, world_size=world_size)
Support streaming datasets with os.path.exists and Path.exists by @albertvillanova in https://github.com/huggingface/datasets/pull/5400
Tqdm progress bar for to_parquet by @zanussbaum in https://github.com/huggingface/datasets/pull/5456
ZIP files support in iter_archive with better compression type check by @Mehdi2402 in https://github.com/huggingface/datasets/pull/3379
Support other formats than uint8 for image arrays by @vigsterkr in https://github.com/huggingface/datasets/pull/5365
- fs.open resource leaks by @tkukurin in https://github.com/huggingface/datasets/pull/5358
- cast_to_python_objects by @mariosasko in https://github.com/huggingface/datasets/pull/5384
- load_dataset docstring by @mariosasko in https://github.com/huggingface/datasets/pull/5389
- shard_size arg from .push_to_hub() by @polinaeterna in https://github.com/huggingface/datasets/pull/5469

Full Changelog: https://github.com/huggingface/datasets/compare/2.8.0...2.9.0
datasets are not able to reload datasets pushed with this new model, so we encourage everyone to update.

- IterableDataset.map that led to features=None by @alvarobartt in https://github.com/huggingface/datasets/pull/5287
- features after column renaming or removal
- features param to IterableDataset.map by @alvarobartt in https://github.com/huggingface/datasets/pull/5311
- num_shards or max_shard_size to ds.save_to_disk() or ds.push_to_hub()
- num_proc to use multiprocessing

from datasets import load_dataset
ds = load_dataset("c4", "en", streaming=True, split="train")
dataloader = DataLoader(ds, batch_size=32, num_workers=4)
- max_shard_size docs by @lhoestq in https://github.com/huggingface/datasets/pull/5267
- from_generator docs by @mariosasko in https://github.com/huggingface/datasets/pull/5307
- wikipedia or natural_questions
- ArrowWriter.finalize before inference error by @mariosasko in https://github.com/huggingface/datasets/pull/5309
- num_proc for dataset download and generation by @mariosasko in https://github.com/huggingface/datasets/pull/5300
- IterableDataset.map param batch_size typing as optional by @alvarobartt in https://github.com/huggingface/datasets/pull/5336
- topdown parameter in xwalk by @mariosasko in https://github.com/huggingface/datasets/pull/5308
- use_auth_token docstring and deprecate use_auth_token in download_and_prepare by @mariosasko in https://github.com/huggingface/datasets/pull/5302
- .tar archives in the same way as for .tar.gz and .tgz in _get_extraction_protocol by @polinaeterna in https://github.com/huggingface/datasets/pull/5322

Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.8.0
Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.6.2
Full Changelog: https://github.com/huggingface/datasets/compare/2.7.0...2.7.1