from datasets import load_dataset
ds = load_dataset("imagenet-1k", num_proc=4)
map or filter that uses tensors or pipelines can now be cached
pyproject.toml for black by @mariosasko in https://github.com/huggingface/datasets/pull/5125
tqdm zip bug by @david1542 in https://github.com/huggingface/datasets/pull/5120
class_encode_column by @mariosasko in https://github.com/huggingface/datasets/pull/5130
writer_batch_size by @mariosasko in https://github.com/huggingface/datasets/pull/5163
co_filenames to remove by @gpucce in https://github.com/huggingface/datasets/pull/5169
DownloadConfig.use_auth_token value by @alvarobartt in https://github.com/huggingface/datasets/pull/5205
typer version in tests to <0.5 to fix Windows CI by @polinaeterna in https://github.com/huggingface/datasets/pull/5235
Version hashable by @mariosasko in https://github.com/huggingface/datasets/pull/5238
Full Changelog: https://github.com/huggingface/datasets/compare/2.6.1...2.7.0
filter could return examples with the wrong indices
map with batched=True could return a dataset with fewer examples
Full Changelog: https://github.com/huggingface/datasets/compare/2.6.0...2.6.1
from datasets import Dataset
dataset = Dataset.from_sql("data_table", "sqlite:///sqlite_file.db")
from_sql + small doc improvement by @mariosasko in https://github.com/huggingface/datasets/pull/5091
from datasets import Dataset
from sqlite3 import connect
con = connect(...)
dataset = Dataset.from_sql("SELECT text FROM table WHERE length(text) > 100 LIMIT 10", con)
from datasets import load_dataset
ds = load_dataset("imagenet-1k").with_format("torch") # or numpy/tf/jax
ds[0]["image"]
IterableDataset.from_generator by @hamid-vakilzadeh in https://github.com/huggingface/datasets/pull/5052
kwargs to Dataset.from_generator by @mariosasko in https://github.com/huggingface/datasets/pull/5049
converters in CsvBuilder by @mariosasko in https://github.com/huggingface/datasets/pull/5057
load_from_disk by @asofiaoliveira in https://github.com/huggingface/datasets/pull/5073
ClassLabel docstring example by @alvarobartt in https://github.com/huggingface/datasets/pull/5029
flatten_indices with empty indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/5043
hffs by @lhoestq in https://github.com/huggingface/datasets/pull/5101
Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.6.0
Full Changelog: https://github.com/huggingface/datasets/compare/2.5.1...2.5.2
Full Changelog: https://github.com/huggingface/datasets/compare/2.5.0...2.5.1
!pip install evaluate
import evaluate
metric = evaluate.load("accuracy")
Dataset.from_list by @sanderland in https://github.com/huggingface/datasets/pull/4890
Dataset.from_generator by @mariosasko in https://github.com/huggingface/datasets/pull/4957
input_columns in Dataset.map if input_columns are specified by @mariosasko in https://github.com/huggingface/datasets/pull/4971
fn_kwargs param to IterableDataset.map by @mariosasko in https://github.com/huggingface/datasets/pull/4975
language_bcp47 tag by @lhoestq in https://github.com/huggingface/datasets/pull/4753
Full Changelog: https://github.com/huggingface/datasets/compare/2.4.0...2.5.0
concatenate_datasets for iterable datasets by @lhoestq in https://github.com/huggingface/datasets/pull/4500
metadata.jsonl from parent directories in imagefolder by @mariosasko in https://github.com/huggingface/datasets/pull/4576
ArrowWriter.write_batch when batch is empty by @alvarobartt in https://github.com/huggingface/datasets/pull/4510
batch_size parameter when calling add_faiss_index and add_faiss_index_from_external_arrays by @alvarobartt in https://github.com/huggingface/datasets/pull/4535
load_dataset by @mariosasko in https://github.com/huggingface/datasets/pull/4577
_arrow_to_datasets_dtype conversion by @mariosasko in https://github.com/huggingface/datasets/pull/4628
assertEqual with assertTupleEqual in unit tests for verbosity by @alvarobartt in https://github.com/huggingface/datasets/pull/4496
embed_storage on features inside lists/sequences by @mariosasko in https://github.com/huggingface/datasets/pull/4615
from_pandas more robust by @mariosasko in https://github.com/huggingface/datasets/pull/4703
DatasetInfo/Features by @mariosasko in https://github.com/huggingface/datasets/pull/4741
Full Changelog: https://github.com/huggingface/datasets/compare/2.3.2...2.4.0
/../ is passed to data_files causing FileNotFoundError
Full Changelog: https://github.com/huggingface/datasets/compare/2.3.1...2.3.2
DownloadConfig, DownloadMode, DownloadManager
Full Changelog: https://github.com/huggingface/datasets/compare/2.3.0...2.3.1
load_dataset without requiring a manual download!
load_dataset("imagenet-1k", streaming=True)
train_test_split by @nandwalritik in https://github.com/huggingface/datasets/pull/4322
push_to_hub: skip identical files in push_to_hub instead of overwriting by @mariosasko in https://github.com/huggingface/datasets/pull/4402
features in packaged loaders by @mariosasko in https://github.com/huggingface/datasets/pull/4364
scene_parse_150 card by @mariosasko in https://github.com/huggingface/datasets/pull/4447
new_fingerprint by @fxmarty in https://github.com/huggingface/datasets/pull/4326
iter_files by @mariosasko in https://github.com/huggingface/datasets/pull/4412
dataset_infos.json with new split info in Dataset.push_to_hub to avoid verification error by @mariosasko in https://github.com/huggingface/datasets/pull/4415
inspect_dataset and inspect_metric by @mariosasko in https://github.com/huggingface/datasets/pull/4433
_format_columns in remove_columns by @alvarobartt in https://github.com/huggingface/datasets/pull/4411
Full Changelog: https://github.com/huggingface/datasets/compare/2.2.2...2.3.0
DatasetDict.push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/4372
transformers - pinning the version to <0.3.5 for now
Full Changelog: https://github.com/huggingface/datasets/compare/2.2.1...2.2.2
datasets 2.2.0 introduced a bug in cnn_dailymail and some examples were missing in the dataset
Full Changelog: https://github.com/huggingface/datasets/compare/2.2.0...2.2.1
path exists by @patrickvonplaten in https://github.com/huggingface/datasets/pull/4212
imagefolder by @mariosasko in https://github.com/huggingface/datasets/pull/4069
metadata.jsonl, more info in the documentation on how to load an image dataset
data_dir parameter when loading datasets without script by @polinaeterna in https://github.com/huggingface/datasets/pull/4144
drop_last_batch to IterableDataset.map by @mariosasko in https://github.com/huggingface/datasets/pull/4215
train-eval-index metadata to automate evaluation on your datasets based on their tasks
huggingface_hub by @julien-c in https://github.com/huggingface/datasets/pull/4154
shard_size in push_to_hub in favor of max_shard_size by @mariosasko in https://github.com/huggingface/datasets/pull/4190
convert_file_size_to_int for kilobits and megabits by @mariosasko in https://github.com/huggingface/datasets/pull/4205
faiss import to fix https://github.com/huggingface/datasets/issues/4287 by @alvarobartt in https://github.com/huggingface/datasets/pull/4288
Full Changelog: https://github.com/huggingface/datasets/compare/2.1.0...2.2.0
facebook/multilingual_librispeech by @polinaeterna in https://github.com/huggingface/datasets/pull/4060
PIL.Image file handler in Image.decode_example by @mariosasko in https://github.com/huggingface/datasets/pull/3995
map remove_columns on empty dataset by @lhoestq in https://github.com/huggingface/datasets/pull/4021
push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/4081
cast_to_python_objects in TypedSequence by @mariosasko in https://github.com/huggingface/datasets/pull/4128
Full Changelog: https://github.com/huggingface/datasets/compare/2.0.0...2.1.0
We're happy to announce that our new documentation is available at hf.co/docs/datasets!
imagefolder dataset loader:
push_to_hub:
Audio and Image feature in push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/3685
IterableDataset.filter by @lhoestq in https://github.com/huggingface/datasets/pull/3826
IterableDataset (rename columns, cast, etc.) by @lhoestq in https://github.com/huggingface/datasets/pull/3862
to_json by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3551
FaissIndex by @rentruewang in https://github.com/huggingface/datasets/pull/3721
map and shuffle for datasets loaded in streaming mode:
map when streaming: update instead of overwrite + add missing parameters by @lhoestq in https://github.com/huggingface/datasets/pull/3801
IterableDataset.shuffle with Dataset.shuffle by @lhoestq in https://github.com/huggingface/datasets/pull/3842
remove_columns param in filter by @mariosasko in https://github.com/huggingface/datasets/pull/3827
module.builder_kwargs over defaults in TestCommand by @lvwerra in https://github.com/huggingface/datasets/pull/3672
Dataset.select are within bounds by @mariosasko in https://github.com/huggingface/datasets/pull/3719
push_to_hub by @mariosasko in https://github.com/huggingface/datasets/pull/3732
ignore_verifications is True by @mariosasko in https://github.com/huggingface/datasets/pull/3796
data_dir to data_files resolution and misc improvements to HfFileSystem by @mariosasko in https://github.com/huggingface/datasets/pull/3791
predictions/references in Metric.compute by @mariosasko in https://github.com/huggingface/datasets/pull/3824
ignore_verifications=True by @mariosasko in https://github.com/huggingface/datasets/pull/3868
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...2.0.0
module.builder_kwargs over defaults in TestCommand #3672 (@lvwerra)
Dataset.select are within bounds #3719 (@mariosasko)
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.3...1.18.4
get_dataset_split_names by @mariosasko in https://github.com/huggingface/datasets/pull/3657
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.2...1.18.3
None by @mariosasko in https://github.com/huggingface/datasets/pull/3642
add_column on datasets with indices mapping by @mariosasko in https://github.com/huggingface/datasets/pull/3647
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.1...1.18.2
prepare_for_task() by @mariosasko in https://github.com/huggingface/datasets/pull/3614
Full Changelog: https://github.com/huggingface/datasets/compare/1.18.0...1.18.1
iter_files instead of str(Path(...)) in image dataset by @mariosasko in https://github.com/huggingface/datasets/pull/3477
ImageClassification task template by @mariosasko in https://github.com/huggingface/datasets/pull/3557
DuplicatedKeysError and improve card by @mariosasko in https://github.com/huggingface/datasets/pull/3559
gzip for to_json by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3492
preserve_index to from_pandas by @Sorrow321 in https://github.com/huggingface/datasets/pull/3565
str(Path(...)) conversion in streaming on Linux by @mariosasko in https://github.com/huggingface/datasets/pull/3472
pretty_name for first 200 datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3498
pretty_name for all the other datasets by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/3536
Iterable.map call by @mariosasko in https://github.com/huggingface/datasets/pull/3556
Full Changelog: https://github.com/huggingface/datasets/compare/1.17.0...1.18.0