- `None` handling by @mariosasko in https://github.com/huggingface/datasets/pull/3195
- `cast_column` to IterableDataset by @mariosasko in https://github.com/huggingface/datasets/pull/3439

Full Changelog: https://github.com/huggingface/datasets/compare/1.16.1...1.17.0
- `datasets` on Python 3.10 by @lhoestq in https://github.com/huggingface/datasets/pull/3326
- `push_to_hub()` method for `Dataset` and `DatasetDict` by @LysandreJik in https://github.com/huggingface/datasets/pull/3098
- `with_rank` arg to pass process rank to `map` by @TevenLeScao in https://github.com/huggingface/datasets/pull/3314
- `to_tf_dataset` by @stevhliu in https://github.com/huggingface/datasets/pull/3175
- `len(predictions)` doesn't match `len(references)` in metrics by @mariosasko in https://github.com/huggingface/datasets/pull/3160
- `fingerprint.py`, `search.py`, `arrow_writer.py` and `metric.py` by @Ishan-Kumar2 in https://github.com/huggingface/datasets/pull/3305

Full Changelog: https://github.com/huggingface/datasets/compare/1.15.1...1.16.0
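The `with_rank` argument above passes the process rank to the mapped function as an extra positional argument. A pure-Python sketch of that calling convention (the `map_with_rank` helper and the shard layout are illustrative, not the library API):

```python
def map_with_rank(function, shards, with_rank=False):
    """Apply `function` to every example in every shard.

    When with_rank is True, the shard index (the "process rank")
    is passed to `function` as a second positional argument,
    mirroring the calling convention described above.
    """
    results = []
    for rank, shard in enumerate(shards):
        for example in shard:
            if with_rank:
                results.append(function(example, rank))
            else:
                results.append(function(example))
    return results

# Each "worker" tags its examples with its own rank.
shards = [[{"text": "a"}, {"text": "b"}], [{"text": "c"}]]
tagged = map_with_rank(lambda ex, rank: {**ex, "rank": rank}, shards, with_rank=True)
```

Passing the rank this way lets a mapped function do rank-dependent work (e.g. pin a model to a GPU per process) without global state.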
- `to_csv` by @bhavitvyamalik in https://github.com/huggingface/datasets/pull/2896
- `to_tf_dataset` by @Rocketknight1 in https://github.com/huggingface/datasets/pull/3085
- `zip_dict` by @mariosasko in https://github.com/huggingface/datasets/pull/3170
- `to_tf_dataset` method #2731 #2931 #2951 #2974 (@Rocketknight1)
- `remove_columns` to IterableDataset #3030 (@cccntu)
- `get_dataset_split_names()` to get a dataset config's split names #2906 (@severo)
- The `script_version` parameter in `load_dataset` is now deprecated, in favor of `revision`
- Calling `filter` several times in a row was not returning the right results in 1.12.0 and 1.12.1
- `filter` #2947 (@lhoestq)
- `read_csv` parameters #2960 (@SBrandeis)
- The `prepare_module` function doesn't support the `return_resolved_file_path` and `return_associated_base_path` parameters anymore. As an alternative, you may use `dataset_module_factory` instead.
- `ArrowInvalid: Can only convert 1-dimensional array values` errors

See the new documentation here!
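`remove_columns` on an `IterableDataset` drops fields lazily as examples stream by, instead of materializing the whole dataset. A pure-Python sketch of that behavior (the generator below is illustrative, not the library implementation):

```python
def remove_columns(examples, column_names):
    # Lazily yield a copy of each example without the dropped columns,
    # so nothing has to be loaded or rewritten up front.
    for example in examples:
        yield {k: v for k, v in example.items() if k not in column_names}

stream = iter([{"id": 0, "text": "hello"}, {"id": 1, "text": "world"}])
cleaned = list(remove_columns(stream, ["id"]))
```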
- `to_json` #2747 (@bhavitvyamalik)
- `column_names` showed as `:func:` in exploring.st #2851 (@ClementRomac)
- `tokenize_exemple` #2726 (@shabie)
- The error message to tell which dataset config name to load was not displayed
Docstrings:
```python
load_dataset("csv", data_files={"train": [url_to_one_csv_file, url_to_another_csv_file...]})
```
This works for all these dataset loaders:
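The files listed under one split name in `data_files` are concatenated into a single split. A stdlib-only sketch of that merging behavior (the `load_split` helper and the file contents are made up for illustration):

```python
import csv
import io

def load_split(files):
    # Read every CSV file belonging to the split and concatenate
    # their rows, the way a list value in `data_files` becomes
    # one split.
    rows = []
    for f in files:
        rows.extend(csv.DictReader(io.StringIO(f)))
    return rows

part1 = "label,text\n0,foo\n"
part2 = "label,text\n1,bar\n"
train = load_split([part1, part2])
```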
`streaming=True`. Main contributions:
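With `streaming=True`, examples are yielded progressively instead of the whole dataset being downloaded first. A pure-Python sketch of that lazy iteration (the `stream_examples` helper and the shard names are illustrative, not the library implementation):

```python
def stream_examples(shard_urls, fetch):
    # Yield examples one shard at a time; a shard is only "downloaded"
    # (fetched) when iteration actually reaches it.
    for url in shard_urls:
        for example in fetch(url):
            yield example

fetched = []

def fake_fetch(url):
    fetched.append(url)  # record which shards were touched
    return [{"url": url, "i": i} for i in range(2)]

stream = stream_examples(["shard-0", "shard-1"], fake_fetch)
first = next(stream)  # at this point only shard-0 has been fetched
```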
- `filter` with multiprocessing in case all samples are discarded #2601 (@mxschmdt)
- `Dataset.map` #2540 (@lewtun)
- n>1M size tag #2527 (@lhoestq)
- `desc` parameter in `map` for `DatasetDict` object #2423 (@bhavitvyamalik)
- `Dataset.cast` can now change the feature types of `Sequence` fields
- `keep_in_memory=True` when loading a dataset to load it in memory
- `desc` to tqdm in `Dataset.map()` #2374 (@bhavitvyamalik)
- `key` type and duplicates verification with hashing #2245 (@NikhilBartwal)
- Fix memory issue: don't copy recordbatches in memory during a table deepcopy #2291 (@lhoestq)
This affected methods like `concatenate_datasets`, multiprocessed `map` and `load_from_disk`.
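The memory fix above amounts to sharing immutable record batches between table copies instead of duplicating their buffers. A rough pure-Python sketch of the idea (`Table` here is a stand-in, not the real Arrow table):

```python
class Table:
    """Stand-in for an Arrow-backed table: a list of immutable batches."""

    def __init__(self, batches):
        self.batches = batches

    def copy(self):
        # Copy only the list of batch references, not the batches
        # themselves: immutable data can be shared safely, so a
        # "deepcopy" of the table needs no extra memory for the data.
        return Table(list(self.batches))

batch = (1, 2, 3)  # immutable "record batch"
original = Table([batch])
clone = original.copy()
shares_memory = clone.batches[0] is original.batches[0]
```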
Breaking change:
When `Dataset.map` is called with the `input_columns` parameter, the resulting dataset will only have the columns from `input_columns` and the columns added by the map function. The other columns are discarded.
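A pure-Python sketch of the new `input_columns` behavior described above (the `map_rows` helper is illustrative, not the library API): only the listed columns and the columns returned by the function survive.

```python
def map_rows(rows, function, input_columns):
    # The function receives the values of `input_columns` as positional
    # arguments; the output keeps those columns plus whatever the
    # function returns, and every other column is discarded.
    out = []
    for row in rows:
        kept = {c: row[c] for c in input_columns}
        kept.update(function(*(row[c] for c in input_columns)))
        out.append(kept)
    return out

rows = [{"text": "hi", "label": 1, "id": 7}]
mapped = map_rows(rows, lambda text: {"length": len(text)}, ["text"])
# "label" and "id" are dropped; "text" and the new "length" remain.
```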