Sample without replacement option when interleaving datasets by @radulescupetru in https://github.com/huggingface/datasets/pull/7786
ds = interleave_datasets(datasets, stopping_strategy="all_exhausted_without_replacement")
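As a rough illustration of what "without replacement" means here, the sketch below interleaves plain Python lists so that every item is yielded exactly once and exhausted sources are dropped rather than restarted. It is not the library's implementation: `interleave_without_replacement` is a hypothetical helper, and the round-robin ordering is an assumption made for the example.

```python
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def interleave_without_replacement(sources: List[Iterable[T]]) -> Iterator[T]:
    """Hypothetical helper: round-robin over sources, yielding each item once."""
    active = [iter(source) for source in sources]
    while active:
        still_active = []
        for it in active:
            try:
                item = next(it)
            except StopIteration:
                # Exhausted source: drop it instead of restarting it.
                continue
            yield item
            still_active.append(it)
        active = still_active


sources = [[0, 1, 2], [10, 11, 12, 13], [20, 21]]
result = list(interleave_without_replacement(sources))
print(result)  # [0, 10, 20, 1, 11, 21, 2, 12, 13]
```

No item repeats, and iteration ends only once every source is exhausted.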
Parquet: add on_bad_files argument to error/warn/skip bad files by @lhoestq in https://github.com/huggingface/datasets/pull/7806
ds = load_dataset(parquet_dataset_id, on_bad_files="warn")
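The three policies named above (error, warn, skip) can be pictured with a small standalone sketch. Everything in it is hypothetical and only mimics the idea: `read_files` and `fake_reader` are made-up helpers, a "bad file" is simulated by a `ValueError`, and it is assumed that "warn" reports the file and then skips it.

```python
import warnings
from typing import Callable, Iterable, Iterator, List


def read_files(
    paths: Iterable[str],
    reader: Callable[[str], List[str]],
    on_bad_files: str = "error",
) -> Iterator[str]:
    """Hypothetical helper: yield rows from each file, applying a bad-file policy."""
    if on_bad_files not in ("error", "warn", "skip"):
        raise ValueError(f"unknown policy: {on_bad_files!r}")
    for path in paths:
        try:
            rows = reader(path)
        except ValueError as err:
            if on_bad_files == "error":
                raise  # stop at the first bad file
            if on_bad_files == "warn":
                warnings.warn(f"Skipping bad file {path}: {err}")
            continue  # "warn" and "skip" both move on to the next file
        yield from rows


def fake_reader(path: str) -> List[str]:
    """Made-up reader: files ending in .bad are treated as corrupt."""
    if path.endswith(".bad"):
        raise ValueError("corrupt file")
    return [f"{path}:row0", f"{path}:row1"]


paths = ["a.parquet", "b.bad", "c.parquet"]
skipped = list(read_files(paths, fake_reader, on_bad_files="skip"))
print(skipped)
```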
Add parquet scan options and docs by @lhoestq in https://github.com/huggingface/datasets/pull/7801
ds = load_dataset(parquet_dataset_id, columns=["col_0", "col_1"])
ds = load_dataset(parquet_dataset_id, filters=[("col_0", "==", 0)])
import pyarrow.dataset  # binds `pyarrow` and loads the `dataset` submodule
fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20))
ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)
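The `128 << 20` in the cache options is a bit shift. Assuming `range_size_limit` is expressed in bytes, the short check below shows that the value works out to 128 MiB.

```python
# Shifting left by 20 bits multiplies by 2**20, i.e. one mebibyte in bytes.
MIB = 1024 ** 2
range_size_limit = 128 << 20

print(range_size_limit)         # 134217728
print(range_size_limit // MIB)  # 128
```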
Full Changelog: https://github.com/huggingface/datasets/compare/4.1.1...4.2.0
Fetched April 7, 2026