trust_remote_code=True by @lhoestq in https://github.com/huggingface/datasets/pull/6954
trust_remote_code=True to be used
Checkpoint and resume an iterable dataset (e.g. when streaming):
>>> from datasets import Dataset
>>> iterable_dataset = Dataset.from_dict({"a": range(6)}).to_iterable_dataset(num_shards=3)
>>> for idx, example in enumerate(iterable_dataset):
...     print(example)
...     if idx == 2:
...         state_dict = iterable_dataset.state_dict()
...         print("checkpoint")
...         break
>>> iterable_dataset.load_state_dict(state_dict)
>>> print("restart from checkpoint")
>>> for example in iterable_dataset:
...     print(example)
Returns:
{'a': 0}
{'a': 1}
{'a': 2}
checkpoint
restart from checkpoint
{'a': 3}
{'a': 4}
{'a': 5}
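
To make the checkpoint/resume pattern concrete without depending on the library, here is a minimal pure-Python sketch. `ResumableRange` is a hypothetical toy class, not part of `datasets`, and it does not reflect how `IterableDataset` tracks its state internally. It only mirrors the `state_dict()` / `load_state_dict()` contract shown above, and it assumes the state is a plain dict that can be serialized with `json`.

```python
import json


class ResumableRange:
    """Hypothetical toy iterable mimicking state_dict()/load_state_dict()."""

    def __init__(self, n):
        self.n = n
        self._consumed = 0     # number of examples yielded so far
        self._resume_from = 0  # index to start from on the next iteration

    def __iter__(self):
        # Take the pending resume point, then clear it so later passes start at 0.
        start, self._resume_from = self._resume_from, 0
        for i in range(start, self.n):
            # Update the counter before yielding, so a checkpoint taken inside
            # the loop body already counts the example being processed.
            self._consumed = i + 1
            yield {"a": i}

    def state_dict(self):
        return {"consumed": self._consumed}

    def load_state_dict(self, state):
        self._resume_from = state["consumed"]


ds = ResumableRange(6)
seen = []
for idx, example in enumerate(ds):
    seen.append(example)
    if idx == 2:
        # Serialize the state, as one would when writing a checkpoint to disk.
        checkpoint = json.dumps(ds.state_dict())
        break

# A fresh object restored from the checkpoint continues where the first stopped.
restored = ResumableRange(6)
restored.load_state_dict(json.loads(checkpoint))
remaining = list(restored)
```

Here `seen` holds the first three examples and `remaining` holds the last three, matching the split in the output above.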
.pth support for torch tensors by @lhoestq in https://github.com/huggingface/datasets/pull/6920
dataset_module_factory by @Wauplin in https://github.com/huggingface/datasets/pull/6959
Full Changelog: https://github.com/huggingface/datasets/compare/2.19.0...2.20.0
Fetched April 7, 2026