v4.7.0

Datasets Features

Add Json() type by @lhoestq in https://github.com/huggingface/datasets/pull/8027
- JSON Lines files that contain arbitrary JSON objects, such as tool calling datasets, are now supported. When a field or subfield contains mixed types (e.g. a mix of str/int/float/dict/list, or dictionaries with arbitrary keys), the Json() type is used to store data that would otherwise not be supported in Arrow/Parquet.
- Use the Json() type in Features() for any dataset. It is supported in any function that accepts features=, like load_dataset(), .map(), .cast(), .from_dict(), .from_list()
- Use on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map()

Examples:

You can use on_mixed_types="use_json" or specify features= with a Json() type:

>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
  ...
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64

>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, 'foo', {'subfield': 'bar'}]

This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:

>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]]  # missing fields are filled with None

>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]]  # OK
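
The difference between the two outputs above can be illustrated without the library. The following is a minimal pure-Python sketch, not the actual implementation in datasets: unify_structs and json_roundtrip are hypothetical helper names, and storing each object as a serialized JSON string is an assumption made here only to show why per-object keys survive.

```python
import json


def unify_structs(records):
    """Mimic a fixed struct schema: every record gets every key seen in
    any record, and keys a record lacks are filled with None."""
    keys = []
    for record in records:
        for key in record:
            if key not in keys:
                keys.append(key)  # keep first-seen key order
    return [{key: record.get(key) for key in keys} for record in records]


def json_roundtrip(records):
    """Mimic schema-less storage: serialize each record on its own
    (assumed mechanism), so no keys are added when reading it back."""
    encoded = [json.dumps(record) for record in records]
    return [json.loads(text) for text in encoded]


records = [{"b": 0}, {"c": 0}]
print(unify_structs(records))   # [{'b': 0, 'c': None}, {'b': None, 'c': 0}]
print(json_roundtrip(records))  # [{'b': 0}, {'c': 0}]
```

The first helper reproduces the None-filled result of the inferred List({...}) features, and the second reproduces the result obtained with List(Json()).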

Another example with tool calling data and the on_mixed_types="use_json" argument (useful to avoid specifying features= manually):

>>> messages = [
...     {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
...     {"role": "assistant", "tool_calls": [
...         {"type": "function", "function": {
...             "name": "control_light",
...             "arguments": {"room": "living room", "state": "on"}
...         }},
...         {"type": "function", "function": {
...             "name": "play_music",
...             "arguments": {"playlist": "electronic"}  # mixed-type here since keys ["playlist"] and ["room", "state"] are different
...         }}]
...     },
...     {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
...     {"role": "tool", "name": "play_music", "content": "The music is now playing."},
...     {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{'room': 'living room', 'state': 'on'}

What's Changed

Fix typos in iterable_dataset.py by @omkar-334 in https://github.com/huggingface/datasets/pull/8049
Fix non-deterministic by sorting metadata extensions (#8034) by @Nexround in https://github.com/huggingface/datasets/pull/8039
Use num_examples instead of len(self) for iterable_dataset's SplitInfo by @HaukurPall in https://github.com/huggingface/datasets/pull/8041
Fix silent data loss in push_to_hub when num_proc > num_shards by @HaukurPall in https://github.com/huggingface/datasets/pull/8044
Don't extract bad files by @lhoestq in https://github.com/huggingface/datasets/pull/8056
fix(iterable_dataset): preserve features when chaining filter() on typed IterableDataset by @s-zx in https://github.com/huggingface/datasets/pull/8053
fix: handle nested null types in feature alignment for multi-proc map by @ain-soph in https://github.com/huggingface/datasets/pull/8047
Fix unstable tokenizer fingerprinting (enables map cache reuse) by @KOKOSde in https://github.com/huggingface/datasets/pull/7982
Limit dataset listing to first 20 entries in readme by @lhoestq in https://github.com/huggingface/datasets/pull/8057

New Contributors

@omkar-334 made their first contribution in https://github.com/huggingface/datasets/pull/8049
@Nexround made their first contribution in https://github.com/huggingface/datasets/pull/8039
@HaukurPall made their first contribution in https://github.com/huggingface/datasets/pull/8041
@s-zx made their first contribution in https://github.com/huggingface/datasets/pull/8053
@ain-soph made their first contribution in https://github.com/huggingface/datasets/pull/8047
@KOKOSde made their first contribution in https://github.com/huggingface/datasets/pull/7982

Full Changelog: https://github.com/huggingface/datasets/compare/4.6.1...4.7.0
