Json() type by @lhoestq in https://github.com/huggingface/datasets/pull/8027
The Json() type stores data that would normally not be supported in Arrow/Parquet, such as fields that mix types.
Use the Json() type in Features() for any dataset. It is supported in any function that accepts features=, like load_dataset(), .map(), .cast(), .from_dict(), .from_list().
Use on_mixed_types="use_json" to automatically set the Json() type on mixed types in .from_dict(), .from_list() and .map().
Examples:
You can use on_mixed_types="use_json" or specify features= with a Json() type:
>>> from datasets import Dataset, Features, Json, List
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]})
Traceback (most recent call last):
  ...
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64
>>> features = Features({"a": Json()})
>>> ds = Dataset.from_dict({"a": [0, "foo", {"subfield": "bar"}]}, features=features)
>>> ds.features
{'a': Json()}
>>> list(ds["a"])
[0, 'foo', {'subfield': 'bar'}]
This is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]})
>>> ds.features
{'a': List({'b': Value('int64'), 'c': Value('int64')})}
>>> list(ds["a"])
[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]] # missing fields are filled with None
>>> features = Features({"a": List(Json())})
>>> ds = Dataset.from_dict({"a": [[{"b": 0}, {"c": 0}]]}, features=features)
>>> ds.features
{'a': List(Json())}
>>> list(ds["a"])
[[{'b': 0}, {'c': 0}]] # OK
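The None-filling shown above follows from struct-like columns needing one shared set of fields. The following is a minimal pure-Python sketch of that idea, not the library's actual implementation: `unify_structs` and `json_roundtrip` are hypothetical helpers, and encoding each record as a JSON string is assumed here only to illustrate how arbitrary keys can survive unchanged.

```python
import json


def unify_structs(records):
    """Give every dict the same keys, filling absent ones with None (hypothetical helper)."""
    keys = []
    for record in records:
        for key in record:
            if key not in keys:
                keys.append(key)  # keep first-seen key order
    return [{key: record.get(key) for key in keys} for record in records]


def json_roundtrip(records):
    """Encode each record as its own JSON string, then decode it back (hypothetical helper)."""
    encoded = [json.dumps(record) for record in records]
    return [json.loads(text) for text in encoded]


records = [{"b": 0}, {"c": 0}]
unified = unify_structs(records)
preserved = json_roundtrip(records)

print(unified)    # [{'b': 0, 'c': None}, {'b': None, 'c': 0}]
print(preserved)  # [{'b': 0}, {'c': 0}]
```

Each record is encoded independently in the second helper, so no record has to agree with its neighbors on a schema.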
Another example with tool calling data, using the on_mixed_types="use_json" argument to avoid specifying features= manually:
>>> messages = [
...     {"role": "user", "content": "Turn on the living room lights and play my electronic music playlist."},
...     {"role": "assistant", "tool_calls": [
...         {"type": "function", "function": {
...             "name": "control_light",
...             "arguments": {"room": "living room", "state": "on"}
...         }},
...         {"type": "function", "function": {
...             "name": "play_music",
...             "arguments": {"playlist": "electronic"}  # mixed-type here since keys ["playlist"] and ["room", "state"] are different
...         }}]
...     },
...     {"role": "tool", "name": "control_light", "content": "The lights in the living room are now on."},
...     {"role": "tool", "name": "play_music", "content": "The music is now playing."},
...     {"role": "assistant", "content": "Done!"}
... ]
>>> ds = Dataset.from_dict({"messages": [messages]}, on_mixed_types="use_json")
>>> ds.features
{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}
>>> ds[0]["messages"][1]["tool_calls"][0]["function"]["arguments"]
{'room': 'living room', 'state': 'on'}
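To make the "mixed-type" comment in the example concrete, here is a rough sketch of one way to flag such values. It is an assumption for illustration only and is not the detection logic used by the library; `has_mixed_types` is a hypothetical helper that treats values as mixed when they differ in Python type, or when they are all dictionaries with different key sets.

```python
def has_mixed_types(values):
    """Return True if non-None values differ in type, or are dicts with differing keys.

    Hypothetical helper for illustration; not part of the datasets API.
    """
    present = [value for value in values if value is not None]
    if len({type(value) for value in present}) > 1:
        return True
    if present and all(isinstance(value, dict) for value in present):
        # frozenset(dict) collects the keys, so differing key sets count as mixed
        return len({frozenset(value) for value in present}) > 1
    return False


tool_arguments = [
    {"room": "living room", "state": "on"},
    {"playlist": "electronic"},
]

print(has_mixed_types(tool_arguments))                   # True: key sets differ
print(has_mixed_types([0, "foo", {"subfield": "bar"}]))  # True: int, str and dict
print(has_mixed_types([1, 2, 3]))                        # False: all int
```

The check only looks at one level of nesting, which is enough to show why the two `arguments` dictionaries above are considered mixed.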
Full Changelog: https://github.com/huggingface/datasets/compare/4.6.1...4.7.0