{"id":"src_LHuTKvsCXKuAjW5sqiSx2","slug":"datasets","name":"Datasets","type":"github","url":"https://github.com/huggingface/datasets","orgId":"org_GDdYeYynEgCEBNBwy-m6s","org":{"slug":"hugging-face","name":"Hugging Face"},"isPrimary":false,"metadata":"{\"evaluatedMethod\":\"github\",\"evaluatedAt\":\"2026-04-07T17:19:14.080Z\",\"changelogDetectedAt\":\"2026-04-07T17:27:53.265Z\"}","releaseCount":100,"releasesLast30Days":1,"avgReleasesPerWeek":0.6,"latestVersion":"4.8.4","latestDate":"2026-03-23T14:21:52.000Z","changelogUrl":null,"hasChangelogFile":false,"lastFetchedAt":"2026-04-19T07:02:00.809Z","trackingSince":"2021-04-30T13:20:24.000Z","releases":[{"id":"rel_5c4x99dACdlgTAC61sSJ9","version":"4.8.4","title":"4.8.4","summary":"## What's Changed\r\n* Support latest torchvision by @lhoestq in https://github.com/huggingface/datasets/pull/8087\r\n* fix regression when loading JSON w...","content":"## What's Changed\r\n* Support latest torchvision by @lhoestq in https://github.com/huggingface/datasets/pull/8087\r\n* fix regression when loading JSON with one file = one object by @lhoestq in https://github.com/huggingface/datasets/pull/8086\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.8.3...4.8.4","publishedAt":"2026-03-23T14:21:52.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.8.4","media":[]},{"id":"rel_YMq1UpC_FeWH5aInaLTg_","version":"4.8.3","title":"4.8.3","summary":"## What's Changed\r\n* Fix split_dataset_by_node step by @lhoestq in https://github.com/huggingface/datasets/pull/8081\r\n* Fix docstring of Json.cast_sto...","content":"## What's Changed\r\n* Fix split_dataset_by_node step by @lhoestq in https://github.com/huggingface/datasets/pull/8081\r\n* Fix docstring of Json.cast_storage by @albertvillanova in https://github.com/huggingface/datasets/pull/8080\r\n\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/datasets/compare/4.8.2...4.8.3","publishedAt":"2026-03-19T17:44:39.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.8.3","media":[]},{"id":"rel_y8lBMSjnhQUiFZAk9rHtt","version":"4.8.2","title":"4.8.2","summary":"## What's Changed\r\n* Json type for empty struct by @lhoestq in https://github.com/huggingface/datasets/pull/8074\r\n\r\n\r\n**Full Changelog**: https://gith...","content":"## What's Changed\r\n* Json type for empty struct by @lhoestq in https://github.com/huggingface/datasets/pull/8074\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.8.1...4.8.2","publishedAt":"2026-03-17T01:10:32.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.8.2","media":[]},{"id":"rel_T3k6h4I9-zIcdjS7HAgGr","version":"4.8.1","title":"4.8.1","summary":"## What's Changed\r\n* Fix formatted iter arrow double yield by @HaukurPall in https://github.com/huggingface/datasets/pull/8063\r\n\r\n\r\n**Full Changelog**...","content":"## What's Changed\r\n* Fix formatted iter arrow double yield by @HaukurPall in https://github.com/huggingface/datasets/pull/8063\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.8.0...4.8.1","publishedAt":"2026-03-17T00:13:34.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.8.1","media":[]},{"id":"rel_zvxpmWOtEMRVHYL7e67kp","version":"4.8.0","title":"4.8.0","summary":"## Dataset Features\r\n* Read (and write) from [HF Storage Buckets](https://huggingface.co/storage): load raw data, process and save to Dataset Repos by...","content":"## Dataset Features\r\n* Read (and write) from [HF Storage Buckets](https://huggingface.co/storage): load raw data, process and save to Dataset Repos by @lhoestq in https://github.com/huggingface/datasets/pull/8064\r\n\r\n  ```python\r\n  from datasets import load_dataset\r\n  # load raw data from a Storage Bucket on HF\r\n  ds = load_dataset(\"buckets/username/data-bucket\", 
data_files=[\"*.jsonl\"])\r\n  # or manually, using hf:// paths\r\n  ds = load_dataset(\"json\", data_files=[\"hf://buckets/username/data-bucket/*.jsonl\"])\r\n  # process, filter\r\n  ds = ds.map(...).filter(...)\r\n  # publish the AI-ready dataset\r\n  ds.push_to_hub(\"username/my-dataset-ready-for-training\")\r\n  ```\r\n\r\n  This also fixes multiprocessed push_to_hub on macos that was causing segfault (now it uses spawn instead of fork).\r\n  And it bumps `dill` and `multiprocess` versions to support python 3.14\r\n* Datasets streaming iterable packaged improvements and fixes by @Michael-RDev in https://github.com/huggingface/datasets/pull/8068\r\n  * added `max_shard_size` to IterableDataset.push_to_hub (but requires iterating twice to know the full dataset twice - improvements are welcome)\r\n  * more arrow-native iterable operations for IterableDataset\r\n  * better support of glob patterns in archives, e.g. `zip://*.jsonl::hf://datasets/username/dataset-name/data.zip`\r\n  * fixes for to_pandas, videofolder, load_dataset_builder kwargs\r\n\r\n## What's Changed\r\n* fix reshard_data_sources by @lhoestq in https://github.com/huggingface/datasets/pull/8061\r\n* Improve error message for invalid data_files pattern format by @kushalkkb in https://github.com/huggingface/datasets/pull/8060\r\n* fix null filling in missing jsonl columns by @lhoestq in https://github.com/huggingface/datasets/pull/8069\r\n\r\n## New Contributors\r\n* @kushalkkb made their first contribution in https://github.com/huggingface/datasets/pull/8060\r\n* @Michael-RDev made their first contribution in https://github.com/huggingface/datasets/pull/8068\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.7.0...4.8.0","publishedAt":"2026-03-16T23:52:47.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.8.0","media":[]},{"id":"rel_cgQ2qPik8M7AE1zpINwiZ","version":"4.7.0","title":"4.7.0","summary":"## Datasets Features\r\n* Add `Json()` type by @lhoestq 
in https://github.com/huggingface/datasets/pull/8027\r\n  * JSON Lines files that contain arbitrar...","content":"## Datasets Features\r\n* Add `Json()` type by @lhoestq in https://github.com/huggingface/datasets/pull/8027\r\n  * JSON Lines files that contain arbitrary JSON objects like tool calling datasets are now supported. When there is a field or subfield containing mixed types (e.g. mix of str/int/float/dict/list or dictionaries with arbitrary keys), the `Json()`type is used to store such data that would normally not be supported in Arrow/Parquet\r\n  * Use the `Json()` type in `Features()` for any dataset, it is supported in any functions that accepts `features=`like `load_dataset()`, `.map()`, `.cast()`, `.from_dict()`, `.from_list()`\r\n  * Use `on_mixed_types=\"use_json\"` to automatically set the `Json()` type on mixed types in `.from_dict()`, `.from_list()` and `.map()`\r\n\r\nExamples:\r\n\r\nYou can use `on_mixed_types=\"use_json\"` or specify `features=` with a [`Json`] type:\r\n\r\n```python\r\n>>> ds = Dataset.from_dict({\"a\": [0, \"foo\", {\"subfield\": \"bar\"}]})\r\nTraceback (most recent call last):\r\n  ...\r\n  File \"pyarrow/error.pxi\", line 92, in pyarrow.lib.check_status\r\npyarrow.lib.ArrowInvalid: Could not convert 'foo' with type str: tried to convert to int64\r\n\r\n>>> features = Features({\"a\": Json()})\r\n>>> ds = Dataset.from_dict({\"a\": [0, \"foo\", {\"subfield\": \"bar\"}]}, features=features)\r\n>>> ds.features\r\n{'a': Json()}\r\n>>> list(ds[\"a\"])\r\n[0, \"foo\", {\"subfield\": \"bar\"}]\r\n```\r\n\r\nThis is also useful for lists of dictionaries with arbitrary keys and values, to avoid filling missing fields with None:\r\n\r\n```python\r\n>>> ds = Dataset.from_dict({\"a\": [[{\"b\": 0}, {\"c\": 0}]]})\r\n>>> ds.features\r\n{'a': List({'b': Value('int64'), 'c': Value('int64')})}\r\n>>> list(ds[\"a\"])\r\n[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]]  # missing fields are filled with None\r\n\r\n>>> features = 
Features({\"a\": List(Json())})\r\n>>> ds = Dataset.from_dict({\"a\": [[{\"b\": 0}, {\"c\": 0}]]}, features=features)\r\n>>> ds.features\r\n{'a': List(Json())}\r\n>>> list(ds[\"a\"])\r\n[[{'b': 0}, {'c': 0}]]  # OK\r\n```\r\n\r\nAnother example with tool calling data and the `on_mixed_types=\"use_json\"` argument (useful to not have to specify `features=` manually):\r\n\r\n```python\r\n>>> messages = [\r\n...     {\"role\": \"user\", \"content\": \"Turn on the living room lights and play my electronic music playlist.\"},\r\n...     {\"role\": \"assistant\", \"tool_calls\": [\r\n...         {\"type\": \"function\", \"function\": {\r\n...             \"name\": \"control_light\",\r\n...             \"arguments\": {\"room\": \"living room\", \"state\": \"on\"}\r\n...         }},\r\n...         {\"type\": \"function\", \"function\": {\r\n...             \"name\": \"play_music\",\r\n...             \"arguments\": {\"playlist\": \"electronic\"}  # mixed-type here since keys [\"playlist\"] and [\"room\", \"state\"] are different\r\n...         }}]\r\n...     },\r\n...     {\"role\": \"tool\", \"name\": \"control_light\", \"content\": \"The lights in the living room are now on.\"},\r\n...     {\"role\": \"tool\", \"name\": \"play_music\", \"content\": \"The music is now playing.\"},\r\n...     {\"role\": \"assistant\", \"content\": \"Done!\"}\r\n... 
]\r\n>>> ds = Dataset.from_dict({\"messages\": [messages]}, on_mixed_types=\"use_json\")\r\n>>> ds.features\r\n{'messages': List({'role': Value('string'), 'content': Value('string'), 'tool_calls': List(Json()), 'name': Value('string')})}\r\n>>> ds[0][1][\"tool_calls\"][0][\"function\"][\"arguments\"]\r\n{\"room\": \"living room\", \"state\": \"on\"}\r\n```\r\n\r\n\r\n## What's Changed\r\n* Fix typos in iterable_dataset.py by @omkar-334 in https://github.com/huggingface/datasets/pull/8049\r\n* Fix non-deterministic by sorting metadata extensions (#8034) by @Nexround in https://github.com/huggingface/datasets/pull/8039\r\n* Use num_examples instead of len(self) for iterable_dataset's SplitInfo by @HaukurPall in https://github.com/huggingface/datasets/pull/8041\r\n* Fix silent data loss in push_to_hub when num_proc > num_shards by @HaukurPall in https://github.com/huggingface/datasets/pull/8044\r\n* Don't extract bad files by @lhoestq in https://github.com/huggingface/datasets/pull/8056\r\n* fix(iterable_dataset): preserve features when chaining filter() on typed IterableDataset by @s-zx in https://github.com/huggingface/datasets/pull/8053\r\n* fix: handle nested null types in feature alignment for multi-proc map by @ain-soph in https://github.com/huggingface/datasets/pull/8047\r\n* Fix unstable tokenizer fingerprinting (enables map cache reuse) by @KOKOSde in https://github.com/huggingface/datasets/pull/7982\r\n* Limit dataset listing to first 20 entries in readme by @lhoestq in https://github.com/huggingface/datasets/pull/8057\r\n\r\n## New Contributors\r\n* @omkar-334 made their first contribution in https://github.com/huggingface/datasets/pull/8049\r\n* @Nexround made their first contribution in https://github.com/huggingface/datasets/pull/8039\r\n* @HaukurPall made their first contribution in https://github.com/huggingface/datasets/pull/8041\r\n* @s-zx made their first contribution in https://github.com/huggingface/datasets/pull/8053\r\n* @ain-soph made their 
first contribution in https://github.com/huggingface/datasets/pull/8047\r\n* @KOKOSde made their first contribution in https://github.com/huggingface/datasets/pull/7982\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.6.1...4.7.0","publishedAt":"2026-03-09T19:09:25.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.7.0","media":[]},{"id":"rel_sU5xk32K5i_dLnNsEB4zf","version":"4.6.1","title":"4.6.1","summary":"## Bug fix\r\n* Remove tmp file in push to hub by @lhoestq in https://github.com/huggingface/datasets/pull/8030\r\n\r\n\r\n**Full Changelog**: https://github....","content":"## Bug fix\r\n* Remove tmp file in push to hub by @lhoestq in https://github.com/huggingface/datasets/pull/8030\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.6.0...4.6.1","publishedAt":"2026-02-27T23:27:26.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.6.1","media":[]},{"id":"rel_nk5yfPK6UiAkyxWXsRSq-","version":"4.6.0","title":"4.6.0","summary":"## Dataset Features\r\n* Support Image, Video and Audio types in Lance datasets\r\n  * Infer types from lance blobs by @lhoestq in https://github.com/hugg...","content":"## Dataset Features\r\n* Support Image, Video and Audio types in Lance datasets\r\n  * Infer types from lance blobs by @lhoestq in https://github.com/huggingface/datasets/pull/7966\r\n \r\n  ```python\r\n  >>> from datasets import load_dataset\r\n  >>> ds = load_dataset(\"lance-format/Openvid-1M\", streaming=True, split=\"train\")\r\n  >>> ds.features\r\n  {'video_blob': Video(),\r\n   'video_path': Value('string'),\r\n   'caption': Value('string'),\r\n   'aesthetic_score': Value('float64'),\r\n   'motion_score': Value('float64'),\r\n   'temporal_consistency_score': Value('float64'),\r\n   'camera_motion': Value('string'),\r\n   'frame': Value('int64'),\r\n   'fps': Value('float64'),\r\n   'seconds': Value('float64'),\r\n   'embedding': List(Value('float32'), length=1024)}\r\n  
```\r\n* Push to hub now supports Video types\r\n  * push_to_hub() for videos by @lhoestq in https://github.com/huggingface/datasets/pull/7971\r\n  \r\n  ```python\r\n   >>> from datasets import Dataset, Video\r\n  >>> ds = Dataset.from_dict({\"video\": [\"path/to/video.mp4\"]})\r\n  >>> ds = ds.cast_column(\"video\", Video())\r\n  >>> ds.push_to_hub(\"username/my-video-dataset\")\r\n  ```\r\n* Write image/audio/video blobs as is in parquet (PLAIN) in `push_to_hub()` by @lhoestq in https://github.com/huggingface/datasets/pull/7976\r\n  * this enables cross-format Xet deduplication for image/audio/video, e.g. deduplicate videos between Lance, WebDataset, Parquet files and plain video files and make downloads and uploads faster to Hugging Face\r\n  * E.g. if you convert a Lance video dataset to a Parquet video dataset on Hugging Face, the upload will be much faster since videos don't need to be reuploaded. Under the hood, the Xet storage reuses the binary chunks from the videos in Lance format for the videos in Parquet format\r\n  * See more info here: https://huggingface.co/docs/hub/en/xet/deduplication\r\n  \r\n<p align=\"center\">\r\n<a href=\"https://huggingface.co/docs/hub/en/xet/deduplication\">\r\n<img height=\"200\" alt=\"image\" src=\"https://github.com/user-attachments/assets/dd0de6a2-24a1-4945-8d25-44b763c1151e\" />\r\n</a>\r\n</p>\r\n\r\n* Add `IterableDataset.reshard()` by @lhoestq in https://github.com/huggingface/datasets/pull/7992\r\n\r\n  Reshard the dataset if possible, i.e. 
split the current shards further into more shards.\r\n  This increases the number of shards and the resulting dataset has num_shards >= previous_num_shards.\r\n  Equality may happen if no shard can be split further.\r\n\r\n  The resharding mechanism depends on the dataset file format:\r\n\r\n    * Parquet: shard per row group instead of per file\r\n    * Other: not implemented yet (contributions are welcome !)\r\n\r\n  ```python\r\n  >>> from datasets import load_dataset\r\n  >>> ds = load_dataset(\"fancyzhx/amazon_polarity\", split=\"train\", streaming=True)\r\n  >>> ds\r\n  IterableDataset({\r\n      features: ['label', 'title', 'content'],\r\n      num_shards: 4\r\n  })\r\n  >>> ds.reshard()\r\n  IterableDataset({\r\n      features: ['label', 'title', 'content'],\r\n      num_shards: 3600\r\n  })\r\n  ```\r\n\r\n## What's Changed\r\n* Fix load_from_disk progress bar with redirected stdout by @omarfarhoud in https://github.com/huggingface/datasets/pull/7919\r\n* Revert \"feat: avoid some copies in torch formatter (#7787)\" by @lhoestq in https://github.com/huggingface/datasets/pull/7961\r\n* docs: fix grammar and add type hints in splits.py by @Edge-Explorer in https://github.com/huggingface/datasets/pull/7960\r\n* Fix interleave_datasets with all_exhausted_without_replacement strategy by @prathamk-tw in https://github.com/huggingface/datasets/pull/7955\r\n* Add examples for Lance datasets by @prrao87 in https://github.com/huggingface/datasets/pull/7950\r\n* Support null in json string cols by @lhoestq in https://github.com/huggingface/datasets/pull/7963\r\n* handle blob lance by @lhoestq in https://github.com/huggingface/datasets/pull/7964\r\n* Count examples in lance by @lhoestq in https://github.com/huggingface/datasets/pull/7969\r\n* Use temp files in push_to_hub to save memory by @lhoestq in https://github.com/huggingface/datasets/pull/7979\r\n* Drop python 3.9 by @lhoestq in https://github.com/huggingface/datasets/pull/7980\r\n* Support pandas 3 by @lhoestq 
in https://github.com/huggingface/datasets/pull/7981\r\n* Remove unused data files optims by @lhoestq in https://github.com/huggingface/datasets/pull/7985\r\n* Remove pre-release workaround in CI for `transformers v5` and `huggingface_hub v1` by @hanouticelina in https://github.com/huggingface/datasets/pull/7989\r\n* very basic support for more hf urls by @lhoestq in https://github.com/huggingface/datasets/pull/8003\r\n* Bump fsspec upper bound to 2026.2.0 (fixes #7994) by @jayzuccarelli in https://github.com/huggingface/datasets/pull/7995\r\n* Fix: make environment variable naming consistent (issue #7998) by @AnkitAhlawat7742 in https://github.com/huggingface/datasets/pull/8000\r\n* More IterableDataset.from_x methods and docs and polars.Lazyframe support by @lhoestq in https://github.com/huggingface/datasets/pull/8009\r\n* Support empty shard in from_generator by @lhoestq in https://github.com/huggingface/datasets/pull/8023\r\n* Allow import polars in map() by @lhoestq in https://github.com/huggingface/datasets/pull/8024\r\n\r\n## New Contributors\r\n* @omarfarhoud made their first contribution in https://github.com/huggingface/datasets/pull/7919\r\n* @Edge-Explorer made their first contribution in https://github.com/huggingface/datasets/pull/7960\r\n* @prathamk-tw made their first contribution in https://github.com/huggingface/datasets/pull/7955\r\n* @prrao87 made their first contribution in https://github.com/huggingface/datasets/pull/7950\r\n* @hanouticelina made their first contribution in https://github.com/huggingface/datasets/pull/7989\r\n* @jayzuccarelli made their first contribution in https://github.com/huggingface/datasets/pull/7995\r\n* @AnkitAhlawat7742 made their first contribution in https://github.com/huggingface/datasets/pull/8000\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/datasets/compare/4.5.0...4.6.0","publishedAt":"2026-02-25T12:15:45.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.6.0","media":[]},{"id":"rel_5YtooSqovqtcnuRgodmpV","version":"4.5.0","title":"4.5.0","summary":"## Dataset Features\r\n* Add lance format support by @eddyxu in https://github.com/huggingface/datasets/pull/7913\r\n  * Support for both Lance dataset (i...","content":"## Dataset Features\r\n* Add lance format support by @eddyxu in https://github.com/huggingface/datasets/pull/7913\r\n  * Support for both Lance dataset (including metadata / manifests) and standalone .lance files\r\n  * e.g. with [lance-format/fineweb-edu](https://huggingface.co/datasets/lance-format/fineweb-edu)\r\n\r\n  ```python\r\n  from datasets import load_dataset\r\n\r\n  ds = load_dataset(\"lance-format/fineweb-edu\", streaming=True)\r\n  for example in ds[\"train\"]:\r\n      ...\r\n  ```\r\n\r\n## What's Changed\r\n\r\n* Raise early for invalid `revision` in `load_dataset` by @Scott-Simmons in https://github.com/huggingface/datasets/pull/7929\r\n* fix low but large example indexerror by @CloseChoice in https://github.com/huggingface/datasets/pull/7912\r\n* Fix method to retrieve attributes from file object by @lhoestq in https://github.com/huggingface/datasets/pull/7938\r\n* add _OverridableIOWrapper by @lhoestq in https://github.com/huggingface/datasets/pull/7942\r\n* Add _generate_shards by @lhoestq in https://github.com/huggingface/datasets/pull/7943\r\n\r\n## New Contributors\r\n* @eddyxu made their first contribution in https://github.com/huggingface/datasets/pull/7913\r\n* @Scott-Simmons made their first contribution in https://github.com/huggingface/datasets/pull/7929\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/datasets/compare/4.4.2...4.5.0","publishedAt":"2026-01-14T18:33:15.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.5.0","media":[]},{"id":"rel_Bx5Qz-1RiZF6lVY4eT9jf","version":"4.4.2","title":"4.4.2","summary":"## Bug fixes\r\n* Fix embed storage nifti by @CloseChoice in https://github.com/huggingface/datasets/pull/7853\r\n* ArXiv -> HF Papers by @qgallouedec in ...","content":"## Bug fixes\r\n* Fix embed storage nifti by @CloseChoice in https://github.com/huggingface/datasets/pull/7853\r\n* ArXiv -> HF Papers by @qgallouedec in https://github.com/huggingface/datasets/pull/7855\r\n* fix some broken links by @julien-c in https://github.com/huggingface/datasets/pull/7859\r\n* Nifti visualization support by @CloseChoice in https://github.com/huggingface/datasets/pull/7874\r\n* Replace papaya with niivue by @CloseChoice in https://github.com/huggingface/datasets/pull/7878\r\n* Fix 7846: add_column and add_item erroneously(?) require new_fingerprint parameter  by @sajmaru in https://github.com/huggingface/datasets/pull/7884\r\n* fix(fingerprint): treat TMPDIR as strict API and fail (Issue #7877) by @ada-ggf25 in https://github.com/huggingface/datasets/pull/7891\r\n* encode nifti correctly when uploading lazily by @CloseChoice in https://github.com/huggingface/datasets/pull/7892\r\n* fix(nifti): enable lazy loading for Nifti1ImageWrapper by @The-Obstacle-Is-The-Way in https://github.com/huggingface/datasets/pull/7887\r\n\r\n## Minor additions\r\n* Add type overloads to load_dataset for better static type inference by @Aditya2755 in https://github.com/huggingface/datasets/pull/7888\r\n* Add inspect_ai eval logs support by @lhoestq in https://github.com/huggingface/datasets/pull/7899\r\n* Save input shard lengths by @lhoestq in https://github.com/huggingface/datasets/pull/7897\r\n* Don't save original_shard_lengths by default for backward compat by @lhoestq in https://github.com/huggingface/datasets/pull/7906\r\n\r\n## New 
Contributors\r\n* @sajmaru made their first contribution in https://github.com/huggingface/datasets/pull/7884\r\n* @Aditya2755 made their first contribution in https://github.com/huggingface/datasets/pull/7888\r\n* @ada-ggf25 made their first contribution in https://github.com/huggingface/datasets/pull/7891\r\n* @The-Obstacle-Is-The-Way made their first contribution in https://github.com/huggingface/datasets/pull/7887\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.4.1...4.4.2","publishedAt":"2025-12-19T15:05:34.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.4.2","media":[]},{"id":"rel_Mv1zR4COOFCP4-znatAvV","version":"4.4.1","title":"4.4.1","summary":"## Bug fixes and improvements\r\n* Better streaming retries (504 and 429) by @lhoestq in https://github.com/huggingface/datasets/pull/7847\r\n* DOC: remov...","content":"## Bug fixes and improvements\r\n* Better streaming retries (504 and 429) by @lhoestq in https://github.com/huggingface/datasets/pull/7847\r\n* DOC: remove mode parameter in docstring of pdf and video feature by @CloseChoice in https://github.com/huggingface/datasets/pull/7848\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.4.0...4.4.1","publishedAt":"2025-11-05T16:01:38.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.4.1","media":[]},{"id":"rel_WpwjgJGnaciB1qUlHACRG","version":"4.4.0","title":"4.4.0","summary":"## Dataset Features\r\n* Add nifti support by @CloseChoice in https://github.com/huggingface/datasets/pull/7815\r\n  * Load medical imaging datasets from ...","content":"## Dataset Features\r\n* Add nifti support by @CloseChoice in https://github.com/huggingface/datasets/pull/7815\r\n  * Load medical imaging datasets from Hugging Face:\r\n\r\n  ```python\r\n  ds = load_dataset(\"username/my_nifti_dataset\")\r\n  ds[\"train\"][0]  # {\"nifti\": <nibabel.nifti1.Nifti1Image>}\r\n  ```\r\n  * Load medical imaging datasets from your disk:\r\n\r\n 
 ```python\r\n  files = [\"/path/to/scan_001.nii.gz\", \"/path/to/scan_002.nii.gz\"]\r\n  ds = Dataset.from_dict({\"nifti\": files}).cast_column(\"nifti\", Nifti())\r\n  ds[\"train\"][0]  # {\"nifti\": <nibabel.nifti1.Nifti1Image>}\r\n  ```\r\n\r\n  * Documentation: https://huggingface.co/docs/datasets/nifti_dataset\r\n* Add num channels to audio by @CloseChoice in https://github.com/huggingface/datasets/pull/7840\r\n\r\n```python\r\n# samples have shape (num_channels, num_samples)\r\nds = ds.cast_column(\"audio\", Audio())  # default, use all channels\r\nds = ds.cast_column(\"audio\", Audio(num_channels=2))  # use stereo\r\nds = ds.cast_column(\"audio\", Audio(num_channels=1))  # use mono\r\n```\r\n\r\n* Python 3.14 support by @lhoestq in https://github.com/huggingface/datasets/pull/7836\r\n\r\n## What's Changed\r\n* Fix random seed on shuffle and interleave_datasets by @CloseChoice in https://github.com/huggingface/datasets/pull/7823\r\n* fix ci compressionfs by @lhoestq in https://github.com/huggingface/datasets/pull/7830\r\n* fix: better args passthrough for `_batch_setitems()` by @sghng in https://github.com/huggingface/datasets/pull/7817\r\n* Fix: Properly render [!TIP] block in stream.shuffle documentation by @art-test-stack in https://github.com/huggingface/datasets/pull/7833\r\n* resolves the ValueError: Unable to avoid copy while creating an array by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7831\r\n* fix column with transform by @lhoestq in https://github.com/huggingface/datasets/pull/7843\r\n* support fsspec 2025.10.0 by @lhoestq in https://github.com/huggingface/datasets/pull/7844\r\n\r\n## New Contributors\r\n* @sghng made their first contribution in https://github.com/huggingface/datasets/pull/7817\r\n* @art-test-stack made their first contribution in https://github.com/huggingface/datasets/pull/7833\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/datasets/compare/4.3.0...4.4.0","publishedAt":"2025-11-04T10:42:47.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.4.0","media":[]},{"id":"rel_m6oVi5hQSXM6DNQ4GIfsL","version":"4.3.0","title":"4.3.0","summary":"## Dataset Features\r\n\r\nEnable large scale distributed dataset streaming:\r\n\r\n* Keep hffs cache in workers when streaming by @lhoestq in https://github....","content":"## Dataset Features\r\n\r\nEnable large scale distributed dataset streaming:\r\n\r\n* Keep hffs cache in workers when streaming by @lhoestq in https://github.com/huggingface/datasets/pull/7820\r\n* Retry open hf file by @lhoestq in https://github.com/huggingface/datasets/pull/7822\r\n\r\nThese improvements require `huggingface_hub>=1.1.0` to take full effect\r\n\r\n## What's Changed\r\n* fix conda deps by @lhoestq in https://github.com/huggingface/datasets/pull/7810\r\n* Add pyarrow's binary view to features by @delta003 in https://github.com/huggingface/datasets/pull/7795\r\n* Fix polars cast column image by @CloseChoice in https://github.com/huggingface/datasets/pull/7800\r\n* Allow streaming hdf5 files by @lhoestq in https://github.com/huggingface/datasets/pull/7814\r\n* Fix batch_size default description in to_polars docstrings by @albertvillanova in https://github.com/huggingface/datasets/pull/7824\r\n* docs: document_dataset PDFs & OCR by @ethanknights in https://github.com/huggingface/datasets/pull/7812\r\n* Add custom fingerprint support to `from_generator` by @simonreise in https://github.com/huggingface/datasets/pull/7533\r\n* picklable batch_fn by @lhoestq in https://github.com/huggingface/datasets/pull/7826\r\n\r\n## New Contributors\r\n* @delta003 made their first contribution in https://github.com/huggingface/datasets/pull/7795\r\n* @CloseChoice made their first contribution in https://github.com/huggingface/datasets/pull/7800\r\n* @ethanknights made their first contribution in 
https://github.com/huggingface/datasets/pull/7812\r\n* @simonreise made their first contribution in https://github.com/huggingface/datasets/pull/7533\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.2.0...4.3.0","publishedAt":"2025-10-23T16:33:59.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.3.0","media":[]},{"id":"rel_sbeaU_SPwRVvjuTTnjfHL","version":"4.2.0","title":"4.2.0","summary":"## Dataset Features\r\n* Sample without replacement option when interleaving datasets by @radulescupetru in https://github.com/huggingface/datasets/pull...","content":"## Dataset Features\r\n* Sample without replacement option when interleaving datasets by @radulescupetru in https://github.com/huggingface/datasets/pull/7786\r\n\r\n  ```python\r\n  ds = interleave_datasets(datasets, stopping_strategy=\"all_exhausted_without_replacement\")\r\n  ```\r\n\r\n* Parquet: add `on_bad_files` argument to error/warn/skip bad files by @lhoestq in https://github.com/huggingface/datasets/pull/7806\r\n\r\n  ```python\r\n  ds = load_dataset(parquet_dataset_id, on_bad_files=\"warn\")\r\n  ```\r\n\r\n* Add parquet scan options and docs by @lhoestq in https://github.com/huggingface/datasets/pull/7801\r\n\r\n  * docs to select columns and filter data efficiently\r\n\r\n  ```python\r\n  ds = load_dataset(parquet_dataset_id, columns=[\"col_0\", \"col_1\"])\r\n  ds = load_dataset(parquet_dataset_id, filters=[(\"col_0\", \"==\", 0)])\r\n  ```\r\n  * new argument to control buffering and caching when streaming\r\n\r\n  ```python\r\n  fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 << 20))\r\n  ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)\r\n  ```\r\n\r\n## What's Changed\r\n* Document HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7740\r\n* update tips in docs by @lhoestq in 
https://github.com/huggingface/datasets/pull/7790\r\n* feat: avoid some copies in torch formatter by @drbh in https://github.com/huggingface/datasets/pull/7787\r\n* Support huggingface_hub v0.x and v1.x by @Wauplin in https://github.com/huggingface/datasets/pull/7783\r\n* Define CI future by @lhoestq in https://github.com/huggingface/datasets/pull/7799\r\n* More Parquet streaming docs by @lhoestq in https://github.com/huggingface/datasets/pull/7803\r\n* Less api calls when resolving data_files by @lhoestq in https://github.com/huggingface/datasets/pull/7805\r\n* typo by @lhoestq in https://github.com/huggingface/datasets/pull/7807\r\n\r\n## New Contributors\r\n* @drbh made their first contribution in https://github.com/huggingface/datasets/pull/7787\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.1.1...4.2.0","publishedAt":"2025-10-09T16:18:22.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.2.0","media":[]},{"id":"rel_73cXMXtnUYvt-KVpXjyP0","version":"4.1.1","title":"4.1.1","summary":"## What's Changed\r\n* fix iterate nested field by @lhoestq in https://github.com/huggingface/datasets/pull/7775\r\n* Add support for arrow iterable when ...","content":"## What's Changed\r\n* fix iterate nested field by @lhoestq in https://github.com/huggingface/datasets/pull/7775\r\n* Add support for arrow iterable when concatenating or interleaving by @radulescupetru in https://github.com/huggingface/datasets/pull/7771\r\n* fix empty dataset to_parquet by @lhoestq in https://github.com/huggingface/datasets/pull/7779\r\n\r\n## New Contributors\r\n* @radulescupetru made their first contribution in https://github.com/huggingface/datasets/pull/7771\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.1.0...4.1.1","publishedAt":"2025-09-18T13:15:08.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.1.1","media":[]},{"id":"rel_1i26Tqgd1omeiiKqpPI08","version":"4.1.0","title":"4.1.0","summary":"## 
Dataset Features\r\n* feat: use content defined chunking  by @kszucs in https://github.com/huggingface/datasets/pull/7589\r\n  * Parquet datasets are n...","content":"## Dataset Features\r\n* feat: use content defined chunking  by @kszucs in https://github.com/huggingface/datasets/pull/7589\r\n  * Parquet datasets are now [Optimized Parquet](https://huggingface.co/docs/hub/datasets-libraries#optimized-parquet-files) !\r\n   <img width=\"462\" height=\"103\" alt=\"image\" src=\"https://github.com/user-attachments/assets/43703a47-0964-421b-8f01-1a790305de79\" />\r\n\r\n  * internally uses `use_content_defined_chunking=True` when writing Parquet files\r\n  * this enables fast deduped uploads to Hugging Face !\r\n  \r\n  ```python\r\n  # Now faster thanks to content defined chunking\r\n  ds.push_to_hub(\"username/dataset_name\")\r\n  ```\r\n  * this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It avoids re-uploading data that already exists somewhere on HF (in another file or version, for example). 
Parquet content defined chunking defines Parquet page boundaries based on the content of the data, in order to detect duplicate data easily.\r\n  * with this change, the new default row group size for Parquet is set to 100MB\r\n  * `write_page_index=True` is also used to enable fast random access for the Dataset Viewer and tools that need it\r\n* Concurrent push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7708\r\n* Concurrent IterableDataset push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7710\r\n* HDF5 support by @klamike in https://github.com/huggingface/datasets/pull/7690\r\n  * load HDF5 datasets in one line of code\r\n  ```python\r\n  ds = load_dataset(\"username/dataset-with-hdf5-files\")\r\n  ```\r\n  * each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows\r\n\r\n## Other improvements and bug fixes\r\n* Convert to string when needed + faster .zstd by @lhoestq in https://github.com/huggingface/datasets/pull/7683\r\n* fix audio cast storage from array + sampling_rate by @lhoestq in https://github.com/huggingface/datasets/pull/7684\r\n* Fix misleading add_column() usage example in docstring by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7648\r\n* Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in https://github.com/huggingface/datasets/pull/7438\r\n* Update fsspec max version to current release 2025.7.0 by @rootAvish in https://github.com/huggingface/datasets/pull/7701\r\n* Update dataset_dict push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7711\r\n* Retry intermediate commits too by @lhoestq in https://github.com/huggingface/datasets/pull/7712\r\n* num_proc=0 behave like None, num_proc=1 uses one worker (not main process) and clarify num_proc documentation by @tanuj-rai in https://github.com/huggingface/datasets/pull/7702\r\n* Update cli.mdx to refer to the new \"hf\" CLI by @evalstate in 
https://github.com/huggingface/datasets/pull/7713\r\n* fix num_proc=1 ci test by @lhoestq in https://github.com/huggingface/datasets/pull/7714\r\n* Docs: Use Image(mode=\"F\") for PNG/JPEG depth maps  by @lhoestq in https://github.com/huggingface/datasets/pull/7715\r\n* typo by @lhoestq in https://github.com/huggingface/datasets/pull/7716\r\n* fix largelist repr by @lhoestq in https://github.com/huggingface/datasets/pull/7735\r\n* Grammar fix: correct \"showed\" to \"shown\" in fingerprint.py by @brchristian in https://github.com/huggingface/datasets/pull/7730\r\n* Fix type hint `train_test_split` by @qgallouedec in https://github.com/huggingface/datasets/pull/7736\r\n* fix(webdataset): don't .lower() field_name by @YassineYousfi in https://github.com/huggingface/datasets/pull/7726\r\n* Refactor HDF5 and preserve tree structure by @klamike in https://github.com/huggingface/datasets/pull/7743\r\n* docs: Add column overwrite example to batch mapping guide by @Sanjaykumar030 in https://github.com/huggingface/datasets/pull/7737\r\n* Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in https://github.com/huggingface/datasets/pull/7761\r\n* Support pathlib.Path for feature input by @Joshua-Chin in https://github.com/huggingface/datasets/pull/7755\r\n* add support for pyarrow string view in features by @onursatici in https://github.com/huggingface/datasets/pull/7718\r\n* Fix typo in error message for cache directory deletion by @brchristian in https://github.com/huggingface/datasets/pull/7749\r\n* update torchcodec in ci by @lhoestq in https://github.com/huggingface/datasets/pull/7764\r\n* Bump dill to 0.4.0 by @Bomme in https://github.com/huggingface/datasets/pull/7763\r\n\r\n## New Contributors\r\n* @DavidRConnell made their first contribution in https://github.com/huggingface/datasets/pull/7438\r\n* @rootAvish made their first contribution in https://github.com/huggingface/datasets/pull/7701\r\n* @tanuj-rai made their first contribution in 
https://github.com/huggingface/datasets/pull/7702\r\n* @evalstate made their first contribution in https://github.com/huggingface/datasets/pull/7713\r\n* @brchristian made their first contribution in https://github.com/huggingface/datasets/pull/7730\r\n* @klamike made their first contribution in https://github.com/huggingface/datasets/pull/7690\r\n* @YassineYousfi made their first contribution in https://github.com/huggingface/datasets/pull/7726\r\n* @Sanjaykumar030 made their first contribution in https://github.com/huggingface/datasets/pull/7737\r\n* @kszucs made their first contribution in https://github.com/huggingface/datasets/pull/7589\r\n* @Joshua-Chin made their first contribution in https://github.com/huggingface/datasets/pull/7755\r\n* @onursatici made their first contribution in https://github.com/huggingface/datasets/pull/7718\r\n* @Bomme made their first contribution in https://github.com/huggingface/datasets/pull/7763\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/4.0.0...4.1.0","publishedAt":"2025-09-15T16:41:46.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.1.0","media":[]},{"id":"rel_NLmKOkWvV98CCUyozeZj8","version":"4.0.0","title":"4.0.0","summary":"## New Features\r\n* Add `IterableDataset.push_to_hub()` by @lhoestq in https://github.com/huggingface/datasets/pull/7595\r\n\r\n  ```python\r\n  # Build stre...","content":"## New Features\r\n* Add `IterableDataset.push_to_hub()` by @lhoestq in https://github.com/huggingface/datasets/pull/7595\r\n\r\n  ```python\r\n  # Build streaming data pipelines in a few lines of code !\r\n  from datasets import load_dataset\r\n\r\n  ds = load_dataset(..., streaming=True)\r\n  ds = ds.map(...).filter(...)\r\n  ds.push_to_hub(...)\r\n  ```\r\n\r\n* Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https://github.com/huggingface/datasets/pull/7606\r\n\r\n  ```python\r\n  # Faster push to Hub ! 
Available for both Dataset and IterableDataset\r\n  ds.push_to_hub(..., num_proc=8)\r\n  ```\r\n\r\n* New `Column` object\r\n  - Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in https://github.com/huggingface/datasets/pull/7564\r\n  - Lazy column by @lhoestq in https://github.com/huggingface/datasets/pull/7614\r\n\r\n  ```python\r\n  # Syntax:\r\n  ds[\"column_name\"]  # datasets.Column([...]) or datasets.IterableColumn(...)\r\n\r\n  # Iterate on a column:\r\n  for text in ds[\"text\"]:\r\n      ...\r\n\r\n  # Load one cell without bringing the full column in memory\r\n  first_text = ds[\"text\"][0]  # equivalent to ds[0][\"text\"]\r\n  ```\r\n* Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616\r\n  - Enables streaming only the ranges you need ! \r\n\r\n  ```python\r\n  # Don't download full audios/videos when it's not necessary\r\n  # Now with torchcodec it only streams the required ranges/frames:\r\n  from datasets import load_dataset\r\n\r\n  ds = load_dataset(..., streaming=True)\r\n  for example in ds:\r\n      video = example[\"video\"]\r\n      frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames\r\n  ```\r\n\r\n  - Requires `torch>=2.7.0` and FFmpeg >= 4\r\n  - Not available for Windows yet but it is [coming soon](https://github.com/pytorch/torchcodec/issues/640) - in the meantime please use `datasets<4.0`\r\n  - Load audio data with `AudioDecoder`:\r\n\r\n  ```python\r\n  audio = dataset[0][\"audio\"]  # <datasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>\r\n  samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)\r\n  samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]]\r\n  samples.sample_rate  # 16000\r\n\r\n  # old syntax is still supported\r\n  array, sr = audio[\"array\"], audio[\"sampling_rate\"]\r\n  ```\r\n\r\n  - Load video data with 
`VideoDecoder`:\r\n\r\n  ```python\r\n  video = dataset[0][\"video\"]  # <torchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>\r\n  first_frame = video.get_frame_at(0)\r\n  first_frame.data.shape  # (3, 240, 320)\r\n  first_frame.pts_seconds  # 0.0\r\n  frames = video.get_frames_in_range(0, 6, 1)\r\n  frames.data.shape  # torch.Size([5, 3, 240, 320])\r\n  ```\r\n\r\n## Breaking changes\r\n* Remove scripts altogether by @lhoestq in https://github.com/huggingface/datasets/pull/7592\r\n  - `trust_remote_code` is no longer supported \r\n* Torchcodec decoding by @TyTodd in https://github.com/huggingface/datasets/pull/7616\r\n  - torchcodec replaces soundfile for audio decoding\r\n  - torchcodec replaces decord for video decoding\r\n* Replace Sequence by List by @lhoestq in https://github.com/huggingface/datasets/pull/7634\r\n  - Introduction of the `List` type\r\n\r\n  ```python\r\n  from datasets import Features, List, Value\r\n\r\n  features = Features({\r\n      \"texts\": List(Value(\"string\")),\r\n      \"four_paragraphs\": List(Value(\"string\"), length=4)\r\n  })\r\n  ```\r\n\r\n  - `Sequence` was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. 
It is no longer a type but it becomes a utility that returns a `List` or a `dict` depending on the subfeature\r\n\r\n  ```python\r\n  from datasets import Sequence\r\n\r\n  Sequence(Value(\"string\"))  # List(Value(\"string\"))\r\n  Sequence({\"texts\": Value(\"string\")})  # {\"texts\": List(Value(\"string\"))}\r\n  ```\r\n\r\n## Other improvements and bug fixes\r\n* Refactor `Dataset.map` to reuse cache files mapped with different `num_proc` by @ringohoffman in https://github.com/huggingface/datasets/pull/7434\r\n* fix string_to_dict test by @lhoestq in https://github.com/huggingface/datasets/pull/7571\r\n* Preserve formatting in concatenated IterableDataset by @francescorubbo in https://github.com/huggingface/datasets/pull/7522\r\n* Fix typos in PDF and Video documentation by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7579\r\n* fix: Add embed_storage in Pdf feature by @AndreaFrancis in https://github.com/huggingface/datasets/pull/7582\r\n* load_dataset splits typing by @lhoestq in https://github.com/huggingface/datasets/pull/7587\r\n* Fixed typos by @TopCoder2K in https://github.com/huggingface/datasets/pull/7572\r\n* Fix regex library warnings by @emmanuel-ferdman in https://github.com/huggingface/datasets/pull/7576\r\n* [MINOR:TYPO] Update save_to_disk docstring by @cakiki in https://github.com/huggingface/datasets/pull/7575\r\n* Add missing property on `RepeatExamplesIterable` by @SilvanCodes in https://github.com/huggingface/datasets/pull/7581\r\n* Avoid multiple default config names by @albertvillanova in https://github.com/huggingface/datasets/pull/7585\r\n* Fix broken link to albumentations by @ternaus in https://github.com/huggingface/datasets/pull/7593\r\n* fix string_to_dict usage for windows by @lhoestq in https://github.com/huggingface/datasets/pull/7598\r\n* No TF in win tests by @lhoestq in https://github.com/huggingface/datasets/pull/7603\r\n* Docs and more methods for IterableDataset: push_to_hub, to_parquet... 
by @lhoestq in https://github.com/huggingface/datasets/pull/7604\r\n* Tests typing and fixes for push_to_hub by @lhoestq in https://github.com/huggingface/datasets/pull/7608\r\n* fix parallel push_to_hub in dataset_dict by @lhoestq in https://github.com/huggingface/datasets/pull/7613\r\n* remove unused code by @lhoestq in https://github.com/huggingface/datasets/pull/7615\r\n* Update `_dill.py` to use `co_linetable` for Python 3.10+ in place of `co_lnotab` by @qgallouedec in https://github.com/huggingface/datasets/pull/7609\r\n* Fixes in docs by @lhoestq in https://github.com/huggingface/datasets/pull/7620\r\n* Add albumentations to use dataset by @ternaus in https://github.com/huggingface/datasets/pull/7596\r\n* minor docs data aug by @lhoestq in https://github.com/huggingface/datasets/pull/7621\r\n* fix: raise error in FolderBasedBuilder when data_dir and data_files are missing by @ArjunJagdale in https://github.com/huggingface/datasets/pull/7623\r\n* fix save_infos by @lhoestq in https://github.com/huggingface/datasets/pull/7639\r\n* better features repr by @lhoestq in https://github.com/huggingface/datasets/pull/7640\r\n* update docs and docstrings by @lhoestq in https://github.com/huggingface/datasets/pull/7641\r\n* fix length for ci by @lhoestq in https://github.com/huggingface/datasets/pull/7642\r\n* Backward compat sequence instance by @lhoestq in https://github.com/huggingface/datasets/pull/7643\r\n* fix sequence ci by @lhoestq in https://github.com/huggingface/datasets/pull/7644\r\n* Custom metadata filenames by @lhoestq in https://github.com/huggingface/datasets/pull/7663\r\n* Update the beans dataset link in Preprocess by @HJassar in https://github.com/huggingface/datasets/pull/7659\r\n* Backward compat list feature by @lhoestq in https://github.com/huggingface/datasets/pull/7666\r\n* Fix infer list of images by @lhoestq in https://github.com/huggingface/datasets/pull/7667\r\n* Fix audio bytes by @lhoestq in 
https://github.com/huggingface/datasets/pull/7670\r\n* Fix double sequence by @lhoestq in https://github.com/huggingface/datasets/pull/7672\r\n\r\n## New Contributors\r\n* @TopCoder2K made their first contribution in https://github.com/huggingface/datasets/pull/7564\r\n* @francescorubbo made their first contribution in https://github.com/huggingface/datasets/pull/7522\r\n* @emmanuel-ferdman made their first contribution in https://github.com/huggingface/datasets/pull/7576\r\n* @SilvanCodes made their first contribution in https://github.com/huggingface/datasets/pull/7581\r\n* @ternaus made their first contribution in https://github.com/huggingface/datasets/pull/7593\r\n* @ArjunJagdale made their first contribution in https://github.com/huggingface/datasets/pull/7623\r\n* @TyTodd made their first contribution in https://github.com/huggingface/datasets/pull/7616\r\n* @HJassar made their first contribution in https://github.com/huggingface/datasets/pull/7659\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/3.6.0...4.0.0","publishedAt":"2025-07-09T14:54:50.000Z","url":"https://github.com/huggingface/datasets/releases/tag/4.0.0","media":[]},{"id":"rel_-LaQzPozFHUy4avicjdPk","version":"3.6.0","title":"3.6.0","summary":"## Dataset Features\r\n* Enable xet in push to hub by @lhoestq in https://github.com/huggingface/datasets/pull/7552\r\n  * Faster downloads/uploads with X...","content":"## Dataset Features\r\n* Enable xet in push to hub by @lhoestq in https://github.com/huggingface/datasets/pull/7552\r\n  * Faster downloads/uploads with Xet storage\r\n  * more info: https://github.com/huggingface/datasets/issues/7526\r\n\r\n## Other improvements and bug fixes\r\n* Add try_original_type to DatasetDict.map by @yoshitomo-matsubara in https://github.com/huggingface/datasets/pull/7544\r\n* Avoid global umask for setting file mode. 
by @ryan-clancy in https://github.com/huggingface/datasets/pull/7547\r\n* Rebatch arrow iterables before formatted iterable by @lhoestq in https://github.com/huggingface/datasets/pull/7553\r\n* Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation by @Harry-Yang0518 in https://github.com/huggingface/datasets/pull/7532\r\n* fix regression by @lhoestq in https://github.com/huggingface/datasets/pull/7558\r\n* fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) by @giraffacarp in https://github.com/huggingface/datasets/pull/7521\r\n* Remove `aiohttp` from direct dependencies by @akx in https://github.com/huggingface/datasets/pull/7294\r\n\r\n## New Contributors\r\n* @ryan-clancy made their first contribution in https://github.com/huggingface/datasets/pull/7547\r\n* @Harry-Yang0518 made their first contribution in https://github.com/huggingface/datasets/pull/7532\r\n* @giraffacarp made their first contribution in https://github.com/huggingface/datasets/pull/7521\r\n* @akx made their first contribution in https://github.com/huggingface/datasets/pull/7294\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/3.5.1...3.6.0","publishedAt":"2025-05-07T15:17:49.000Z","url":"https://github.com/huggingface/datasets/releases/tag/3.6.0","media":[]},{"id":"rel_Vvso82HOpLLkueZ5SLHFv","version":"3.5.1","title":"3.5.1","summary":"## Bug fixes\r\n* support pyarrow 20 by @lhoestq in https://github.com/huggingface/datasets/pull/7540\r\n  * Fix pyarrow error `TypeError: ArrayExtensionA...","content":"## Bug fixes\r\n* support pyarrow 20 by @lhoestq in https://github.com/huggingface/datasets/pull/7540\r\n  * Fix pyarrow error `TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'`\r\n* Write pdf in map by @lhoestq in https://github.com/huggingface/datasets/pull/7487\r\n\r\n## Other improvements\r\n* update fsspec 2025.3.0 by @peteski22 in 
https://github.com/huggingface/datasets/pull/7478\r\n* Support underscore int read instruction by @lhoestq in https://github.com/huggingface/datasets/pull/7488\r\n* Support skip_trying_type  by @yoshitomo-matsubara in https://github.com/huggingface/datasets/pull/7483\r\n* pdf docs fixes by @lhoestq in https://github.com/huggingface/datasets/pull/7519\r\n* Remove conditions for Python < 3.9 by @cyyever in https://github.com/huggingface/datasets/pull/7474\r\n* mention av in video docs by @lhoestq in https://github.com/huggingface/datasets/pull/7523\r\n* correct use with polars example by @SiQube in https://github.com/huggingface/datasets/pull/7524\r\n* chore: fix typos by @afuetterer in https://github.com/huggingface/datasets/pull/7436\r\n\r\n## New Contributors\r\n* @peteski22 made their first contribution in https://github.com/huggingface/datasets/pull/7478\r\n* @yoshitomo-matsubara made their first contribution in https://github.com/huggingface/datasets/pull/7483\r\n* @SiQube made their first contribution in https://github.com/huggingface/datasets/pull/7524\r\n* @afuetterer made their first contribution in https://github.com/huggingface/datasets/pull/7436\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/3.5.0...3.5.1","publishedAt":"2025-04-28T14:02:58.000Z","url":"https://github.com/huggingface/datasets/releases/tag/3.5.1","media":[]},{"id":"rel_D31TdxxTAl7n08ZLFyIih","version":"3.5.0","title":"3.5.0","summary":"## Datasets Features\r\n* Introduce PDF support (#7318) by @yabramuvdi in https://github.com/huggingface/datasets/pull/7325\r\n\r\n```python\r\n>>> from datas...","content":"## Datasets Features\r\n* Introduce PDF support (#7318) by @yabramuvdi in https://github.com/huggingface/datasets/pull/7325\r\n\r\n```python\r\n>>> from datasets import load_dataset, Pdf\r\n>>> repo = \"path/to/pdf/folder\"  # or username/dataset_name on Hugging Face\r\n>>> dataset = load_dataset(repo, split=\"train\")\r\n>>> 
dataset[0][\"pdf\"]\r\n<pdfplumber.pdf.PDF at 0x1075bc320>\r\n>>> dataset[0][\"pdf\"].pages[0].extract_text()\r\n...\r\n```\r\n\r\n## What's Changed\r\n* Fix local pdf loading by @lhoestq in https://github.com/huggingface/datasets/pull/7466\r\n* Minor fix for metadata files in extension counter by @lhoestq in https://github.com/huggingface/datasets/pull/7464\r\n* Priotitize json by @lhoestq in https://github.com/huggingface/datasets/pull/7476\r\n\r\n## New Contributors\r\n* @yabramuvdi made their first contribution in https://github.com/huggingface/datasets/pull/7325\r\n\r\n**Full Changelog**: https://github.com/huggingface/datasets/compare/3.4.1...3.5.0","publishedAt":"2025-03-27T16:38:30.000Z","url":"https://github.com/huggingface/datasets/releases/tag/3.5.0","media":[]}],"pagination":{"page":1,"pageSize":20,"totalPages":5,"totalItems":100},"summaries":{"rolling":{"windowDays":90,"summary":"The library shifted toward flexible data handling and cloud-native workflows. It added support for Hugging Face Storage Buckets, letting developers load raw data directly from cloud storage, process it, and publish results—fixing multiprocessed operations on macOS in the process. Concurrently, the `Json()` type shipped to handle mixed-type fields that Arrow normally rejects, essential for datasets with arbitrary JSON structures like tool-calling examples. Support for Lance format and rich media types (Image, Video, Audio) expanded the ecosystem for specialized data formats, with `push_to_hub()` graduating to handle videos alongside structured data.","releaseCount":9,"generatedAt":"2026-04-07T17:27:57.385Z"},"monthly":[{"year":2026,"month":3,"summary":"Expanded data loading flexibility and fixed regressions across the 4.7–4.8 line. 
The major addition was HF Storage Bucket support, letting developers load raw data directly from cloud storage, process it with map/filter operations, and push the result to dataset repos—while also fixing a macOS segfault in multiprocessed push_to_hub. The `Json()` type arrived to handle mixed-type fields in datasets like tool calling use cases, complemented by quick fixes to JSON parsing, distributed node splitting, and Arrow iteration behavior.","releaseCount":6,"generatedAt":"2026-04-07T17:27:59.753Z"}]}}