Transformers shipped v5.0, its first major overhaul in five years, reworking the tokenization APIs and introducing dynamic weight loading with quantization support, while a wave of multimodal and specialized model integrations accelerated in parallel. Gemma 4 arrived with vision capabilities that handle variable image sizes via spatial 2D RoPE; VidEoMT landed as a lightweight video segmentation encoder achieving 5-10x speedups; and the library absorbed a steady stream of domain-specific architectures, from speech (VibeVoice ASR, VoxtralRealtime) and document understanding (PP-DocLayoutV3, UVDoc) to mixture-of-experts variants (EXAONE-MoE, GLM-5) and multilingual models (EuroBERT). Concurrent v5 release candidates prioritized MoE performance optimizations using batched expert implementations and resolved tokenizer class enforcement issues by preferring the TokenizersBackend, while the v4 line stabilized with targeted fixes for model loading and generation methods.
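The spatial 2D RoPE mentioned above is what lets a vision model handle variable image sizes: instead of learning an embedding per position in a fixed grid, it rotates one half of each patch's channels by the patch's row index and the other half by its column index, so relative spatial position is encoded for any grid shape. A minimal NumPy sketch of that idea (the function names and the half/half channel split are illustrative assumptions, not Gemma 4's actual implementation):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply a 1D rotary embedding to x (..., d) at integer positions pos."""
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = pos[..., None] * freqs                # (..., half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # Rotate each (x1[i], x2[i]) pair by its angle; norms are preserved.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def spatial_2d_rope(patches, rows, cols):
    """2D variant: first half of channels encodes row, second half column."""
    d = patches.shape[-1]
    a, b = patches[..., : d // 2], patches[..., d // 2 :]
    return np.concatenate([rope_rotate(a, rows), rope_rotate(b, cols)], axis=-1)

# A 3x2 patch grid; any grid size works, no fixed-resolution table is needed.
rows, cols = np.meshgrid(np.arange(3), np.arange(2), indexing="ij")
patches = np.random.default_rng(0).normal(size=(3, 2, 8))
out = spatial_2d_rope(patches, rows, cols)
print(out.shape)  # (3, 2, 8)
```

Because the rotation angle depends only on the patch's own coordinates, the same weights serve a 3x2 grid or a 64x48 one, which is the property that makes variable image sizes cheap to support.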
March shipped new model support across the vision, audio, and language domains. VidEoMT brought lightweight video segmentation running at 160 FPS via query propagation across frames, while EuroBERT added multilingual encoding with an 8192-token context window. The month also integrated PaddlePaddle models, Mistral 4, and Jina Embeddings v3, alongside specialized models for document layout, speech recognition, and time-series forecasting.
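Query propagation, the technique credited above for VidEoMT's frame rate, amounts to running the expensive decoder only on the first frame and then carrying its object queries forward as the initialization for every later frame, where a much cheaper refinement step suffices. A toy sketch of that control flow, with invented stand-in functions (`heavy_decode`, `refine`) rather than the model's real modules:

```python
import numpy as np

rng = np.random.default_rng(0)

def heavy_decode(frame, num_queries=4):
    """Stand-in for a full mask decoder: build object queries from scratch."""
    ctx = frame.mean(axis=(0, 1))                          # (C,) global context
    return ctx + rng.normal(0.0, 0.01, size=(num_queries, frame.shape[-1]))

def refine(queries, frame, lr=0.5):
    """Cheap per-frame step: nudge propagated queries toward the new frame."""
    target = frame.mean(axis=(0, 1))
    return queries + lr * (target - queries)

def segment_video(frames):
    """Decode queries once, then propagate and refine across the clip."""
    queries = heavy_decode(frames[0])                      # heavy work: frame 0 only
    outputs = [queries]
    for frame in frames[1:]:
        queries = refine(queries, frame)                   # light work per frame
        outputs.append(queries)
    return outputs

video = rng.normal(size=(5, 16, 16, 8))                    # 5 frames, 16x16, 8 channels
per_frame_queries = segment_video(video)
print(len(per_frame_queries), per_frame_queries[0].shape)  # 5 (4, 8)
```

The speedup comes from the asymmetry in cost: one heavy decode amortized over the clip, with temporal consistency falling out for free because each frame's queries start from the previous frame's.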