Transformers v5

03/12/2025

Overview

The release of Transformers v5 on November 30, 2025, as v5.0.0rc0 (the first release candidate), marks a seismic shift in the AI development landscape: it is the library’s biggest update in over five years. This new version isn’t just an incremental patch; it’s a complete reimagining designed for the modern AI era. With a new modular design, first-class support for quantization, and enhanced serving capabilities, Transformers v5 is optimized exclusively for PyTorch and built to be fully interoperable with essential tools like vLLM, SGLang, llama.cpp, MLX, and ONNX. Since v4’s launch in November 2020, daily downloads have surged from 20,000 to over 3 million, with total installations exceeding 1.2 billion. This update solidifies the library’s position as the go-to framework for building, training, and deploying state-of-the-art models.

Key Features

These new capabilities are powered by a set of core features designed for maximum efficiency and flexibility. Here’s what stands out in version 5:

  • Modular Architecture: The redesigned architecture allows developers to more easily mix, match, and customize components. This modularity simplifies experimentation and the creation of novel model structures. The introduction of AttentionInterface provides a centralized abstraction layer for attention mechanisms, consolidating FlashAttention variants, FlexAttention, and SDPA while keeping Eager methods in modeling files.
  • First-Class 4-bit/8-bit Quantization: Native, high-quality support for quantization allows models to run with a significantly smaller memory footprint and faster inference speeds, making it possible to deploy large models on consumer-grade hardware. This represents a fundamental change to weight loading, with quantization integrated at the core level rather than as an add-on feature. The release includes enhanced collaboration with TorchAO and bitsandbytes for tensor parallelism and mixture-of-experts models (see the quantization sketch after this list).
  • Enhanced Serving Capabilities: A new transformers serve command deploys an OpenAI API-compatible server for serving Transformers models. Additionally, the release introduces inference-specific features including continuous batching and paged attention mechanisms designed for high-volume scenarios like model evaluation.
  • PyTorch-First Optimization: The library now fully embraces PyTorch, dropping support for TensorFlow and Flax to deliver a more streamlined, optimized, and deeply integrated experience for the PyTorch ecosystem. This decision removes approximately 50% of the codebase, reducing abstraction layers and bloat. TensorFlow and Flax support will remain available in v4.x as Long-Term Support until approximately mid-2026.
  • Expanded Model Ecosystem: The library has grown from 20 architectures in v4.0 to over 400 model architectures in v5, with an average of 1-3 new models added weekly over the past five years.
  • Format Interoperability: GGUF files can now be loaded in Transformers for further fine-tuning, and models can be easily converted to Safetensors, ONNX, or GGUF formats for deployment across different frameworks.
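As a concrete illustration of the quantization and AttentionInterface points above, the sketch below loads a causal language model in 4-bit precision via bitsandbytes and selects an attention backend at load time. The checkpoint name is a placeholder, and the keyword arguments follow the v4-era API, so treat this as a sketch of the intended workflow rather than a verbatim v5 recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.2-1B"  # placeholder checkpoint, for illustration only

# 4-bit NF4 quantization via bitsandbytes; computation runs in bfloat16
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    attn_implementation="sdpa",  # one of the backends consolidated behind AttentionInterface
    device_map="auto",
)

inputs = tokenizer("Transformers v5 is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```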

How It Works

At its core, Transformers v5 simplifies the complex process of working with large language models. Developers begin by importing pre-trained model definitions from the massive Hugging Face Hub or defining their own. From there, the library takes over the heavy lifting, automatically handling complex kernel optimizations to ensure peak performance during training and inference.
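As a minimal sketch of that workflow, the familiar pipeline API pulls a checkpoint from the Hub and runs generation in a few lines (the model name is illustrative):

```python
from transformers import pipeline

# Downloads the checkpoint from the Hugging Face Hub on first use.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")
print(generator("Transformers v5 is", max_new_tokens=20)[0]["generated_text"])
```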

The v5 release emphasizes three core workflows: pre-training (with redesigned model initialization and optimized forward/backward propagation operators), fine-tuning and post-training (maintaining compatibility with Python ecosystem tools and expanding support for JAX ecosystem tools like MaxText), and inference (with dedicated kernels, cleaner default settings, new APIs, and optimized support for inference engines).
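For the fine-tuning workflow in particular, here is a hedged sketch using the long-standing Trainer API; the checkpoint and dataset are illustrative, and some argument names (for example, the processing_class keyword, formerly tokenizer) may evolve across the v5 release candidates.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "distilbert-base-uncased"               # illustrative base model
dataset = load_dataset("imdb", split="train[:1%]")   # illustrative dataset slice

tokenizer = AutoTokenizer.from_pretrained(checkpoint)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)
args = TrainingArguments(
    output_dir="finetune-out",
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(model=model, args=args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
```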

Once a model is ready, the framework provides seamless export capabilities, allowing for easy conversion and deployment to specialized, high-performance inference engines like vLLM, SGLang, llama.cpp (GGUF), MLX, and ONNXRuntime. The automated model conversion tooling uses machine learning to identify code similarities between modeling files, potentially opening draft pull requests for new model integrations.

Use Cases

This powerful and flexible framework unlocks a wide range of applications for developers, researchers, and businesses. Key use cases include:

  • Fine-tuning LLMs on Custom Data: Easily adapt powerful foundation models to specific domains or tasks by training them on your proprietary datasets with enhanced quantization-aware training support.
  • Serving Models via Production APIs: Deploy fine-tuned models as robust, scalable, and high-performance APIs ready for integration into production applications using the new transformers serve command (see the serving sketch after this list).
  • Converting Models to GGUF: Effortlessly convert models into the GGUF format, making them compatible with llama.cpp for efficient execution on CPUs and local machines, or load GGUF files directly for further fine-tuning.
  • Research & Development: Leverage the modular architecture and cutting-edge features to experiment with new model designs and push the boundaries of AI research.
  • Pre-training Large Models: The release now provides robust support for large-scale pre-training through compatibility with torchtitan, megatron, and nanotron, expanding beyond the previous focus on fine-tuning.
  • Agentic Workflows: Enhanced support for agentic use cases through OpenEnv or Prime Environment Hub integration, leveraging standardized model definitions across tools.
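Because transformers serve exposes an OpenAI-compatible endpoint, any OpenAI client can talk to it. The sketch below assumes the server runs at a local address (the base URL and port are assumptions, not documented defaults) and uses an illustrative model name; check the command's --help output for the actual options.

```python
# In a shell, start the server first (options omitted; see `transformers serve --help`):
#   transformers serve
#
# Then query it with the standard openai client, pointed at the local endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # illustrative model name
    messages=[{"role": "user", "content": "Summarize Transformers v5 in one sentence."}],
)
print(response.choices[0].message.content)
```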

Pros & Cons

Like any major strategic update, Transformers v5 comes with a distinct set of advantages and trade-offs.

Advantages

  • The Ecosystem Standard: It remains the undisputed industry standard, trusted by researchers and enterprises worldwide as the source of truth for model definitions.
  • Massive Community: Access to a vast and active community provides unparalleled support, resources, and a constant stream of new models, contributing to 1.2 billion total installations.
  • Free and Open Source: The library is completely free to use, making state-of-the-art AI accessible to everyone.
  • Streamlined Codebase: Removal of TensorFlow and Flax support eliminates approximately 50% of the code, reducing abstraction layers and simplifying maintenance.
  • Production-Ready Inference: Enhanced interoperability with vLLM, SGLang, and other specialized inference engines ensures best-in-class deployment performance.
  • Quantization as Core Feature: First-class quantization support reflects industry trends, with many SOTA models now released in low-precision formats (gpt-oss, Kimi-K2, DeepSeek-R1).

Disadvantages

  • Reduced Framework Flexibility: By dropping support for TensorFlow and Flax, v5 forces teams that have built their stacks around those frameworks to either undertake a migration or remain on v4.x LTS (supported until mid-2026).
  • Learning Curve for Migration: Users need to review breaking changes in tokenization (removal of the “Fast”/“Slow” tokenizer distinction), processing (nested dict serialization in processor_config.json), and modeling (removal of head masking and of relative positional biases in BERT-like models).
  • JAX/Flax Users Require Workarounds: While Hugging Face is working with partners like MaxText, MaxDiffusion, and Keras 3 to maintain compatibility, JAX/Flax users must rely on external tools rather than native support.

How Does It Compare?

Transformers v5 vs. PyTorch/Keras

Transformers is not a direct competitor to foundational frameworks like PyTorch or Keras. Instead, it is a high-level library that sits on top of PyTorch (exclusively in v5), abstracting away much of the complexity involved in building and training transformer-based models. PyTorch provides the tensor operations and autograd system; Transformers provides pre-built model architectures, tokenizers, and training utilities.

Relationship:

  • PyTorch: Foundation framework for tensor computation and automatic differentiation
  • Transformers v5: Model-definition library built exclusively on PyTorch, providing 400+ architectures with standardized APIs

When to Use Transformers: For working with pre-trained transformer models, implementing state-of-the-art architectures, or fine-tuning foundation models.
When to Use PyTorch Directly: For custom neural network architectures beyond transformers, low-level tensor operations, or building from scratch without pre-trained models.
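A small sketch makes the layering concrete: a Transformers model is an ordinary torch.nn.Module, so the usual PyTorch training machinery applies to it directly (the checkpoint below is illustrative).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "distilbert-base-uncased"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# The returned model is a plain torch.nn.Module, so standard PyTorch tooling
# (optimizers, autograd, DDP, torch.compile) works on it without adaptation.
assert isinstance(model, torch.nn.Module)

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
batch = tokenizer(["great library", "needs work"], return_tensors="pt", padding=True)
labels = torch.tensor([1, 0])

outputs = model(**batch, labels=labels)  # loss is computed by the task head
outputs.loss.backward()
optimizer.step()
```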

Transformers v5 vs. vLLM/SGLang

Previously seen as separate tools for inference, vLLM and SGLang are now key partners in the ecosystem. Transformers v5 is designed for seamless interoperability with these engines, allowing developers to train/fine-tune models in Transformers and serve them with vLLM or SGLang for best-in-class inference performance.

Transformers v5 Role:

  • Model definition and training
  • Fine-tuning and customization
  • Research and experimentation
  • Export to production formats

vLLM/SGLang Role:

  • Optimized production inference
  • High-throughput serving
  • Advanced batching and scheduling
  • Specialized kernel optimizations

Philosophy: Transformers v5 does not aim to replace specialized inference engines. Instead, it focuses on compatibility and interoperability, ensuring smooth workflows from development to production deployment.
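In practice the handoff looks like the sketch below: save a (fine-tuned) checkpoint with Transformers, then point vLLM at the saved directory. The base checkpoint is illustrative, and the vLLM calls follow its commonly documented offline-inference API.

```python
# Step 1: fine-tune and save with Transformers (training loop omitted for brevity)
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative base model
model = AutoModelForCausalLM.from_pretrained(checkpoint)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
# ... fine-tuning would happen here ...
model.save_pretrained("./my-finetuned-model")
tokenizer.save_pretrained("./my-finetuned-model")

# Step 2: serve the saved checkpoint with vLLM for high-throughput inference
from vllm import LLM, SamplingParams

llm = LLM(model="./my-finetuned-model")
params = SamplingParams(max_tokens=64, temperature=0.7)
print(llm.generate(["Explain paged attention briefly."], params)[0].outputs[0].text)
```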

Transformers v5 vs. llama.cpp/GGUF

llama.cpp is a C++ inference engine optimized for running LLMs on CPUs with GGUF format quantization.

Interoperability:

  • Transformers v5 can now load GGUF files directly for further fine-tuning (sketched below)
  • Models trained in Transformers can be easily converted to GGUF format for llama.cpp deployment
  • This bidirectional compatibility enables training in Python with Transformers and deploying on CPUs with llama.cpp
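A minimal sketch of the GGUF loading path, using the gguf_file argument to from_pretrained; the repository and file names are illustrative.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative GGUF repository and file name; substitute a real quantized checkpoint.
repo_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

# The GGUF weights are dequantized into a regular PyTorch model,
# ready for further fine-tuning or re-export.
tokenizer = AutoTokenizer.from_pretrained(repo_id, gguf_file=gguf_file)
model = AutoModelForCausalLM.from_pretrained(repo_id, gguf_file=gguf_file)
```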

When to Use Transformers: For training, fine-tuning, GPU-based experimentation, and Python ecosystem workflows.
When to Use llama.cpp: For CPU-based inference, edge device deployment, and situations requiring minimal dependencies.

Transformers v5 vs. MLX/ONNX

MLX (Apple’s machine learning framework) and ONNX (Open Neural Network Exchange) represent additional deployment targets.

Compatibility:

  • Transformers v5 provides direct export to ONNX format for cross-platform deployment (see the sketch after this list)
  • Safetensors format enables direct compatibility with MLX
  • Collaboration with Awni Hannun (MLX team) ensures smooth interoperability
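One common ONNX export path today goes through the Optimum companion library rather than Transformers itself; the sketch below assumes Optimum is installed and uses an illustrative checkpoint.

```python
# Export path via the Optimum companion library (an assumption about the typical route).
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

checkpoint = "Qwen/Qwen2.5-0.5B-Instruct"  # illustrative checkpoint

# export=True converts the PyTorch weights into an ONNX graph on the fly,
# which can then be saved and run with ONNX Runtime on other platforms.
ort_model = ORTModelForCausalLM.from_pretrained(checkpoint, export=True)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
ort_model.save_pretrained("./qwen-onnx")
tokenizer.save_pretrained("./qwen-onnx")
```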

Ecosystem Position: Transformers v5 serves as the central model definition hub, with standardized formats enabling deployment to specialized frameworks without friction.

Transformers v5 vs. MaxText/Keras 3 (JAX Ecosystem)

For users requiring JAX/Flax functionality after v5’s PyTorch-only decision:

Hugging Face Strategy:

  • Collaboration with MaxText, MaxDiffusion, and Keras 3 to use Transformers model definitions
  • Maximum compatibility maintained with JAX ecosystem outside direct PyTorch support
  • Model definitions remain accessible; execution shifts to partner tools

Migration Path: JAX users can stay on v4.x LTS (supported until mid-2026) or migrate to partner tools that consume Transformers model definitions.

Final Thoughts

Transformers v5 is a bold and forward-thinking update that aligns the library with the dominant trends in the AI industry. Its deep integration with PyTorch, focus on efficient deployment through quantization, and partnership with tools like vLLM, SGLang, and llama.cpp make it an even more indispensable part of the modern AI developer’s toolkit.

The decision to drop TensorFlow and Flax support, while eliminating approximately 50% of the codebase and reducing abstraction layers, represents a strategic bet on PyTorch’s consolidation as the industry standard. The PyTorch Foundation’s executive director Matt White emphasized this shift, noting that Transformers is “going all in on PyTorch” with v5.

The move allows for a more focused and optimized experience that will undoubtedly accelerate innovation for the vast majority of the community. With quantization as a first-class citizen, enhanced inference capabilities, and seamless interoperability across the AI stack, Transformers v5 establishes itself as the definitive model-definition framework for the next era of AI development.

While the departure from TensorFlow and Flax may be a hurdle for some users, the v4.x Long-Term Support (maintained until mid-2026) provides a migration runway. For teams committed to these frameworks, partnership tools in the JAX ecosystem offer continued access to Transformers model definitions while using alternative execution engines.

The release of v5.0.0rc0 as the first release candidate on November 30, 2025, marks the beginning of an iterative refinement process. The team plans to release subsequent release candidates to ensure a robust final v5 release, building on five years of growth from 20,000 to over 3 million daily downloads and solidifying Transformers’ position as the ecosystem standard for AI model development.
