
Forge Agent
Forge Agent (by RightNow AI) is a CLI-based swarm intelligence tool that automates the creation of high-performance CUDA and Triton kernels for PyTorch models. Released in January 2026, it uses a multi-agent “swarm” approach to outperform standard compilers like torch.compile.
What It Is
Forge functions as an autonomous optimization engine for AI models. Instead of relying on static compiler rules, it deploys a swarm of AI agents that actively compete to write the most efficient GPU kernels for your specific model architecture. It targets the “last mile” of optimization—converting PyTorch code into highly tuned custom kernels (CUDA/Triton) that maximize hardware utilization on NVIDIA GPUs.
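To make that “last mile” concrete, here is a minimal hand-written sketch of the kind of rewrite involved (an illustration, not output from Forge): in eager PyTorch, a bias-add followed by a ReLU launches two separate elementwise kernels, while a single fused Triton kernel does both in one pass over global memory. The function and parameter names (fused_bias_relu, BLOCK_SIZE) are hypothetical.
```python
import torch
import triton
import triton.language as tl


def eager_bias_relu(x, bias):
    # Eager PyTorch: the add and the ReLU each launch a separate GPU kernel,
    # pushing the intermediate tensor through global memory in between.
    return torch.relu(x + bias)


@triton.jit
def fused_bias_relu_kernel(x_ptr, bias_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # One launch does both ops; each element is read once and written once.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    b = tl.load(bias_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, tl.maximum(x + b, 0.0), mask=mask)


def fused_bias_relu(x, bias):
    # Assumes bias has already been broadcast to x's shape in this simple sketch.
    out = torch.empty_like(x)
    n_elements = x.numel()
    grid = (triton.cdiv(n_elements, 1024),)
    fused_bias_relu_kernel[grid](x, bias, out, n_elements, BLOCK_SIZE=1024)
    return out
```
A tool like Forge presumably targets much heavier fusions (attention, MLP blocks) and tunes tile sizes per GPU, but the shape of the transformation is the same: fewer launches and fewer trips through global memory.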
Key Features
- Swarm Architecture: Deploys 32 parallel “Coder + Judge” agent pairs (64 agents total). The “Coders” generate kernel implementations while the “Judges” rigorously validate them for correctness and speed.
- Nemotron-Powered: Utilizes a specialized version of the NVIDIA Nemotron 3 model, capable of generating 250k tokens/second to explore thousands of optimization strategies in minutes.
- Automated Optimization: Agents autonomously experiment with advanced techniques like Tensor Core usage, memory coalescing, register blocking, and kernel fusion.
- Guaranteed Correctness: A dedicated verification step ensures that the optimized kernels produce 100% numerically identical outputs to the original PyTorch code before benchmarking (a generic sketch of such a check follows this list).
- Inference-Time Scaling: The system spends more compute time (via parallel agents) during the optimization phase to produce a kernel that runs faster during inference.
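The “Judge” half of that Coder + Judge loop can be pictured as an ordinary PyTorch equivalence check. The sketch below is a generic illustration of such a validator, not Forge's actual implementation: it runs the reference function and the candidate kernel on random inputs and rejects the candidate unless every output matches. All names, shapes, and tolerance defaults are assumptions.
```python
import torch


def judge_correctness(reference_fn, candidate_fn, input_shapes, trials=10,
                      rtol=0.0, atol=0.0, device="cuda", dtype=torch.float16):
    """Reject a candidate kernel unless it matches the reference on every trial.

    rtol=atol=0.0 demands bit-identical outputs; relax the tolerances when a
    candidate legitimately reorders floating-point math.
    """
    for _ in range(trials):
        inputs = [torch.randn(shape, device=device, dtype=dtype) for shape in input_shapes]
        expected = reference_fn(*inputs)
        actual = candidate_fn(*inputs)
        if actual.shape != expected.shape:
            return False
        if not torch.allclose(actual, expected, rtol=rtol, atol=atol):
            return False
    return True
```
For example, judge_correctness(eager_bias_relu, fused_bias_relu, [(4096, 4096), (4096, 4096)]) would compare the two implementations from the earlier sketch. Only candidates that pass a check like this are worth benchmarking for speed.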
Use Cases
- LLM Inference Acceleration: Reducing latency for Large Language Models (LLMs) like Llama 3 and Qwen 2.5 by replacing generic layers with fused kernels.
- Production Deployment: Squeezing maximum throughput out of expensive H100/B200 GPUs to reduce serving costs.
- Custom Layer Optimization: Automatically writing CUDA kernels for novel research architectures where standard libraries (like FlashAttention) might not yet apply (see the module-swap sketch after this list).
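In practice, wiring a generated kernel into a model is usually just a module swap. The sketch below reuses the hypothetical fused_bias_relu helper from the earlier Triton example and shows one plausible way to splice a fused Linear + ReLU replacement into an existing nn.Module tree; the class and helper names are illustrative, not part of Forge.
```python
import torch
import torch.nn as nn


class FusedBiasReLU(nn.Module):
    """Drop-in replacement for a Linear layer whose output feeds a ReLU."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.weight = linear.weight
        self.bias = linear.bias

    def forward(self, x):
        # The matmul stays on cuBLAS; the bias add + ReLU epilogue runs in one
        # custom kernel instead of two separate elementwise launches. Expanding
        # the bias to the output shape is wasteful but keeps the sketch simple.
        y = x @ self.weight.t()
        return fused_bias_relu(y, self.bias.expand_as(y).contiguous())


def swap_module(model: nn.Module, target: str, replacement: nn.Module) -> None:
    # Replace a dotted-path submodule in place, e.g. "encoder.layers.0.ffn.fc1".
    parent_path, _, child_name = target.rpartition(".")
    parent = model.get_submodule(parent_path) if parent_path else model
    setattr(parent, child_name, replacement)
```
In a real swap the original activation module would also be replaced with nn.Identity so the ReLU is not applied twice; paths like "encoder.layers.0.ffn.fc1" are placeholders for whatever submodule the optimizer targeted.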
Pros and Cons
- Pros: Achieves extreme speedups (proven 5x faster on Llama 3.1 8B vs torch.compile); automates a task that usually requires rare CUDA expertise; a “risk-free” refund policy ensures you only pay for performance gains; works on any PyTorch model, not just standard transformers.
- Cons: The “swarm” approach is compute-intensive and can be expensive to run (though the cost is usually offset by long-term inference savings); currently specialized for NVIDIA GPUs (CUDA/Triton) only; requires waiting for the “search” phase to complete (minutes to hours), unlike instant compilation.
Pricing
- Usage-Based: Pay per optimization run using credits.
- Performance Guarantee: They offer a full credit refund if the generated kernel does not beat the baseline torch.compile(mode='max-autotune') speed (a benchmarking sketch of that comparison follows this list).
- Free Trial: Includes optimization of one kernel for free.
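That refund condition maps onto an ordinary benchmark. Below is a hedged sketch, using torch.utils.benchmark, of how one might check whether a candidate implementation beats the torch.compile(mode='max-autotune') baseline; the function name, model, and measurement details are assumptions, not Forge's actual acceptance test.
```python
import torch
from torch.utils import benchmark


@torch.no_grad()  # benchmark inference, not autograd
def beats_max_autotune(model, candidate_fn, example_input, runs=100):
    # Baseline: the stock PyTorch 2.x compiler at its most aggressive setting.
    baseline = torch.compile(model, mode="max-autotune")
    baseline(example_input)       # warm up / trigger compilation
    candidate_fn(example_input)   # warm up the candidate's kernels too

    t_baseline = benchmark.Timer(
        stmt="fn(x)", globals={"fn": baseline, "x": example_input}
    ).timeit(runs)
    t_candidate = benchmark.Timer(
        stmt="fn(x)", globals={"fn": candidate_fn, "x": example_input}
    ).timeit(runs)
    return t_candidate.mean < t_baseline.mean
```
torch.utils.benchmark synchronizes CUDA around the timed region, so the comparison reflects actual device time rather than just kernel-launch overhead.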
How Does It Compare?
- torch.compile (PyTorch 2.0): The default baseline. It uses heuristics and template matching (via the Inductor backend) to fuse kernels. Comparison: Forge consistently beats torch.compile (often by 2x-5x) because it uses agents to write custom code from scratch rather than just fusing existing templates.
- NVIDIA TensorRT: NVIDIA’s proprietary high-performance optimizer. Comparison: TensorRT is powerful but can be brittle with custom layers and hard to debug. Forge offers a more flexible, code-first approach that can optimize novel layers that TensorRT might not support out of the box.
- OpenAI Triton: A programming language, not a competitor. Comparison: Forge actually writes Triton code. It acts as an “Expert Triton Developer” that writes the code for you, saving you from learning the complex Triton syntax.
- Modular (MAX Engine): A proprietary inference engine claiming high performance. Comparison: Modular requires migrating to their MAX runtime (and potentially Mojo). Forge works directly within your existing PyTorch/Python environment, offering lower friction for teams who want to stay in the standard PyTorch ecosystem.
Final Thoughts
Forge Agent represents the application of “System 2” thinking to code optimization. By throwing massive compute (32 parallel agents) at the problem of writing efficient kernels, it solves a talent bottleneck: the extreme scarcity of engineers who can write high-performance CUDA/Triton code.
The results, such as the 5x speedup on Llama 3.1, are significant enough to change the unit economics of AI deployment. For companies spending millions on GPU compute, the cost of running a “swarm” to optimize their model once is negligible compared to the recurring savings on inference. It moves us from “compiler heuristics” to “search-based optimization,” which is likely the future standard for high-performance computing.

