
Forge CLI: The Swarm-Based Kernel Optimizer for NVIDIA GPUs
Forge CLI is a high-performance command-line system designed to bridge the gap between high-level PyTorch/HuggingFace models and low-level GPU hardware optimization. Launched in January 2026, it addresses the performance ceiling often hit by standard deep learning compilers. By utilizing a swarm-based approach, Forge automatically generates hand-tuned CUDA and Triton kernels for every layer of a neural network, delivering inference speeds that are significantly superior to standard automated tuning methods.
At the core of the system is a parallel swarm of 32 Coder+Judge agent pairs, in which “Coder” agents generate various optimization strategies—such as advanced tensor core utilization and memory coalescing—while “Judge” agents rigorously validate the output for correctness. This architecture ensures that extreme speed gains do not come at the cost of numerical stability. The system is specifically optimized for cutting-edge datacenter hardware, including NVIDIA H100, H200, and the B200 Blackwell series, making it an essential tool for ML Engineers managing large-scale inference workloads.
Key Features
- HuggingFace-Native Optimization: Input any HuggingFace model ID or local PyTorch file to instantly begin the multi-layer kernel generation process.
- Swarm Intelligence Architecture: Employs 32 parallel Coder+Judge agent pairs that compete in real-time to find the absolute fastest kernel implementation for your specific hardware.
- Inference-Time Scaling Engine: Powered by an optimized NVIDIA Nemotron 3 Nano 30B model generating 250,000 tokens per second to explore the vast optimization search space in minutes.
- Extreme Speed Benchmarks: Achieves up to 5x faster inference performance compared to PyTorch’s native torch.compile(mode='max-autotune').
- Verified Numeric Correctness: Maintains a 97.6% correctness rate verified through automated “Judge” agent cross-validation and hardware-level unit testing.
- Native Triton & CUDA Output: Generates clean, readable, and highly optimized code in both CUDA and Triton, allowing for manual inspection or further customization if needed.
- Risk-Free Performance Policy: Provides a full credit refund if the Forge swarm is unable to beat the performance of torch.compile(mode='max-autotune') for your specific model architecture.
- Broad GPU Ecosystem Support: Full compatibility with consumer RTX cards and enterprise-grade hardware including H100, H200, and the latest B200 GPUs.
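The “Judge” validation described above boils down to a tolerance comparison between a candidate kernel’s output and the reference model’s output. Here is a minimal, stdlib-only sketch of that idea; the function name and tolerances are illustrative assumptions, not Forge’s actual API:

```python
import math

def judge_outputs(reference, candidate, rtol=1e-3, atol=1e-5):
    """Return True if every candidate value matches the reference
    within relative/absolute tolerance (numpy.allclose-style check)."""
    if len(reference) != len(candidate):
        return False
    return all(
        math.isclose(c, r, rel_tol=rtol, abs_tol=atol)
        for r, c in zip(reference, candidate)
    )

# A kernel that is fast but numerically wrong must be rejected:
ref = [0.1, 0.2, 0.3]
ok  = [0.1000001, 0.2, 0.2999999]   # within tolerance -> accepted
bad = [0.1, 0.25, 0.3]              # 0.25 vs 0.2 -> rejected
```

In a real pipeline the comparison would run over full tensors on the GPU, but the accept/reject logic is the same: speed never overrides numerics.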
How It Works
The Forge workflow begins at the command line. When a user provides a model ID, the Forge system initializes a swarm of 64 total agents (32 pairs). The Coder agents use the high-throughput Nemotron 3 Nano model to rapidly draft different kernel configurations, each targeting specific bottlenecks such as unfused operations or memory bandwidth limits. The Judge agents then execute these kernels in a virtualized GPU environment to verify that the mathematical output matches the original model. The fastest verified kernel is selected for each layer. The final output is a set of optimized kernels that can be directly integrated into the user’s production inference pipeline.
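The workflow above is essentially a search-and-verify loop: propose many candidates, discard any that fail validation, and keep the fastest survivor. The toy sketch below captures that selection logic in plain Python; all names are illustrative stand-ins, not Forge internals:

```python
def run_swarm(coders, judge, reference_output):
    """Search-and-verify: each 'coder' returns a candidate kernel
    (a callable) plus its measured latency. The 'judge' keeps only
    candidates whose output matches the reference; the fastest
    verified candidate wins."""
    best = None  # (latency_ms, kernel_fn)
    for coder in coders:
        kernel_fn, latency_ms = coder()
        if not judge(kernel_fn(), reference_output):
            continue  # wrong numerics: discard, no matter how fast
        if best is None or latency_ms < best[0]:
            best = (latency_ms, kernel_fn)
    return best

# Simulated coders: doubling [1, 2, 3] is "correct" here.
reference = [2.0, 4.0, 6.0]

def make_coder(scale, latency):
    return lambda: ((lambda: [scale * x for x in [1.0, 2.0, 3.0]]), latency)

coders = [
    make_coder(2.0, 5.0),   # correct but slow
    make_coder(2.0, 1.2),   # correct and fastest verified -> wins
    make_coder(1.9, 0.3),   # fastest overall but numerically wrong
]

best = run_swarm(coders, judge=lambda a, b: a == b,
                 reference_output=reference)
```

The key property, mirrored from the article: the 0.3 ms candidate loses despite being fastest, because the Judge rejects its output.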
Use Cases
- Large-Scale Production Inference: Enterprises running LLMs or generative models at scale can use Forge to reduce total GPU hours and operational costs by maximizing per-chip throughput.
- Custom Transformer Optimization: Researchers developing novel transformer variants can ensure their custom layers are as efficient as possible without manually writing complex CUDA code.
- Hardware-Specific Fine-Tuning: Optimize the same HuggingFace model for different environments (e.g., an H100 in the cloud and an RTX 4090 locally) to get the best possible performance on each.
- Legacy Model Performance Boosting: Breathe new life into older PyTorch models by applying modern swarm-based optimizations that weren’t available when the models were originally released.
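Hardware-specific tuning, as in the use case above, amounts to keeping a separate kernel variant per (operation, device) pair and falling back to a portable implementation elsewhere. A toy registry sketch, with entirely hypothetical kernel names:

```python
# Toy per-device kernel registry: the same layer maps to a different
# implementation depending on the detected GPU (names illustrative).
KERNEL_REGISTRY = {
    ("matmul", "H100"): "matmul_wgmma_fp8",      # Hopper tensor-core path
    ("matmul", "RTX 4090"): "matmul_mma_fp16",   # consumer Ada path
}

def select_kernel(op, device, fallback="matmul_reference"):
    """Pick the tuned kernel for (op, device); fall back to a
    portable reference implementation otherwise."""
    return KERNEL_REGISTRY.get((op, device), fallback)
```

The same model thus resolves to different generated kernels in the cloud versus on a local workstation, without changing user code.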
Pros and Cons
- Pros: Delivers massive performance gains (up to 5x) that standard compilers miss. Highly automated “one-click” experience for HuggingFace users. Transparent refund policy if performance targets aren’t met.
- Cons: High technical barrier for entry; primarily targeted at ML Engineers rather than general developers. Performance is strictly tied to NVIDIA hardware ecosystems.
Pricing
- Starter Plan: Free. Includes limited access to the basic swarm for small models and standard profiling tools.
- Pro Plan: $49/month. Unlocks full swarm access for any HuggingFace model, advanced kernel database retrieval, and access to H100-optimized implementation strategies.
- Enterprise Plan: $200/month. Designed for teams requiring unlimited parallel agent credits, B200 Blackwell support, custom kernel fusion logic, and dedicated priority support.
How Does It Compare?
- PyTorch (torch.compile): The industry standard. While excellent for general use, torch.compile focuses on broad compatibility. Forge identifies and exploits specific hardware-level optimizations that standard compilers often overlook for complex architectures.
- NVIDIA TensorRT: A powerful optimization SDK. TensorRT is highly effective but often requires complex manual setup and quantization steps. Forge simplifies this by using AI agents to “write” the optimizations for you in minutes.
- Triton: A language for writing fast GPU kernels. Forge is effectively an automated “Triton expert” that generates the code for you, saving weeks of manual kernel development time.
- TVM (Apache): An open-source machine learning compiler. TVM is highly portable but lacks the “inference-time scaling” logic of Forge, which allows for deeper, AI-driven exploration of the optimization space.
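A claim like “up to 5x faster than torch.compile(mode='max-autotune')” is, concretely, a ratio of measured latencies. The stdlib sketch below shows how such a speedup is typically computed; the two workloads are stand-ins for a baseline and an optimized inference call, not real kernels:

```python
import time
import statistics

def measure_ms(fn, repeats=5):
    """Median wall-clock latency of fn() in milliseconds."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append((time.perf_counter() - t0) * 1e3)
    return statistics.median(times)

def speedup(baseline_fn, optimized_fn):
    """Speedup of optimized over baseline ('5x' would return ~5.0)."""
    return measure_ms(baseline_fn) / measure_ms(optimized_fn)

# Stand-in workloads (a real benchmark would time model inference):
baseline = lambda: sum(i * i for i in range(200_000))
optimized = lambda: sum(i * i for i in range(20_000))
```

Taking the median over several repeats is a common guard against warm-up and scheduler noise; GPU benchmarks additionally need device synchronization before each timestamp.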
Final Thoughts
Forge CLI is a pioneer in the “agentic compiler” space of 2026. By treating kernel optimization as a search-and-verify problem solvable by swarm intelligence, it removes one of the biggest bottlenecks in the AI development lifecycle. The use of a specialized 30B Nemotron model ensures that the optimization search is both fast and deep, often finding unique implementations that human engineers would take weeks to discover. For teams looking to squeeze every drop of performance out of their NVIDIA hardware, Forge is a highly competitive and virtually risk-free investment.

