Agentic Vision in Gemini

January 29, 2026
"Agentic Vision, a new capability introduced in Gemini 3 Flash, converts image understanding from a static act into an agentic process." (blog.google)

Agentic Vision in Gemini 3 Flash

Agentic Vision, introduced in Gemini 3 Flash in January 2026, transforms how AI models process visual information by replacing single-pass image interpretation with an iterative, code-driven investigation process. Rather than analyzing an image once and generating a response, the model now formulates multi-step plans, executes Python code to manipulate and inspect visuals, and grounds its conclusions in verifiable evidence.

What It Does

Agentic Vision enables Gemini 3 Flash to actively investigate images through a systematic three-phase loop. When presented with a visual query, the model first analyzes the task and formulates a strategic plan for inspection. It then generates and executes Python code to crop, zoom, rotate, annotate, or perform calculations on the image. Finally, it observes the transformed visual data within its context window before delivering a response anchored in concrete evidence rather than probabilistic inference.
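
To make the three-phase cycle concrete, here is a toy sketch of the control flow. The planner and sandbox below are stubs standing in for Gemini's own planning, code generation, and sandboxed execution; none of this reflects Google's actual implementation, only the Think-Act-Observe shape of the loop.

```python
from dataclasses import dataclass

# Toy illustration of the Think-Act-Observe cycle. `think` and `act` are
# stubs: in the real system, Gemini plans the inspection, writes the Python
# itself, and a sandbox executes it against the image.

@dataclass
class Step:
    code: str | None = None    # Python the model wants to run (Act)
    answer: str | None = None  # final answer once evidence is sufficient

def think(observations: list[str]) -> Step:
    # Stub planner: zoom in once, then answer from what was observed.
    if not observations:
        return Step(code="crop(image, box=(120, 40, 360, 200))")
    return Step(answer=f"Serial number read from zoomed crop: {observations[-1]}")

def act(code: str) -> str:
    # Stub sandbox: pretend the generated code ran and returned a result.
    return "SN-48213"

def run_loop(max_steps: int = 5) -> str:
    observations: list[str] = []
    for _ in range(max_steps):
        step = think(observations)           # Think: plan the next inspection
        if step.answer is not None:
            return step.answer
        observations.append(act(step.code))  # Act, then Observe the result
    return "No confident answer within the inspection budget"

print(run_loop())
```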

This approach addresses a fundamental limitation of frontier AI models: they tend to process images in a single static glance and are forced to guess at fine-grained details such as serial numbers, distant text, or subtle patterns. By treating vision as an active investigation rather than passive observation, Agentic Vision delivers a consistent 5-10% quality improvement across vision benchmarks compared to standard processing.

Core Features

Think-Act-Observe Loop: The model employs a systematic reasoning cycle where it analyzes the query and image to create a plan, executes Python code to manipulate or analyze visual data, and observes the results before generating responses. This iterative approach replaces single-pass guessing with evidence-based conclusions.

Implicit Zoom Capability: Gemini 3 Flash autonomously detects when fine-grained details require closer inspection and generates code to crop and analyze specific image regions without explicit prompting. Early adopters like PlanCheckSolver.com, an AI-powered building plan validation platform, achieved a 5% accuracy improvement by leveraging this capability to iteratively inspect high-resolution architectural drawings for code compliance.
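
The snippet below shows the kind of crop-and-zoom code the model can generate for itself using a standard imaging library (Pillow here); the file name and pixel coordinates are illustrative, not taken from any real workload.

```python
from PIL import Image

# Illustrative crop-and-zoom step; file name and coordinates are made up.
img = Image.open("architectural_plan.png")

# Crop a region of interest (left, upper, right, lower in pixels), then
# upscale it so small annotations become legible for a second look.
region = img.crop((1200, 800, 1600, 1100))
zoomed = region.resize((region.width * 3, region.height * 3),
                       Image.Resampling.LANCZOS)
zoomed.save("plan_detail_zoomed.png")
```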

Visual Scratchpad Annotation: Rather than merely describing what it observes, the model can execute code to draw bounding boxes, labels, and annotations directly onto images. This visual grounding ensures reasoning is based on pixel-perfect understanding. For example, when asked to count fingers on a hand, the model draws numbered boxes around each digit to eliminate counting errors.
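
A scratchpad step of this kind might look like the following Pillow sketch, which draws numbered boxes onto a copy of the image so the count is grounded in marked pixels. The box coordinates are hypothetical placeholders, not the output of any real detector.

```python
from PIL import Image, ImageDraw

# Illustrative "visual scratchpad": annotate a copy of the image so counting
# is tied to marked regions. Box coordinates are hypothetical.
img = Image.open("hand.png").convert("RGB")
draw = ImageDraw.Draw(img)

finger_boxes = [
    (40, 30, 80, 160), (90, 20, 130, 150), (140, 25, 180, 155),
    (190, 35, 230, 165), (240, 60, 280, 170),
]
for i, box in enumerate(finger_boxes, start=1):
    draw.rectangle(box, outline="red", width=3)
    draw.text((box[0], box[1] - 14), str(i), fill="red")

img.save("hand_annotated.png")
print(f"Counted {len(finger_boxes)} fingers")
```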

Deterministic Visual Computation: For tasks involving mathematical reasoning with visual data, Agentic Vision offloads calculations to Python execution rather than relying on probabilistic language model inference. This produces verifiable, reproducible results for operations like parsing high-density tables, counting objects, or performing arithmetic based on visual information.
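
As a small example of what such an offloaded computation looks like, the sketch below sums a column of values after they have been read off an image in an earlier step. The rows are placeholder data standing in for extracted table contents; the point is that the arithmetic runs in Python rather than being estimated by the language model.

```python
# Deterministic computation step over values extracted from an image.
# The rows below are placeholders for parsed table contents.
extracted_rows = [
    {"item": "Beam A", "load_kn": 12.4},
    {"item": "Beam B", "load_kn": 9.8},
    {"item": "Beam C", "load_kn": 15.1},
]

total_load = sum(row["load_kn"] for row in extracted_rows)
print(f"Rows parsed: {len(extracted_rows)}, total load: {total_load:.1f} kN")
```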

How Does It Compare?

Agentic Vision represents a distinct approach to multimodal AI, differentiating itself from competitors through its iterative code-execution loop. The competitive landscape includes both proprietary and open-source vision-language models with varying architectural philosophies.

OpenAI GPT-4o and GPT-4 Vision:

Architecture: End-to-end unified multimodal training across text, images, audio, and video using shared attention mechanisms rather than separate vision encoders.

Capabilities: Strong single-pass vision performance with image analysis, optical character recognition, spatial reasoning, and multi-image processing in single requests.

Approach: Processes visual information through the same reasoning architecture used for language, delivering reliable context-aware perception without iterative refinement loops.

Pricing: GPT-4o costs $2.50 per million input tokens and $10 per million output tokens as of 2026.

Key Difference: Designed for unified multimodal understanding in single passes rather than iterative code-driven investigation. Does not include built-in visual manipulation through code execution.

Anthropic Claude 3.5 Sonnet and Claude 4:

Vision Launch: Vision capabilities debuted with the Claude 3 family in March 2024 and were strengthened in Claude 3.5 Sonnet, released June 2024.

Strengths: Excels at chart and graph interpretation, optical character recognition from low-quality scans, contextual image understanding, and analyzing documents with embedded visual elements.

Context Window: 200,000 tokens enabling extensive document processing.

Philosophy: Anchors visual perception within reasoning chains, interpreting content coherently across modalities rather than processing them independently. Emphasizes verifiable reasoning integrated with visual understanding.

Limitation: Does not employ code-driven iterative visual inspection loops. Focuses on long-context reasoning with perception rather than programmatic image manipulation.

Qwen2-VL by Alibaba Cloud:

Architecture: Implements Naive Dynamic Resolution for processing varying image sizes and Multimodal Rotary Position Embedding for aligning positional data across text, images, and video.

Context: Supports up to 128,000 tokens for extended document and video processing.

Video Understanding: Capable of analyzing videos exceeding 20 minutes in length.

Multilingual: Supports 29+ languages with strong multilingual OCR capabilities.

Performance: The 72-billion-parameter variant outperforms GPT-4 Vision on document-focused benchmarks including DocVQA, InfoVQA, and CC-OCR.

Use Cases: Visual agent workflows for mobile and robotic device operation, medical imaging analysis, business intelligence with combined text-visual datasets.

Key Difference: Emphasizes dynamic resolution handling and extended context for complex documents and videos. While it supports tool use and agent capabilities, it does not implement an explicit code-execution loop for iterative visual investigation like Agentic Vision.

Pixtral 12B by Mistral AI:

Release: September 2024 under Apache 2.0 open-source license.

Architecture: 400-million-parameter vision encoder trained from scratch paired with 12-billion-parameter multimodal decoder based on Mistral Nemo.

Resolution Handling: Processes images at native resolution and aspect ratio, supporting variable image sizes without tiling or cropping.

Context: 128,000-token window supporting multiple images in long conversations.

Performance: Achieves 52.5% on MMMU reasoning benchmark, substantially outperforming Qwen2-VL 7B, LLaVA-OneVision 7B, and Phi-3.5 Vision on instruction-following tasks.

Use Cases: Chart analysis, code generation from images, document question answering, multi-image inference for medical imaging or sequential visual analysis.

Key Difference: Open-source model with strong native-resolution processing and instruction following, but employs static single-pass vision rather than iterative code-driven inspection.

LLaVA by Microsoft and University of Wisconsin-Madison:

Type: Open-source vision-language model achieving 85.1% relative score compared to GPT-4 on visual understanding tasks.

Architecture: Combines a pre-trained CLIP vision encoder with the Vicuna language model through a projection matrix, making it one of the first open-source, end-to-end trained multimodal models to approach GPT-4-level visual chat capability.

Performance: 92.53% accuracy on Science QA benchmarks.

Strengths: Natural conversation about visual content, educational content analysis, accessibility applications, visual question answering.

Limitation: Does not implement agentic code execution loops. Research indicates that its visual inputs are mapped to a separate space from textual inputs rather than to a unified semantic space.

Strengths and Limitations

Strengths: Agentic Vision addresses a fundamental weakness in frontier AI vision systems by enabling active investigation rather than passive observation. The iterative code-execution approach produces verifiable, deterministic results for visual computation tasks where probabilistic inference often fails. The 5-10% quality improvement across vision benchmarks demonstrates measurable gains in accuracy. Integration with Gemini 3 Flash provides frontier-class performance at 3x the speed of Gemini 2.5 Pro while costing significantly less. The implicit zoom capability reduces the need for manual prompt engineering, making the system more accessible to developers without deep AI expertise.

Limitations: Currently available only in Gemini 3 Flash, though expansion to other Gemini model sizes is planned. While zoom functionality operates implicitly, other manipulations like rotation or complex visual mathematics still require explicit prompting, though Google plans to make these fully implicit in future releases. The iterative code-execution approach adds latency compared to single-pass vision systems, which may be unsuitable for latency-critical applications. Effectiveness depends on the quality of generated Python code and the types of visual tasks being performed. As a developer-oriented capability requiring API integration, it is not a consumer-facing feature accessible through simple interfaces. Results depend on whether tasks genuinely benefit from iterative inspection versus comprehensive single-pass analysis.

Best For

Agentic Vision is particularly well-suited for applications requiring fine-grained visual inspection, evidence-based reasoning, or mathematical operations grounded in visual data. Compliance and quality control systems benefit significantly, as demonstrated by building plan validation platforms achieving measurable accuracy improvements through iterative inspection of architectural drawings. Document analysis workflows involving high-density tables, forms, or technical diagrams can leverage the visual computation capabilities to extract and verify data programmatically. Scientific and medical imaging applications that require counting objects, measuring dimensions, or detecting subtle anomalies gain verifiable precision through code-driven analysis. Developers building agentic systems that need to operate autonomously on visual inputs with minimal hallucination risk will find the evidence-grounding approach essential for reliability.

Pricing and Availability

Agentic Vision is available through Google AI Studio and Vertex AI, and is rolling out to the Gemini app for users who select Thinking mode from the model dropdown. Developers can access the capability by enabling Code Execution under Tools in the Google AI Studio Playground or through the Gemini API in Vertex AI.
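
For API access, a minimal sketch using the google-genai Python SDK's code-execution tool is shown below. The model identifier follows this article and may differ in practice, and the image file and prompt are illustrative; consult the official Gemini API documentation for the exact setup.

```python
from google import genai
from google.genai import types

# Minimal sketch assuming the google-genai Python SDK and its code-execution
# tool. The model name is taken from this article and may differ; the image
# path and prompt are illustrative.
client = genai.Client()  # reads the API key from the environment

with open("architectural_plan.png", "rb") as f:
    image_part = types.Part.from_bytes(data=f.read(), mime_type="image/png")

response = client.models.generate_content(
    model="gemini-3-flash",  # illustrative identifier from the article
    contents=[image_part, "Zoom in and read the drawing's revision number."],
    config=types.GenerateContentConfig(
        tools=[types.Tool(code_execution=types.ToolCodeExecution())],
    ),
)
print(response.text)
```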

Gemini 3 Flash pricing is $0.50 per million input tokens and $3.00 per million output tokens, with audio input billed at $1.00 per million tokens. The model uses approximately 30% fewer tokens on average compared to Gemini 2.5 Pro for typical tasks while delivering superior performance, making it highly cost-effective for production deployments.
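
A quick back-of-the-envelope check against these list prices illustrates the per-request cost; the token counts below are illustrative assumptions, not measurements.

```python
# Cost estimate using the prices quoted above:
# $0.50 per 1M input tokens, $3.00 per 1M output tokens.
INPUT_PRICE_PER_M = 0.50
OUTPUT_PRICE_PER_M = 3.00

input_tokens = 4_000    # illustrative: a high-resolution image plus a prompt
output_tokens = 1_200   # illustrative: plan, generated code, and final answer

cost = (input_tokens / 1e6) * INPUT_PRICE_PER_M \
     + (output_tokens / 1e6) * OUTPUT_PRICE_PER_M
print(f"Estimated cost per request: ${cost:.4f}")
```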

Technical Context

Agentic Vision builds on Google’s broader push toward agentic AI systems capable of multi-step reasoning, tool use, and autonomous action under supervision. Gemini 2.0 Flash, introduced in December 2024, pioneered native tool integration including Google Search, code execution, and third-party function calling. Agentic Vision extends this paradigm specifically to the visual domain.

The underlying code execution capability was first added to the Gemini API in June 2024, with multi-tool use enabling simultaneous code execution and search launched in May 2025. This infrastructure provides the foundation for Agentic Vision’s Think-Act-Observe loop.

Gemini 3 Flash, released December 16, 2025, delivers frontier-class performance on benchmarks including 90.4% on GPQA Diamond for PhD-level reasoning, 81.2% on MMMU Pro for multimodal understanding, and 78% on SWE-bench Verified for coding agent capabilities. The model operates at 3x the speed of Gemini 2.5 Pro while outperforming it across multiple benchmarks, positioning it as a high-performance platform for agentic workflows.

Industry Implications

Agentic Vision reflects a broader shift in AI development from passive pattern recognition toward active investigation and tool use. As AI systems increasingly function as autonomous agents rather than response generators, the ability to iteratively refine understanding through code execution and observation becomes strategically important.

For industries requiring visual compliance verification, quality control, or detailed document analysis, the evidence-grounding approach reduces hallucination risk and provides audit trails through executable code. Development teams building vision-powered automation can leverage the capability to handle edge cases and fine-grained details that single-pass systems miss.

The architectural choice to combine visual reasoning with deterministic code execution rather than purely neural approaches suggests a hybrid future where language models orchestrate programmatic tools rather than attempting to learn all capabilities through parameters alone.

Final Thoughts

Agentic Vision represents a meaningful evolution in how AI models process visual information, shifting from single-pass interpretation to iterative, evidence-based investigation. For applications where accuracy, verifiability, and fine-grained detail matter more than raw speed, the code-driven approach delivers measurable improvements over traditional vision systems.

The 5-10% quality gain across benchmarks may seem modest, but in compliance-critical domains like building plan validation, medical imaging, or quality control, this improvement translates directly to reduced error rates and fewer false positives. The ability to ground reasoning in pixel-perfect visual evidence rather than probabilistic inference addresses a core limitation of frontier models.

However, organizations should evaluate whether their use cases genuinely require iterative inspection versus comprehensive single-pass analysis. Applications needing real-time vision at minimal latency may find the code-execution loop adds unacceptable overhead. Tasks involving holistic scene understanding rather than fine-detail verification may not benefit from the iterative approach.

For developers building agentic systems that operate autonomously with visual inputs, Agentic Vision provides a framework for reducing hallucinations and increasing reliability. The ability to programmatically manipulate images, perform verifiable computations, and ground conclusions in visual evidence creates a foundation for trustworthy automation in high-stakes environments.

As Google expands the capability to additional model sizes and makes more behaviors implicit, the distinction between iterative and single-pass vision may blur. The broader question is whether the future of multimodal AI lies in unified end-to-end architectures that learn perception and reasoning jointly, or hybrid systems that orchestrate specialized tools through code execution. Agentic Vision represents a bet on the latter approach, and early results suggest the strategy delivers measurable value for specific task categories.
