
Overview
On November 18-19, 2025, Meta released two distinct but complementary additions to its Segment Anything Collection: SAM 3 for advanced image and video segmentation, and SAM 3D for single-image 3D reconstruction. While these models share the “Segment Anything” branding and both advance computer vision capabilities, they serve fundamentally different purposes and operate independently.
SAM 3 represents the third generation of Meta’s segmentation foundation model, introducing highly requested text-based prompting capabilities alongside the visual prompts (clicks, boxes, masks) from previous versions. For the first time, users can segment objects by typing simple noun phrases like “yellow school bus” or “person wearing glasses,” with the model detecting and segmenting all matching instances simultaneously. This “Promptable Concept Segmentation” capability transforms SAM from a geometric tool into an open-vocabulary vision system.
SAM 3D comprises two specialized models released concurrently: SAM 3D Objects for reconstructing everyday items and scenes, and SAM 3D Body for human pose and shape estimation. Both models generate detailed 3D representations from single 2D images, addressing the longstanding challenge of inferring three-dimensional structure without multi-view photography or depth sensors. These models are accessible through the Segment Anything Playground alongside downloadable weights and research code.
Key Features
SAM 3 Features
Promptable Concept Segmentation: The defining innovation in SAM 3 is its ability to accept text prompts—short noun phrases describing visual concepts—and detect every instance of that concept throughout an image or video. Unlike SAM 1 and SAM 2, which segmented single objects per prompt, SAM 3 performs open-vocabulary instance detection, returning unique masks and IDs for all matching objects simultaneously.
Text and Exemplar Prompting: Users can guide segmentation through simple noun phrases (“shipping container,” “traffic cone”) or by providing example images of target objects. The model learns the visual concept from examples and finds similar objects across the scene.
Visual Prompt Compatibility: SAM 3 maintains full backward compatibility with SAM 2’s interactive visual prompts including point clicks, bounding boxes, and mask refinements. This allows hybrid workflows combining concept-level detection with interactive fine-tuning.
Video Tracking: Building on SAM 2’s temporal understanding, SAM 3 tracks detected concepts across video frames using memory-based masklet tracking. Users can identify objects by text description and follow them throughout video sequences.
Unified Architecture: A single SAM 3 model handles both Promptable Concept Segmentation and Promptable Visual Segmentation tasks through shared backbone infrastructure, eliminating the need for separate specialized models.
Meta Perception Encoder Integration: SAM 3 incorporates Meta’s Perception Encoder released in April 2025, achieving significant performance improvements over previous encoder architectures for visual feature extraction.
SA-Co Benchmark: Meta released the Segment Anything with Concepts benchmark dataset for evaluating open-vocabulary segmentation across images and videos, advancing reproducible research in this domain.
SAM 3D Features
SAM 3D Objects – Object and Scene Reconstruction: This model generates textured 3D meshes of masked objects from single images, handling complex real-world photography with occlusions, varied lighting, and cluttered backgrounds. The system produces independent, posed 3D models suitable for manipulation, repositioning into scenes, and export to 3D software.
SAM 3D Body – Human Pose and Shape Estimation: Specialized for human reconstruction, this model estimates 3D body shape, skeletal structure, and pose from single photographs. It introduces the Meta Momentum Human Rig format that separately represents skeletal structure and soft tissue, enabling interpretable body models compatible with animation pipelines.
Prompt-Based Refinement: Both SAM 3D models support interactive prompting through 2D keypoints and masks, allowing users to guide reconstruction in challenging scenarios with ambiguous poses, occlusions, or complex scenes.
Multi-Stage Training Pipeline: SAM 3D Objects employs large-scale synthetic pre-training followed by real-world fine-tuning with human-in-the-loop annotation, achieving generalization to diverse photography conditions while maintaining geometric accuracy.
Real-Time Performance: Engineering optimizations including diffusion shortcuts enable SAM 3D Objects to generate textured reconstructions in seconds, supporting near real-time applications in robotics and interactive editing.
Developer Ecosystem: Both SAM 3 and SAM 3D provide open-source model weights, research code on GitHub, API access through the Segment Anything Playground, and comprehensive documentation for integration into custom applications.
How It Works
SAM 3 Workflow
SAM 3 operates through a unified architecture combining detection, segmentation, and tracking components. Users begin by uploading images or videos to the Segment Anything Playground or integrating SAM 3 through its Python API in local deployments.
For text-based segmentation, users enter a short noun phrase describing the target concept. SAM 3’s detector component, based on the DETR transformer architecture, processes the text embedding alongside image features from the Meta Perception Encoder. The model predicts bounding boxes and segmentation masks for all instances matching the concept description, assigning unique IDs to each detection.
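The exact return format of the released API is not reproduced here, but the output contract described above (a unique ID and mask per detected instance) can be illustrated with a small, self-contained sketch. The placeholder masks below stand in for real model output; boxes and pixel areas are derived with plain NumPy, as you might do when filtering detections or exporting annotations.

```python
import numpy as np

# Pretend output: three instance masks returned for the prompt "traffic cone".
instance_masks = [np.zeros((480, 640), dtype=bool) for _ in range(3)]
instance_masks[0][100:160, 200:240] = True
instance_masks[1][300:380, 50:110] = True
instance_masks[2][220:260, 500:540] = True

detections = []
for obj_id, mask in enumerate(instance_masks):
    ys, xs = np.nonzero(mask)  # pixel coordinates covered by this instance
    detections.append({
        "id": obj_id,  # unique ID per detected instance
        "box": [int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())],
        "area_px": int(mask.sum()),
    })

for det in detections:
    print(det)
```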
In video applications, SAM 3 extends detection across temporal sequences using memory-bank tracking inherited from SAM 2. The model maintains object identities frame-to-frame, generating consistent masklets (spatio-temporal segments) that follow objects through motion, occlusion, and appearance changes.
For visual prompting workflows, users interact directly with images through point clicks, box drawings, or rough mask sketches. SAM 3 refines these prompts into precise segmentation masks, supporting iterative refinement through additional prompts that add or remove regions.
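Earlier SAM releases expressed interactive prompts as point coordinates with include/exclude labels plus an optional box. The sketch below assumes SAM 3 follows the same convention; refine_mask is a hypothetical placeholder, not the actual API call.

```python
import numpy as np

# Point prompts follow the convention from earlier SAM releases:
# label 1 marks a pixel to include, label 0 marks a pixel to exclude.
point_coords = np.array([[425, 600], [510, 340]])  # (x, y) clicks in pixels
point_labels = np.array([1, 0])                    # include first click, exclude second
box = np.array([380, 250, 700, 880])               # optional rough box: x0, y0, x1, y1

def refine_mask(coords: np.ndarray, labels: np.ndarray, box: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the model's prompt-to-mask refinement step."""
    return np.zeros((1024, 1024), dtype=bool)      # placeholder output mask

mask = refine_mask(point_coords, point_labels, box)
print(mask.shape, mask.dtype)
```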
The model architecture scales linearly with the number of tracked objects—each receives independent processing using shared per-frame embeddings. This design prioritizes robustness and simplicity over inter-object reasoning, though future versions may incorporate shared contextual information.
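The scaling behavior can be made concrete with a toy loop (not Meta's implementation): the frame embedding is computed once per frame and shared, while each tracked object updates its own memory independently, so per-frame cost grows linearly with the number of objects.

```python
import numpy as np

def encode_frame(frame: np.ndarray) -> np.ndarray:
    """Placeholder for the shared per-frame image embedding."""
    return frame.mean(axis=(0, 1))           # toy 'embedding' of shape (3,)

def update_masklet(obj_memory: dict, frame_embedding: np.ndarray) -> dict:
    """Placeholder per-object tracking step that reuses the shared embedding."""
    obj_memory["frames_seen"] = obj_memory.get("frames_seen", 0) + 1
    return obj_memory

frames = [np.random.rand(32, 32, 3) for _ in range(4)]   # toy video
tracked_objects = [{"id": i} for i in range(3)]          # three independent masklets

for frame in frames:
    embedding = encode_frame(frame)          # computed once, shared by all objects
    for obj in tracked_objects:              # cost grows linearly with object count
        update_masklet(obj, embedding)

print(tracked_objects)
```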
SAM 3D Workflow
SAM 3D Objects operates through a multi-stage pipeline designed for real-world photography robustness. Users provide a single 2D image with the target object either automatically detected or manually masked using SAM 3 or traditional selection tools.
The model’s image encoder captures high-resolution geometric and texture details from the masked region. A transformer decoder predicts 3D mesh geometry, camera pose, and texture maps, inferring occluded surfaces through learned priors about object structure and physical plausibility.
For disambiguation in challenging cases, users can provide 2D keypoint prompts indicating specific surface features or structural elements. The model incorporates these hints to refine reconstruction accuracy.
SAM 3D Body follows a similar architecture adapted for human-specific reconstruction. The multi-input encoder captures full-body context alongside detailed body part features. The mesh decoder predicts Meta Momentum Human Rig parameters representing skeletal pose and soft tissue shape separately.
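The Meta Momentum Human Rig itself is defined by Meta's release; the point worth illustrating is that skeletal pose and soft-tissue shape live in separate parameter sets. The dataclasses below are illustrative only, and the field names and dimensions are assumptions rather than the published format.

```python
from dataclasses import dataclass
import numpy as np

# Illustrative only: names and sizes are assumptions, not the published
# Meta Momentum Human Rig specification.
@dataclass
class SkeletonParams:
    joint_rotations: np.ndarray   # per-joint rotations, e.g. (num_joints, 3)
    root_translation: np.ndarray  # global position of the root joint

@dataclass
class SoftTissueParams:
    shape_coeffs: np.ndarray      # low-dimensional body-shape coefficients

@dataclass
class HumanRigEstimate:
    skeleton: SkeletonParams       # drives animation and retargeting
    soft_tissue: SoftTissueParams  # drives surface shape

estimate = HumanRigEstimate(
    skeleton=SkeletonParams(np.zeros((52, 3)), np.zeros(3)),  # joint count is arbitrary here
    soft_tissue=SoftTissueParams(np.zeros(10)),
)
print(estimate.skeleton.joint_rotations.shape, estimate.soft_tissue.shape_coeffs.shape)
```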
Both models generate outputs in seconds on modern GPUs. SAM 3D Objects exports textured meshes compatible with standard 3D formats, while SAM 3D Body outputs rigged character models suitable for animation software.
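As a minimal example of moving a reconstruction into standard tooling, the sketch below writes a placeholder mesh to disk with the trimesh library. The vertex and face arrays are toy geometry standing in for SAM 3D output, and trimesh is used here as a generic exporter, not as part of the SAM 3D codebase.

```python
import numpy as np
import trimesh

# Toy geometry standing in for a SAM 3D Objects reconstruction:
# a single quad made of two triangles.
vertices = np.array([[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]], dtype=float)
faces = np.array([[0, 1, 2], [0, 2, 3]])

mesh = trimesh.Trimesh(vertices=vertices, faces=faces)
mesh.export("reconstruction.obj")   # trimesh also writes .glb, .ply, .stl, etc.
```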
The training methodology combines synthetic data pre-training on large-scale 3D asset datasets with real-world fine-tuning using human-annotated preferences. For SAM 3D Objects, Meta built a data engine where annotators rate model-generated meshes, routing difficult cases to expert 3D artists. This human-in-the-loop approach annotated nearly 1 million distinct images with approximately 3.14 million model-generated meshes.
Use Cases
SAM 3 Applications
Dataset Annotation and Labeling: Computer vision researchers use SAM 3 to dramatically accelerate dataset creation. Text prompts like “pedestrian” or “bicycle” can automatically segment all relevant instances across thousands of images, with Meta reporting 36% faster annotation than human-only pipelines for positive prompts and 5x speedups for negative prompts.
Video Content Editing: Film and content creators leverage SAM 3’s tracking capabilities to isolate and manipulate specific objects or people throughout video sequences. Typing “person in red jacket” segments that individual across all frames for targeted color grading, effects, or removal.
Robotics Perception: Autonomous systems use SAM 3 as a perception module for identifying and tracking objects in first-person camera feeds. Meta demonstrated integration with Aria Gen 2 research glasses for egocentric vision tasks in dynamic environments.
Multimodal AI Reasoning: Large language models can use SAM 3 as a vision tool through the SAM 3 Agent framework. The LLM proposes noun phrase queries to SAM 3 and analyzes the returned masks iteratively until it can answer complex queries such as “What object in the picture is used for controlling and guiding a horse?” (a simplified sketch of this loop appears after this list).
Instagram and Meta Product Integration: SAM 3 powers new creative features in Edits, Instagram’s video creation app, and in Vibes on the Meta AI app, bringing professional-grade segmentation capabilities directly to mobile creators for effects, cutouts, and compositing.
Scientific and Medical Imaging: Researchers apply SAM 3 to microscopy, satellite imagery, and medical scans where manual annotation proves prohibitively time-consuming. Text-based concept detection enables systematic analysis across large image collections.
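As noted in the multimodal reasoning item above, the agent pattern is essentially a propose, segment, check loop. The sketch below uses stub functions for both the LLM and the segmentation call, so it illustrates the control flow only and does not reflect the actual SAM 3 Agent code.

```python
def propose_noun_phrase(question: str, attempt: int) -> str:
    """Stub LLM call: suggest a short noun phrase to segment for this question."""
    candidates = ["bridle", "rope", "rein"]
    return candidates[min(attempt, len(candidates) - 1)]

def segment_concept(noun_phrase: str) -> list:
    """Stub segmentation call returning zero or more instance detections."""
    return [{"concept": noun_phrase, "score": 0.9}] if noun_phrase == "rein" else []

question = "What object in the picture is used for controlling and guiding a horse?"
answer = None
for attempt in range(3):                      # bounded propose-segment-check loop
    phrase = propose_noun_phrase(question, attempt)
    detections = segment_concept(phrase)
    if detections:                            # a real agent would inspect the masks here
        answer = {"phrase": phrase, "detections": detections}
        break

print(answer)
```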
SAM 3D Applications
3D Asset Creation for Games and VR: Game developers and VR content creators use SAM 3D Objects to rapidly generate 3D models from reference photographs, dramatically reducing modeling time compared to manual sculpting or traditional photogrammetry requiring multiple angles.
E-Commerce 3D Product Visualization: Online retailers reconstruct products from catalog photography into interactive 3D viewers, enabling customers to examine items from any angle without requiring specialized 3D capture equipment.
Robotics Manipulation Planning: Autonomous robots use SAM 3D Objects for on-the-fly 3D perception of manipulable objects from single camera views, planning grasps and interactions based on inferred geometry.
Virtual Try-On and Fashion: SAM 3D Body enables realistic virtual clothing try-on by reconstructing customer body shapes from photos, allowing accurate fit prediction and garment visualization before purchase.
Motion Capture and Animation: Animators use SAM 3D Body to extract character rigs from reference photos or video frames, accelerating character setup for animation without requiring motion capture suits or multi-camera studios.
Augmented Reality Scene Understanding: AR applications leverage SAM 3D for reconstructing real-world objects and people, enabling realistic digital-physical interactions and occlusion-aware content placement.
Medical Planning and Prosthetics: Healthcare providers use SAM 3D Body for patient-specific modeling from photographs, supporting prosthetic design, surgical planning, and physical therapy assessment without expensive 3D scanning equipment.
Pros and Cons
SAM 3 Advantages
Open-Vocabulary Flexibility: The shift from fixed visual prompts to natural language concept descriptions dramatically expands accessibility for non-technical users and enables complex query formulation previously requiring computer vision expertise.
Unified Model Architecture: A single SAM 3 checkpoint handles image segmentation, video tracking, interactive refinement, and concept detection, eliminating the need to manage multiple specialized models for different tasks.
State-of-the-Art Benchmarks: SAM 3 doubles concept segmentation accuracy (cgF1 scores) compared to existing models on the SA-Co benchmark, with users preferring SAM 3 outputs over the strongest baseline (OWLv2) approximately 3:1 in human studies.
Strong Research Foundation: Open-source release of model weights, training code, and the SA-Co benchmark dataset enables reproducible research and community innovation building on Meta’s foundation.
Production Deployment: Integration into Instagram products demonstrates real-world viability beyond research prototypes, with millions of users benefiting from SAM 3-powered creative tools.
SAM 3 Disadvantages
Computational Scaling: Video tracking cost scales linearly with object count, as each tracked concept receives independent processing. Complex scenes with dozens of objects may encounter performance bottlenecks or require hardware acceleration.
Short Phrase Limitation: While SAM 3 handles simple noun phrases well, complex queries requiring reasoning (“people sitting down but not holding a gift box”) must be decomposed by a multimodal LLM, as in the SAM 3 Agent setup, which adds latency and complexity.
Research-Stage Tooling: As a recently released research model, production deployment tooling, optimizations, and best practices are still evolving compared to mature computer vision platforms with years of enterprise refinement.
GPU Requirements: Achieving real-time or near-real-time performance demands server-scale GPUs, limiting deployment in resource-constrained edge environments or consumer devices without optimization.
SAM 3D Advantages
Single-Image Input: Unlike traditional photogrammetry requiring dozens of images from multiple angles, SAM 3D generates full 3D reconstructions from a single photograph, dramatically simplifying capture workflows.
State-of-the-Art Quality: SAM 3D Objects achieves at least 5:1 win rates over competing methods in human preference tests, demonstrating superior geometry and texture quality across diverse object categories.
Occlusion Handling: The models infer plausible geometry for occluded and invisible surfaces based on learned priors, generating complete 3D models even when significant portions are hidden in the input image.
Human-Specialized Performance: SAM 3D Body delivers step-change accuracy improvements on 3D human pose benchmarks compared to previous methods, particularly for challenging poses, occlusions, and diverse clothing.
Interactive Refinement: Prompt-based control through 2D keypoints and masks enables users to guide reconstruction in ambiguous cases, combining automated inference with human expertise.
SAM 3D Disadvantages
Inference vs. Ground Truth: While SAM 3D infers plausible 3D geometry, invisible surfaces represent learned assumptions rather than measured reality. Applications requiring geometric precision may need multi-view verification.
Computational Cost: Generating high-quality textured meshes requires seconds of GPU processing per object, which may be prohibitive for real-time applications or large-scale batch processing without distributed infrastructure.
Domain Specialization: SAM 3D Objects and Body are separate models specialized for different content types. General scenes containing both objects and people require coordinated deployment of both models.
Training Data Bias: Model performance reflects training dataset characteristics. Objects or poses underrepresented in training data may receive lower-quality reconstructions, particularly for rare object categories or unusual human configurations.
Format Compatibility: While outputs export to standard 3D formats, specialized features like the Meta Momentum Human Rig require compatible software for full utilization, potentially limiting downstream tool compatibility.
How Does It Compare?
SAM 3 and SAM 3D enter competitive markets with established alternatives, but introduce unique capabilities that differentiate them from existing solutions:
Image Segmentation Models
YOLOv8, YOLO11 Instance Segmentation: The YOLO series offers real-time object detection and instance segmentation optimized for speed. Recent models like YOLO11 achieve high frame rates on edge devices with efficient architectures. YOLO-SAM, a hybrid combining YOLO-World and EfficientSAM, demonstrates joint detection and segmentation in unified pipelines. Compared to SAM 3, YOLO models prioritize inference speed and edge deployment over zero-shot generalization. YOLO requires training on labeled datasets for specific object categories, while SAM 3 handles open-vocabulary concepts without fine-tuning. For applications requiring real-time edge processing with fixed object categories, YOLO excels. For flexible, zero-shot segmentation across arbitrary concepts described by text, SAM 3 provides superior generalization.
Grounding DINO: A vision-language model enabling text-prompted object detection combining transformer architectures with language grounding. Grounding DINO accepts complex text queries and localizes described objects. Compared to SAM 3, Grounding DINO focuses on detection (bounding boxes) rather than precise segmentation masks, though it can be combined with SAM-style models for full segmentation pipelines. SAM 3 integrates detection and segmentation in a unified architecture.
GLEE, OWLv2, LLMDet: Specialist open-vocabulary detection and segmentation models evaluated against SAM 3 in Meta’s benchmarks. SAM 3 demonstrates superior performance, particularly on rare and fine-grained concepts, with users preferring SAM 3 outputs approximately 3:1 over OWLv2, the strongest baseline.
Google Cloud Vision, AWS Rekognition, Azure Computer Vision: Enterprise cloud vision APIs offering pre-trained segmentation, object detection, and image understanding. These platforms provide production-grade reliability, enterprise security, and comprehensive documentation. Compared to SAM 3, cloud APIs offer simpler integration for common use cases but lack open-vocabulary flexibility and require ongoing API costs. SAM 3’s open-source nature enables customization and on-premises deployment unavailable with closed cloud services.
OpenCV: The foundational open-source computer vision library providing 2,500+ algorithms for classical vision tasks. OpenCV offers mature, production-tested implementations but relies primarily on traditional computer vision techniques rather than foundation models. SAM 3 provides state-of-the-art deep learning capabilities OpenCV users can integrate for segmentation tasks exceeding classical methods.
3D Reconstruction Technologies
TripoSR: Open-source single-image 3D reconstruction model optimized for rapid generation. TripoSR is one of the most widely used open-source models for single-image AI reconstruction, with fast inference suitable for e-commerce and content creation. Compared to SAM 3D, TripoSR emphasizes speed over maximum quality and specializes in isolated object reconstruction rather than full scenes or human body estimation. TripoSR serves rapid prototyping use cases; SAM 3D targets applications requiring higher fidelity and specialized human reconstruction.
COLMAP + OpenMVS: The gold standard open-source photogrammetry pipeline combining structure-from-motion and multi-view stereo. COLMAP provides exceptional geometric accuracy from dozens of calibrated images. Compared to SAM 3D’s single-image approach, traditional photogrammetry achieves superior precision through multi-view triangulation but requires extensive capture workflows. COLMAP suits surveying and cultural heritage where measurement accuracy is paramount; SAM 3D enables casual reconstruction from available photography.
Agisoft Metashape, RealityCapture, Bentley ContextCapture: Commercial photogrammetry platforms offering professional-grade reconstruction from image sequences. These tools deliver production quality for architecture, engineering, and visual effects with mature workflows and enterprise support. Compared to SAM 3D, commercial photogrammetry achieves higher geometric fidelity through multi-view processing but demands specialized capture and substantial processing time. SAM 3D trades some accuracy for dramatic workflow simplification and instant reconstruction.
Gaussian Splatting, NeRF-based Methods: Neural rendering techniques that represent 3D scenes as continuous radiance fields or collections of 3D Gaussians optimized for novel view synthesis. These methods excel at photorealistic rendering but require multiple input views and lengthy per-scene optimization. Compared to SAM 3D’s feed-forward inference from single images, neural rendering provides superior view-dependent effects but must be re-optimized for each new scene rather than generalizing to unseen objects.
PIFuHD, ICON, ECON: Specialized human reconstruction methods predicting 3D body shape from single images. These models focus exclusively on human subjects with varying approaches to clothing and geometry detail. SAM 3D Body competes directly in this space, distinguishing itself through the Meta Momentum Human Rig format, prompt-based refinement capabilities, and superior benchmark performance on challenging poses and occlusions.
Matterport, FARO, Leica Geosystems: Hardware-based 3D scanning solutions using LiDAR, structured light, or photogrammetry. These systems deliver millimeter-precision scanning for architecture, construction, and industrial inspection. Compared to SAM 3D’s software-only approach using standard cameras, hardware scanners achieve unmatched geometric accuracy but require expensive specialized equipment and expert operation. SAM 3D democratizes 3D reconstruction for casual users; hardware scanners serve precision engineering applications.
NVIDIA Instant NeRF, Omniverse: NVIDIA’s 3D reconstruction and simulation platforms leveraging GPU acceleration and AI. Instant NeRF provides real-time neural scene reconstruction, while Omniverse offers complete 3D creation workflows. Compared to SAM 3D, NVIDIA’s solutions target professional creators with comprehensive tool suites and real-time rendering pipelines, while SAM 3D is positioned as an accessible research model for rapid prototyping rather than a complete production platform.
Key Differentiators
SAM 3’s Unique Position: The integration of open-vocabulary text prompting into segmentation and tracking creates workflows impossible with traditional visual-only models. The unified architecture handling both concept-level detection and visual refinement eliminates tool-switching friction. Meta’s open-source release with comprehensive benchmarks accelerates community innovation while Instagram integration demonstrates production viability.
SAM 3D’s Unique Position: The single-image reconstruction capability combined with specialized human body modeling through the Meta Momentum Human Rig format addresses practical use cases where multi-view capture proves impractical. The prompt-based refinement mechanism enables human-AI collaboration for challenging reconstruction scenarios. The dual-model approach (Objects and Body) provides specialized excellence rather than compromised generality.
Ecosystem Strength: Both models benefit from Meta’s research infrastructure, large-scale data engines, and integration with the broader Segment Anything Collection. The simultaneous release creates synergies—users can segment objects with SAM 3 and reconstruct them with SAM 3D in coordinated workflows.
Pricing and Availability
SAM 3 and SAM 3D are available as open-source releases under Meta’s research licensing terms. Model weights, inference code, and training frameworks are publicly accessible through GitHub repositories (facebookresearch/sam3 and related repos) and Hugging Face model hubs.
The Segment Anything Playground at ai.meta.com provides browser-based access for experimenting with both models without local installation. Users can upload images, enter text prompts for SAM 3 segmentation, or generate 3D reconstructions with SAM 3D through the web interface.
For production deployment, users download model checkpoints and implement inference pipelines using provided Python APIs. SAM 3 requires modern GPU hardware (NVIDIA A100 or equivalent recommended) for real-time performance, though inference is possible on consumer GPUs with longer latency. SAM 3D similarly demands GPU acceleration for reasonable processing times.
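For local deployment, a common first step is choosing a device and numeric precision before loading a checkpoint. The sketch below uses only standard PyTorch calls and leaves the model-loading line as a commented placeholder, since the exact SAM 3 loading API is not covered in this article.

```python
import torch

# Pick a device and precision before loading a checkpoint. Lower precision
# reduces memory pressure on consumer GPUs at some cost in accuracy headroom;
# CPU inference stays in float32.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

# model = load_sam3_checkpoint("sam3_checkpoint.pt")  # placeholder: exact loading API not shown here
# model = model.to(device=device, dtype=dtype)

print(f"Inference target: {device}, precision: {dtype}")
```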
The SA-Co benchmark dataset for evaluating concept segmentation is publicly released to support reproducible research and model comparison.
Commercial use permissions depend on Meta’s specific research license terms, which users should review for their deployment scenarios. Meta’s integration of SAM 3 into Instagram products indicates the company’s own production use beyond pure research.
Future releases may include optimized model variants for edge deployment, additional pre-trained checkpoints, and expanded documentation based on community feedback and use case evolution.
Final Thoughts
SAM 3 and SAM 3D represent significant advances in making sophisticated computer vision accessible and practical for diverse applications. SAM 3’s open-vocabulary concept segmentation transforms an expert-only capability—writing precise prompts for vision models—into natural language interaction approachable for general users. The unified architecture handling images, videos, and hybrid prompting modes delivers flexibility rare in foundation models.
SAM 3D addresses the longstanding friction in 3D reconstruction workflows where capturing dozens of photos from precise angles proves impractical for casual users. Single-image reconstruction, while sacrificing some geometric precision compared to multi-view photogrammetry, enables 3D workflows previously inaccessible without specialized equipment or expertise.
Both models serve distinct but complementary audiences. SAM 3 appeals to computer vision researchers accelerating dataset annotation, content creators requiring advanced editing capabilities, robotics engineers building perception systems, and application developers integrating vision understanding into products. SAM 3D targets game developers rapidly prototyping assets, e-commerce platforms adding 3D product views, AR/VR creators reconstructing real-world content, and animators extracting character rigs from reference imagery.
The open-source release strategy maximizes research impact and community innovation while Meta simultaneously deploys the technology in consumer products like Instagram. This dual approach—advancing academic research while demonstrating production viability—strengthens both the models’ credibility and practical value.
Limitations remain: SAM 3’s computational scaling challenges with high object counts, the emerging state of production tooling, and GPU requirements for real-time performance constrain some deployment scenarios. SAM 3D’s inferred geometry for occluded surfaces, separate models for objects versus humans, and processing time per reconstruction limit applications requiring geometric precision or high-throughput batch processing.
As research models released in November 2025, both SAM 3 and SAM 3D will evolve through community contributions, optimization research, and Meta’s continued development. Early adopters should anticipate model updates, best practices refinement, and expanding documentation as the ecosystem matures.
For developers, researchers, and creators working at the intersection of vision AI and practical applications, both models represent powerful tools worth integrating and experimenting with. The combination of state-of-the-art performance, open-source accessibility, and Meta’s backing creates opportunities to build novel experiences impossible with previous generation computer vision technology.

