SAM Audio

19/12/2025
With SAM Audio, you can use simple text prompts to accurately separate any sound from any audio or audio-visual source.
ai.meta.com

Overview

Tired of audio separation tools that leave unwanted noise mixed into the result? SAM Audio is a unified model designed to isolate a chosen sound from an audio or video source. Whether you’re a seasoned audio engineer or a budding content creator, it aims to streamline your workflow by isolating speech, music, and sound effects from a single mix.

Key Features

SAM Audio stands out with its innovative approach to audio separation, offering a suite of powerful features:

  • Text and Visual-Based Sound Separation: Go beyond simple audio analysis. Isolate sounds by typing a descriptive text prompt (e.g., “dog barking”), visually clicking on specific elements within a video, or defining precise timeframes.
  • Handles Speech, Music, and Effects: This unified model is adept at separating all core audio components, ensuring you get clean, distinct tracks for vocals, melodies, and ambient sounds.
  • Works Across Media Formats: Whether your source is an audio file or a video, SAM Audio is equipped to handle it, making it incredibly versatile for various projects.
  • Single Model for All Sound Types: Eliminate the need for multiple specialized tools. SAM Audio’s singular, promptable model simplifies the process and ensures consistent results across all your audio separation needs.

How It Works

Getting started with SAM Audio is remarkably straightforward. The process is designed for efficiency and ease of use:

  1. Upload Your Media: Begin by uploading your audio or video file directly into the SAM Audio platform.
  2. Choose Your Isolation Method: Select your preferred method for identifying the sound you want to extract. This could be by entering a descriptive text prompt, using your cursor to visually pinpoint the sound in a video, or specifying a particular segment of time.
  3. AI Does the Work: SAM Audio’s intelligent AI then processes your request, meticulously separating the chosen sound from the rest of the audio.
  4. Receive Your Cleaned File: The AI outputs a clean, isolated sound file, ready for immediate use in your projects.

Use Cases

The versatility of SAM Audio opens up a world of applications for creators and professionals alike:

  • Video and Audio Editing: Effortlessly remove background noise, isolate dialogue, or extract specific sound effects to enhance your video productions.
  • Podcast Production: Clean up interviews by separating voices from ambient room noise or unwanted sounds, ensuring crystal-clear audio for your listeners.
  • Sound Design and Remixing: Extract individual musical elements or sound effects for creative remixing, sampling, or building entirely new soundscapes.
  • Accessibility or Transcription Support: Isolate speech from complex audio mixes to improve the accuracy of transcriptions or to make content more accessible for individuals with hearing impairments.

Pros & Cons

Advantages

  • Versatile: Its ability to handle various isolation methods and sound types makes it a one-stop solution.
  • Accurate: The unified model delivers precise separation, minimizing unwanted artifacts.
  • Works Across Sound Types: Seamlessly separates speech, music, and sound effects within a single process.

Disadvantages

  • Requires High Processing Power: To achieve its impressive accuracy, SAM Audio demands significant computational resources.
  • Input Quality: Results depend on the quality and accessibility of the source audio or video being processed.

How Does It Compare?

In the competitive landscape of audio separation tools, SAM Audio positions itself as a powerful contender. It directly competes with established names like Demucs, Spleeter, and AudioShake. While these tools offer robust audio separation capabilities, SAM Audio’s key differentiator lies in its unified, promptable model that integrates text and visual-based isolation alongside traditional time-based methods, offering a more intuitive and flexible user experience across all sound types.

Final Thoughts

SAM Audio represents a significant leap forward in audio separation technology. Its unified approach, coupled with innovative text and visual prompting, simplifies complex audio tasks, making it an indispensable tool for anyone serious about audio quality and creative control. While it demands robust processing power, the accuracy and versatility it offers are well worth the investment for professionals and hobbyists alike.

Overview

SAM Audio is Meta’s first unified multimodal model for audio separation, announced on December 16, 2025. The model isolates specific sounds from complex audio mixtures using text, visual, or temporal prompts. Unlike traditional audio separation tools that require separate models for different sound types, SAM Audio handles speech, music, and general sound effects within a single architecture. The system processes audio faster than real-time (real-time factor ~0.7) and scales efficiently across model sizes from 500 million to 3 billion parameters.
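To make the real-time factor concrete, here is a minimal sketch of the arithmetic, assuming the factor means processing time divided by audio duration (the usual definition). The function names are illustrative only, and actual throughput will depend on hardware and model size.

```python
def processing_seconds(audio_seconds: float, rtf: float = 0.7) -> float:
    """Estimate wall-clock processing time for a clip.

    rtf is the real-time factor, assumed here to be processing time
    divided by audio duration. The 0.7 default is the approximate
    figure quoted in the overview.
    """
    if audio_seconds < 0 or rtf <= 0:
        raise ValueError("audio_seconds must be >= 0 and rtf must be > 0")
    return audio_seconds * rtf


def speedup(rtf: float = 0.7) -> float:
    """Seconds of audio processed per second of compute."""
    if rtf <= 0:
        raise ValueError("rtf must be > 0")
    return 1.0 / rtf


if __name__ == "__main__":
    clip = 10 * 60  # a 10-minute recording, in seconds
    print(f"{processing_seconds(clip) / 60:.1f} minutes")  # prints "7.0 minutes"
    print(f"{speedup():.2f}x real time")                   # prints "1.43x real time"
```

Under this assumption, a factor below 1.0 means the model finishes before the clip would have finished playing.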

Key Features

SAM Audio delivers audio separation through three integrated prompting methods:

  • Text-Based Sound Separation: Input natural language descriptions such as “dog barking” or “singing voice” to isolate target sounds from mixed audio.
  • Visual-Based Sound Separation: Click on sound-producing objects or people within video frames to extract corresponding audio tracks using frame-level visual features.
  • Span-Based Temporal Prompting: Define specific time ranges to isolate sounds occurring within precise segments, described as an industry-first capability for targeted audio extraction.
  • Unified Model Architecture: Single foundation model eliminates need for multiple specialized tools, ensuring consistent performance across speech, music, and sound effect separation.
  • Cross-Media Format Support: Processes both standalone audio files and video files, extracting audio tracks for analysis.
  • Dual Output Generation: Produces two waveforms per request—target (isolated sound) and residual (everything else)—enabling both extraction and removal workflows.
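The dual-output design can be illustrated with a small sketch. The `SeparationResult` container and `remix` helper below are hypothetical and are not part of any SAM Audio API; they only show how a target and a residual waveform, represented here as plain lists of samples, could support extraction, removal, and rebalancing.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeparationResult:
    """Hypothetical container for the two waveforms of one request."""
    target: List[float]    # the isolated sound
    residual: List[float]  # everything else


def remix(result: SeparationResult, target_gain: float) -> List[float]:
    """Blend the two waveforms sample by sample.

    target_gain = 0.0 keeps only the residual (removal workflow).
    Values between 0 and 1 turn the target down without removing it.
    """
    if len(result.target) != len(result.residual):
        raise ValueError("target and residual must have the same length")
    return [target_gain * t + r for t, r in zip(result.target, result.residual)]


if __name__ == "__main__":
    result = SeparationResult(target=[0.5, -0.5], residual=[0.25, 0.25])
    extracted = result.target          # extraction: use the target as-is
    removed = remix(result, 0.0)       # removal: residual only
    ducked = remix(result, 0.5)        # target at half level
    print(extracted, removed, ducked)
```

Whether the target and residual sum back to the original mixture exactly is not stated in the source, so a gain of 1.0 should be treated as an approximation of the input rather than a reconstruction.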

How It Works

The separation process operates through four stages:

  1. Upload Your Media: Submit audio or video files through the Segment Anything Playground or API.
  2. Select Prompt Type: Choose text description, visual click on video, time span specification, or combine multiple prompt types for precise targeting.
  3. AI Processing: The diffusion transformer architecture processes the audio mixture with encoded prompts, applying self-attention and cross-attention mechanisms.
  4. Receive Isolated Audio: Download the separated target sound and residual audio as independent waveform files.
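The prompt-selection step above can be sketched as a small data structure. Everything here is an assumption for illustration: `SeparationRequest`, its field names, and the click encoding are hypothetical and do not reflect the Segment Anything Playground or any published API. The sketch only shows how text, visual, and span prompts might be combined and validated before submission.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class SeparationRequest:
    """Hypothetical request combining the three prompt types."""
    media_path: str
    text: Optional[str] = None                       # e.g. "dog barking"
    click: Optional[Tuple[float, int, int]] = None   # (time_s, x_px, y_px) in a video
    spans: List[Tuple[float, float]] = field(default_factory=list)  # (start_s, end_s)

    def prompt_types(self) -> List[str]:
        """Names of the prompt types that are set on this request."""
        types = []
        if self.text:
            types.append("text")
        if self.click is not None:
            types.append("visual")
        if self.spans:
            types.append("span")
        return types

    def validate(self) -> None:
        """Require at least one prompt and well-formed time spans."""
        if not self.prompt_types():
            raise ValueError("at least one prompt is required")
        for start, end in self.spans:
            if start < 0 or end <= start:
                raise ValueError(f"invalid span: ({start}, {end})")


if __name__ == "__main__":
    request = SeparationRequest(
        media_path="interview.mp4",
        text="dog barking",
        spans=[(12.0, 15.5)],
    )
    request.validate()
    print(request.prompt_types())  # prints ['text', 'span']
```

Validating up front reflects the point that the model needs an explicit prompt: a request with no text, click, or span has nothing to separate.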

Use Cases

SAM Audio applies to diverse professional workflows:

  • Video and Audio Editing: Remove background noise, isolate dialogue, or extract specific sound effects for post-production enhancement.
  • Podcast Production: Clean interviews by separating voices from ambient room noise, HVAC systems, or unexpected interruptions like animal sounds.
  • Music Production and Remixing: Extract individual instruments or vocal stems from mixed recordings for sampling, remixing, or rights clearance analysis.
  • Film and Television: Isolate dialogue from set noise, separate Foley effects from production audio, or extract specific sonic elements for sound design.
  • Scientific Research: Analyze specific animal vocalizations in field recordings, isolate machinery sounds in acoustic studies, or process audio data for machine learning training.
  • Accessibility Enhancement: Improve transcription accuracy by isolating speech from complex audio mixes, or create enhanced listening experiences for hearing-impaired users.

Pros & Cons

Advantages

  • Multimodal Prompting: Three distinct input methods provide intuitive control that mirrors natural human sound recognition.
  • State-of-the-Art Performance: Achieves performance comparable to or exceeding specialized single-purpose models across speech, music, and general sound categories.
  • Processing Speed: Real-time factor of ~0.7 enables faster-than-real-time processing, supporting both cloud and potential edge deployment scenarios.
  • Comprehensive Evaluation Framework: Includes SAM Audio-Bench benchmark covering all major audio domains and SAM Audio Judge for objective quality assessment without reference tracks.
  • Open Availability: Model checkpoints (small, base, large) and code released for research and commercial use, enabling custom application development.

Disadvantages

  • Prompt Dependency: Requires explicit prompts for separation; cannot automatically identify and separate all sound sources without user guidance.
  • No Audio-Based Prompting: Cannot use an existing audio sample as a reference prompt, limiting certain sound-matching workflows.
  • Computational Requirements: Large model variant (3B parameters) requires significant GPU memory for local deployment, though smaller variants offer efficiency trade-offs.
  • Quality Variation: Performance varies across sound categories; general sound effects show lower subjective scores (3.50) compared to professional instrument separation (4.49).
  • Legal and Ethical Considerations: Potential for misuse in surveillance or unauthorized audio extraction requires responsible deployment guidelines.
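For a rough sense of the computational requirements listed above, the following sketch estimates the memory needed for model weights alone. The parameter counts come from the overview (500 million to 3 billion); the numeric precisions are assumptions, since the source does not state how the checkpoints are stored, and activations and audio buffers would add to these figures.

```python
# Bytes per parameter for common numeric formats (assumed, not from the source).
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2, "int8": 1}


def weight_memory_gib(num_params: int, dtype: str = "float16") -> float:
    """Memory for model weights only, in GiB (2**30 bytes)."""
    if dtype not in BYTES_PER_PARAM:
        raise ValueError(f"unknown dtype: {dtype}")
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


if __name__ == "__main__":
    for label, params in [("500M", 500_000_000), ("3B", 3_000_000_000)]:
        fp16 = weight_memory_gib(params, "float16")
        fp32 = weight_memory_gib(params, "float32")
        print(f"{label}: {fp16:.2f} GiB (float16), {fp32:.2f} GiB (float32)")
    # prints:
    # 500M: 0.93 GiB (float16), 1.86 GiB (float32)
    # 3B: 5.59 GiB (float16), 11.18 GiB (float32)
```

These are lower bounds: real GPU usage during inference is higher, which is why the smaller variants are the more realistic choice on consumer hardware.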

How Does It Compare?

SAM Audio vs. Demucs

Demucs (Meta’s earlier music separation model):

  • Specialization: Optimized exclusively for music source separation (vocals, drums, bass, other)
  • Architecture: Uses convolutional U-Net architecture with hybrid time-frequency domain processing
  • Performance: Strong musical instrument separation but limited to a fixed set of predefined stems (four by default)
  • Prompting: No natural language or visual prompting; separates all sources automatically
  • Use Case Fit: Music production only; cannot handle general sound effects or speech-specific tasks
  • Key Difference: SAM Audio’s multimodal prompting enables targeted extraction of any sound, while Demucs performs blind separation into fixed categories

SAM Audio vs. Spleeter

Spleeter (Deezer’s open-source separation library):

  • Model Variants: Offers 2-stem, 4-stem, and 5-stem separation models
  • Speed: Very fast inference optimized for batch processing
  • Quality: Good separation quality but produces artifacts in complex mixes
  • Flexibility: Fixed output stems (vocals/accompaniment, vocals/drums/bass/other, or vocals/drums/bass/piano/other)
  • Integration: Command-line tool and Python library; no visual interface
  • Key Difference: SAM Audio provides granular control via prompts and handles arbitrary sound types beyond music, while Spleeter offers only predefined separation

SAM Audio vs. AudioShake

AudioShake (Commercial audio separation service):

  • Service Model: API-based commercial service with per-minute pricing
  • Capabilities: Professional-grade stem separation for music licensing and sync
  • Quality: High-quality results optimized for commercial music catalog processing
  • Cost: Commercial pricing structure; not freely available for research
  • Accessibility: Requires API integration; no direct user interface for experimentation
  • Key Difference: SAM Audio offers open-source accessibility and multimodal prompting, while AudioShake focuses on commercial music applications with enterprise-grade reliability

SAM Audio vs. Ultimate Vocal Remover (UVR)

Ultimate Vocal Remover (Open-source GUI application):

  • Interface: Desktop application with graphical user interface
  • Models: Bundles multiple pre-trained models for vocal removal
  • Processing: Local processing on consumer hardware
  • Specialization: Primarily focused on vocal/instrumental separation
  • Ease of Use: Point-and-click interface for non-technical users
  • Key Difference: SAM Audio provides unified model with multiple prompt types and handles broader sound categories; UVR focuses specifically on vocal extraction with simpler interface but less flexibility

Final Thoughts

SAM Audio represents a paradigm shift in audio separation by unifying multiple prompting modalities within a single foundation model. The integration of text, visual, and temporal cues provides intuitive control that aligns with natural human sound recognition processes, significantly lowering the barrier to precise audio editing.

The model’s open release strategy, comprehensive evaluation frameworks, and performance across diverse audio categories position it as a foundational tool for both research and commercial applications. While specialized tools may still excel in narrow domains, SAM Audio’s versatility enables new workflows that previously required chaining multiple tools or manual editing.

Organizations should evaluate SAM Audio based on their specific use cases: media production teams gain rapid prototyping capabilities, researchers obtain a reproducible baseline for audio analysis tasks, and developers can build custom applications without training separation models from scratch. The availability of multiple model sizes facilitates deployment across resource constraints, from cloud servers to potential edge devices.

As with any powerful AI capability, responsible deployment requires consideration of ethical implications, particularly regarding privacy and consent in audio recording contexts. Meta’s provision of the model under open terms necessitates that implementers establish appropriate governance frameworks for their specific applications.
