youtube-mcp-server

youtube-mcp-server

03/01/2026
A powerful Model Context Protocol (MCP) server for YouTube video transcription and metadata extraction. - mourad-ghafiri/youtube-mcp-server
github.com

Overview

The youtube-mcp-server developed by Mourad Ghafiri provides a standardized bridge between AI clients—such as Claude Desktop or Cursor—and YouTube’s vast video library. Launched in late 2025 and updated through early 2026, the server eliminates the need for manual copy-pasting by allowing AI models to “call” tools that fetch video details and text content automatically. It is particularly valued for its use of the OpenAI Whisper engine and Voice Activity Detection (VAD) to ensure high-quality, human-readable transcriptions. By operating as a local or SSE (Server-Sent Events) server, it prioritizes performance and privacy, processing data in-memory without the permanent disk overhead of full video downloads.

Key Features

  • Advanced Metadata Extraction: Retrieves exhaustive video details including title, description, view counts, duration, uploader information, upload date, thumbnails, tags, and categories through the get_video_info tool.
  • OpenAI Whisper Transcription: Generates high-quality speech-to-text using the Whisper model family (tiny to large versions), ensuring accurate capture of spoken content even in challenging audio environments.
  • Multilingual and Translation Support: Supports transcription in 99+ languages and offers a translation feature (e.g., to English) for non-native video content via the transcribe_video tool.
  • Silero Voice Activity Detection (VAD): Employs the Silero VAD engine for precise audio segmentation, ensuring the AI only processes speech and ignores silent segments or background noise for cleaner transcripts.
  • In-Memory Processing Pipeline: Optimized for efficiency by handling data in-memory, which speeds up the transcription process and minimizes local disk I/O.
  • Intelligent File-Based Caching: Includes a local caching system (defaulting to a transcriptions directory) to store previously processed data, preventing redundant API calls and re-transcription of the same URLs.
  • Hardware Acceleration Support: Can be configured to leverage GPU acceleration for Whisper inference, significantly reducing the time required for transcribing long-form video content.

How It Works

The server functions by exposing specific “tools” to an MCP-compliant AI client. When a user provides a YouTube URL to an assistant like Claude, the client sends a request to the youtube-mcp-server. For metadata, the server uses yt-dlp to fetch video details without a full download. For transcription, the server extracts the audio stream, processes it through Silero VAD to identify speech segments, and then runs those segments through a Whisper model (such as base or medium). The resulting text is then sent back to the AI assistant’s context. The server can run as a local process via stdio or as a persistent web service using SSE at a local address like http://127.0.0.1:8000.

Use Cases

  • AI-Driven Video Summarization: Enabling Claude or ChatGPT to read a full transcript of a 2-hour lecture to provide a concise bulleted summary or key takeaways.
  • Tutorial Step Extraction: Automatically converting a video tutorial (e.g., a cooking or coding video) into a structured, step-by-step written guide.
  • Content Metadata Databases: Building automated pipelines to categorize and tag thousands of videos for research or archival purposes.
  • Sentiment and Topic Analysis: Running AI analysis on transcripts to detect the tone of a video or identify specific mentioned entities and brands.
  • Multilingual Research: Translating and summarizing foreign-language news or educational videos to stay informed across language barriers.

Pros & Cons

Advantages

  • High Transcription Quality: Local use of Whisper often produces better results than YouTube’s default auto-generated captions, especially for technical or accented speech.
  • Privacy and Efficiency: Processes audio in-memory and only fetches the necessary streams, avoiding the privacy risks and disk usage of third-party “YouTube downloader” websites.
  • Standardized Protocol: Being MCP-compliant means it works instantly with any modern AI IDE (Cursor) or assistant (Claude) that supports the protocol.

Disadvantages

  • Technical Setup Required: Requires installation of Python, uv, and configuration of environment variables, which may be difficult for non-technical users.
  • Resource Intensive: Running Whisper models (especially medium or large) requires significant CPU/GPU and RAM, potentially slowing down the host machine during processing.
  • YouTube Availability Dependence: Like all such tools, it depends on the stability of yt-dlp and YouTube’s internal APIs, which are subject to frequent changes.

How Does It Compare?

The MCP ecosystem features several specialized tools for interacting with YouTube, ranging from lightweight scrapers to heavy-duty transcription engines.

YouTube Transcript Server (Go Version)

Implemented in the Go language, this version focuses on high performance and concurrency. It excels at batch processing multiple videos and offers native support for Redis caching and proxy rotation. While faster for high-volume metadata, it typically relies on existing YouTube subtitles rather than generating new ones via Whisper, making it better for speed but potentially lower in transcription accuracy for videos without manual captions.

TranscriptAPI MCP

A cloud-integrated solution that connects to a dedicated external API service. It provides a more “plug-and-play” experience with structured JSON data and handles rate limits and proxying on the backend. However, it usually involves a subscription or usage-based cost and requires sending your data to a third-party service, unlike the local-first approach of Mourad Ghafiri’s server.

Fetch MCP / Web Scraper MCP

General-purpose web scraping servers that can “see” a YouTube page’s HTML. These are useful for fetching basic metadata like titles or reading comments, but they cannot “hear” the video. They are inadequate for full transcription or deep video analysis but useful as a lightweight alternative for simple metadata fetching.

Yt-dlp MCP

A direct wrapper around the command-line tool yt-dlp. It provides the most comprehensive access to raw video data, formats, and metadata. While powerful, it lacks the built-in AI transcription (Whisper) logic found in the specialized youtube-mcp-server, requiring the user to send raw files to the AI rather than processed text.

Final Thoughts

youtube-mcp-server stands out as one of the most comprehensive local solutions for AI video analysis in the 2026 MCP ecosystem. By combining yt-dlp for metadata with Whisper and Silero VAD for transcription, it provides a “full-stack” video understanding capability that few other servers match. While the hardware requirements and technical setup might be barriers for some, the resulting ability to have an AI assistant “watch” and “listen” to any video with high accuracy makes it an essential tool for researchers, developers, and power users of the Model Context Protocol.

A powerful Model Context Protocol (MCP) server for YouTube video transcription and metadata extraction. - mourad-ghafiri/youtube-mcp-server
github.com