Transcribe YouTube Videos Automatically in Any Language

To transcribe YouTube videos automatically in multiple languages, run the video URL through a Whisper-based actor that auto-detects the spoken language and exports timestamped SRT, VTT, or JSON files. Whisper supports 90+ languages natively, including Spanish, Mandarin, Hindi, Arabic, French, German, and Japanese, with no manual language flag needed. This approach beats YouTube's auto-captions because it handles accents, code-switching, and technical jargon far better — and you get clean subtitle files instead of scraping the captions UI.

Quick Answer

The fastest way to transcribe YouTube videos automatically across languages is to use a Faster-Whisper-powered Apify actor that accepts video URLs and returns transcripts in TXT, JSON, SRT, and VTT. Faster-Whisper auto-detects 90+ source languages, so a single workflow handles Spanish, Korean, or Portuguese videos without separate configs. Output includes per-segment timestamps, making it usable for subtitles, search indexing, or translation pipelines. Pricing runs about $0.12 per audio minute, so a 10-minute video costs roughly $1.20. You skip Google Cloud Speech, AssemblyAI keys, and ffmpeg setup entirely.

Why not just use YouTube's built-in auto-captions?

YouTube's auto-generated captions exist, but three issues kill them for any serious workflow:

Accuracy drops sharply on non-English audio. Word error rates for YouTube auto-captions in Spanish hover around 18–25%, vs. 8–12% for Whisper-large. For Hindi and Arabic, the gap widens further.
No reliable API. YouTube's caption endpoints are unstable, region-restricted, and frequently return empty tracks even when captions are visible in the player.
No segment-level JSON. You get a single subtitle file in one format, not structured data with confidence scores, language detection, or summary fields.

Running your own Whisper-based pipeline solves all three. Faster-Whisper (a CTranslate2 reimplementation of OpenAI Whisper) runs up to 4x faster than vanilla Whisper at comparable accuracy, which is why many production transcription stacks use it.

How do I transcribe a YouTube video in any language automatically?

The shortest path:

Grab the YouTube URL.
POST it to the YouTube Transcriber & Subtitles (JSON/SRT/VTT) actor.
Pull the resulting dataset — you'll have a .txt, .json, .srt, and .vtt plus a summary.

Minimal Apify API call:

curl -X POST "https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "videoUrl": "https://www.youtube.com/watch?v=dQw4w9WgXcQ",
    "outputFormats": ["srt", "vtt", "json", "txt"]
  }'

The actor downloads the audio track, runs Faster-Whisper locally on the Apify worker (no external API key needed), and writes results to the run's dataset and key-value store. Language detection is based on the first ~30 seconds of audio, so you don't need to pass a language code.
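For scripts rather than shell one-liners, the same call can be sketched in Python with only the standard library. This is a minimal sketch, not the actor's official client: the helper names build_run_request and start_run are hypothetical, and the input fields videoUrl and outputFormats are simply copied from the curl example above. It sends the token as a Bearer header instead of a query parameter.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, video_url: str,
                      output_formats: list[str]) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an actor run."""
    # Input fields mirror the curl example; they are assumptions about the actor.
    payload = {"videoUrl": video_url, "outputFormats": output_formats}
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{actor_id}/runs",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Bearer auth keeps the token out of URLs and access logs.
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def start_run(request: urllib.request.Request) -> dict:
    """Send the request and return the run record from the "data" envelope."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)["data"]
```

Calling start_run(build_run_request("YOUR_ACTOR_ID", token, url, ["srt", "json"])) returns the run record; its id and defaultDatasetId fields are what you need later to poll the run and fetch results.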

What the JSON output looks like

{
  "language": "es",
  "language_probability": 0.98,
  "duration": 612.4,
  "segments": [
    {
      "start": 0.0,
      "end": 4.32,
      "text": "Bienvenidos al canal, hoy hablamos de inteligencia artificial."
    }
  ],
  "summary": "El video introduce conceptos básicos de IA generativa..."
}

The language field is ISO 639-1, and language_probability lets you flag low-confidence detections (e.g., < 0.7) for manual review — useful when videos mix languages.
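The low-confidence check described above might look like the following. It is a small sketch that assumes the transcript is a dict shaped like the sample JSON; the function name needs_review is hypothetical, and the 0.7 threshold is only the example value from the text.

```python
def needs_review(transcript: dict, threshold: float = 0.7) -> bool:
    """Return True when language detection is too uncertain to trust."""
    probability = transcript.get("language_probability")
    # A missing probability is treated as uncertain rather than as a pass.
    return probability is None or probability < threshold


sample = {"language": "es", "language_probability": 0.98, "segments": []}
print(needs_review(sample))  # False
```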

Which languages does Whisper-based transcription support?

Faster-Whisper supports the same 99 languages as OpenAI's Whisper-large-v3, with usable accuracy on roughly 60 of them. Categories by quality tier:

Tier 1 (WER under 10%): English, Spanish, Italian, German, Portuguese, Polish, Catalan, Dutch
Tier 2 (WER 10–20%): French, Russian, Chinese, Japanese, Korean, Turkish, Ukrainian, Czech, Swedish
Tier 3 (WER 20–35%): Hindi, Arabic, Vietnamese, Thai, Indonesian, Tamil, Romanian
Tier 4 (usable but verify): Welsh, Swahili, Bengali, Punjabi, Yoruba

For comparison, Google Cloud Speech-to-Text charges $0.024/minute but requires you to specify the language up front and has worse zero-shot accuracy on Tier 3 languages.

How do I handle multilingual videos that switch languages?

For videos where speakers switch between, say, English and Mandarin (common in tech interviews and tutorials), Whisper often handles it well within a single pass. Language detection is based on the opening audio, so the dominant language conditions the run, but words in the second language usually come through in their own script, and the resulting mixed-language SRT files are clean.

Two practical tactics:

Use the task: transcribe setting (default) to preserve the original language per segment. Don't set task: translate unless you want everything forced into English.
Post-process by detecting per-segment language. Run langdetect or fasttext over each text field in the JSON output if you need a language tag per subtitle line.

If your video is 80% Korean with English code-switching (e.g., "이거 deploy 했어"), Whisper-large typically keeps the English words in Latin script, which is exactly what you want for searchable transcripts.
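As a dependency-free first pass before reaching for langdetect or fasttext, you can flag code-switched segments by counting letters per Unicode script. This is a swapped-in heuristic, not real language identification: it can separate Hangul from Latin, but it cannot tell Spanish from English. The function names and the 20% share threshold are assumptions chosen for illustration.

```python
import unicodedata
from collections import Counter


def script_counts(text: str) -> dict[str, int]:
    """Count letters per Unicode script, taking the first word of each
    character name (e.g. "HANGUL SYLLABLE I" -> "HANGUL")."""
    counts = Counter()
    for char in unicodedata.normalize("NFC", text):
        if char.isalpha():
            name = unicodedata.name(char, "")
            if name:
                counts[name.split()[0]] += 1
    return dict(counts)


def is_code_switched(text: str, min_share: float = 0.2) -> bool:
    """True when at least two scripts each cover min_share of the letters."""
    counts = script_counts(text)
    total = sum(counts.values())
    if total == 0:
        return False
    significant = [n for n in counts.values() if n / total >= min_share]
    return len(significant) >= 2


print(script_counts("이거 deploy 했어"))  # {'HANGUL': 4, 'LATIN': 6}
```

Note that the Korean example counts more Latin letters than Hangul syllables, which is why the sketch reports shares per script instead of picking a single "dominant" one.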

How do I translate a YouTube video transcript into another language?

Two options, depending on whether you want raw translation or production-quality:

Option 1: Whisper's built-in translate mode. Set task: translate and Whisper outputs English regardless of source language. Accuracy is decent for major languages but rough for Tier 3/4.

Option 2: Transcribe first, then translate with a dedicated model. Run the actor in default transcribe mode to get the source-language SRT, then pipe segments through DeepL ($0.0025 per 1,000 chars) or GPT-4o-mini ($0.15/1M input tokens). This costs pennies for a 10-minute video and gives you SRT files in any target language with timestamps preserved.

Pseudo-pipeline:

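import requests

# Placeholder: the dataset "items" URL for your finished run.
dataset_url = "https://api.apify.com/v2/datasets/YOUR_DATASET_ID/items?token=YOUR_TOKEN"

# 1. Get transcript from Apify dataset. The items endpoint returns a JSON list;
#    this assumes one item per video, shaped like the JSON output shown earlier.
response = requests.get(dataset_url, timeout=60)
response.raise_for_status()
transcript = response.json()[0]

# 2. Translate each segment. translate() is a helper you supply,
#    e.g. a thin wrapper around DeepL or GPT-4o-mini.
for seg in transcript["segments"]:
    seg["text_es"] = translate(seg["text"], target="es")

# 3. Write new SRT preserving start/end timestamps. write_srt() is also a
#    helper you supply; it reuses each segment's start and end unchanged.
write_srt(transcript["segments"], "output.es.srt", text_field="text_es")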

You now have synchronized subtitles in any target language for ~$0.05 of translation cost on top of the $0.12/min transcription.
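The write_srt step from the pseudo-pipeline can be sketched in a few lines. This is one possible implementation, assuming each segment is a dict with start and end in seconds plus a text field, as in the sample JSON; the helper names srt_timestamp and render_srt are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp that SRT expects."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def render_srt(segments: list[dict], text_field: str = "text") -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    cues = []
    for index, seg in enumerate(segments, start=1):
        timing = f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}"
        cues.append(f"{index}\n{timing}\n{seg[text_field].strip()}\n")
    return "\n".join(cues)


def write_srt(segments: list[dict], path: str, text_field: str = "text") -> None:
    """Write the rendered cues to path as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(render_srt(segments, text_field))
```

Keeping the rendering separate from the file write makes the timestamp logic easy to check without touching disk, and the same start and end values flow through untouched, which is what keeps the translated subtitles in sync.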

How much does automatic YouTube transcription cost at scale?

At $0.12 per audio minute, the actor's costs scale as follows:

Volume	Audio minutes	Cost
1 podcast episode (60 min)	60	$7.20
Daily YouTube channel (1 hr/day × 30)	1,800	$216
100-video corpus (avg 12 min)	1,200	$144
1,000-video research dataset (avg 8 min)	8,000	$960

Higher Apify subscription tiers discount this further. For comparison:

AssemblyAI Universal-2: $0.27/min ($0.37/min with diarization)
Deepgram Nova-3: $0.0043/min audio (cheapest, but you build the YouTube downloader yourself)
Rev.ai: $0.02/min (but slower, English-focused)
OpenAI Whisper API: $0.006/min (just transcription — no YouTube download, no SRT/VTT formatting, no summary)

The actor's $0.12/min covers the full pipeline: YouTube audio extraction, transcription, formatting into four output types, and a generated summary. Building the equivalent yourself takes a day of yt-dlp + Whisper + ffmpeg plumbing, plus infra to run it.
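To budget your own workload, the arithmetic behind the cost table is just minutes times rate. A small sketch follows, using the $0.12/min figure quoted in this article and ignoring any subscription discounts; the function name transcription_cost is hypothetical.

```python
from decimal import Decimal

RATE_PER_MINUTE = Decimal("0.12")  # per-minute rate quoted in this article


def transcription_cost(video_count: int, avg_minutes: float) -> Decimal:
    """Estimated bill in USD for video_count videos of avg_minutes each."""
    total_minutes = Decimal(str(avg_minutes)) * video_count
    # Decimal avoids float rounding surprises in money calculations.
    return (total_minutes * RATE_PER_MINUTE).quantize(Decimal("0.01"))


print(transcription_cost(100, 12))   # 144.00
print(transcription_cost(1000, 8))   # 960.00
```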

How do I batch process many YouTube videos?

For a channel-wide scrape, queue URLs to the actor in parallel. Apify runs each as a separate actor run, and you can fan out to 10+ concurrent runs on the starter tier.

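import os
import requests

token = os.environ["APIFY_TOKEN"]
videos = ["https://youtu.be/...", "https://youtu.be/..."]  # your video URLs

run_ids = []
for url in videos:
    # Each POST starts a separate run and returns immediately.
    response = requests.post(
        f"https://api.apify.com/v2/acts/YOUR_ACTOR_ID/runs?token={token}",
        json={"videoUrl": url, "outputFormats": ["json", "srt"]},
        timeout=30,
    )
    response.raise_for_status()
    run_ids.append(response.json()["data"]["id"])  # keep IDs for polling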

Then poll /v2/acts/YOUR_ACTOR_ID/runs?status=SUCCEEDED and pull each dataset. For a 100-video channel with average 10-minute videos, expect total wall-clock time of 30–60 minutes and a bill around $120.
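One way to do the polling is per run rather than through the list endpoint: check each run's record until it reaches a terminal status, then fetch the dataset of every run that succeeded. The sketch below uses only the standard library; the helper names and the 30-second interval are assumptions, and it expects a list of run IDs that you recorded when starting the runs.

```python
import json
import time
import urllib.request

API_BASE = "https://api.apify.com/v2"
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}


def api_get(path: str, token: str):
    """GET an Apify API path and return the decoded JSON body."""
    request = urllib.request.Request(
        f"{API_BASE}{path}", headers={"Authorization": f"Bearer {token}"}
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def split_runs(runs: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate runs that have finished from runs still in progress."""
    done = [run for run in runs if run["status"] in TERMINAL_STATUSES]
    pending = [run for run in runs if run["status"] not in TERMINAL_STATUSES]
    return done, pending


def collect_transcripts(run_ids: list[str], token: str,
                        poll_seconds: int = 30) -> dict:
    """Poll until every run is terminal; return {run_id: dataset items}
    for the runs that succeeded. Failed runs are simply left out."""
    results, pending_ids = {}, list(run_ids)
    while pending_ids:
        runs = [api_get(f"/actor-runs/{rid}", token)["data"] for rid in pending_ids]
        done, pending = split_runs(runs)
        for run in done:
            if run["status"] == "SUCCEEDED":
                results[run["id"]] = api_get(
                    f"/datasets/{run['defaultDatasetId']}/items", token
                )
        pending_ids = [run["id"] for run in pending]
        if pending_ids:
            time.sleep(poll_seconds)
    return results
```

Comparing the returned keys against the original run IDs tells you which videos need a retry.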

FAQ

Q: Does this actor work on YouTube Shorts and live streams? Shorts work identically to regular videos — just pass the short URL. Live streams need to be finished and archived first; you can't transcribe an active livestream because there's no complete audio track to download.

Q: Can I transcribe a YouTube video without captions enabled by the creator? Yes. The actor downloads the audio track directly and runs Whisper on it, so it doesn't depend on whether the creator enabled captions. This is the main reason to use Whisper-based transcription over scraping YouTube's caption tracks.

Q: How accurate is Whisper for languages like Hindi or Arabic compared to English? Word error rate for Hindi typically lands at 18–28% with Whisper-large-v3, vs. 4–8% for clean English. Arabic is similar. For both, results are noticeably better than YouTube's auto-captions, but expect to review technical terms and proper nouns.

Q: Does the output include speaker diarization (who said what)? The default actor output doesn't include speaker labels — Whisper itself doesn't do diarization. For interviews and podcasts where you need speaker tags, post-process the audio with pyannote.audio or use a diarization-capable service like AssemblyAI for that specific use case.

Q: Can I get word-level timestamps instead of segment-level? Yes — Faster-Whisper supports word-level timestamps via the word_timestamps: true option. This adds a few percent to processing time but gives you per-word start/end fields inside each segment, which is essential for karaoke-style captions or precise video clipping based on keywords.
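As an illustration of keyword-based clipping, the sketch below scans word-level output for a keyword and returns the matching time ranges. It assumes each segment carries a words list whose entries have word, start, and end fields (the shape Faster-Whisper uses for its word objects); whether the actor's JSON uses exactly these keys is an assumption, and keyword_hits is a hypothetical helper name.

```python
def keyword_hits(segments: list[dict], keyword: str) -> list[tuple[float, float]]:
    """Return (start, end) pairs for every word matching keyword,
    ignoring case, surrounding whitespace, and common punctuation."""
    target = keyword.casefold()
    hits = []
    for segment in segments:
        # Segments without word-level data are skipped silently.
        for word in segment.get("words", []):
            token = word["word"].strip().strip(".,;:!?\"'()").casefold()
            if token == target:
                hits.append((word["start"], word["end"]))
    return hits


segments = [{
    "start": 0.0, "end": 1.8,
    "words": [
        {"word": " Hoy", "start": 0.0, "end": 0.4},
        {"word": " hablamos", "start": 0.4, "end": 1.0},
        {"word": " de", "start": 1.0, "end": 1.2},
        {"word": " IA.", "start": 1.2, "end": 1.8},
    ],
}]
print(keyword_hits(segments, "ia"))  # [(1.2, 1.8)]
```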