Download YouTube Subtitles in Bulk (SRT/VTT)

To download YouTube subtitles in bulk as SRT or VTT files, run an Apify actor that accepts a list of video URLs and outputs timestamped subtitle files for each one. The YouTube Transcriber & Subtitles (JSON/SRT/VTT) actor does this in a single run — you paste up to thousands of YouTube URLs, pick your format, and download the resulting .srt or .vtt files from the dataset. Cost is $0.12 per minute of audio, so a 10-video batch averaging 5 minutes each runs about $6.

Quick Answer

To download YouTube subtitles in bulk with SRT or VTT output, use the YouTube Transcriber & Subtitles actor on Apify. Feed it a list of video URLs, select srt or vtt as the output format, and the actor transcribes each video using Faster-Whisper and saves timestamped subtitle files to the dataset. You don't need YouTube Data API credentials, ffmpeg, or local Python setup. Pricing is pay-per-event at $0.12 per minute of transcribed audio, with discounts on higher subscription tiers.

Why can't you just bulk-download existing YouTube captions?

YouTube technically exposes captions through its watch page and the Data API, but bulk extraction breaks down fast:

Auto-generated captions are not delivered as clean SRT, so they usually need post-processing before the timing is usable.
YouTube Data API v3 gives you a default quota of 10,000 units per day, and captions.download costs 200 units per call. That's about 50 videos before the daily quota runs out, and only if you own the channel.
yt-dlp works locally but requires you to manage cookies, proxies, and ffmpeg, and YouTube actively blocks repeated IP hits.
Third-party "download SRT" websites handle one video at a time and frequently fail on long videos or non-English audio.

For batches of 20+ videos — or any video without creator-provided captions — you need a transcription pipeline, not just a download tool.

How do you bulk-download YouTube subtitles as SRT files?

Here's the workflow using the YouTube Transcriber & Subtitles actor:

1. Collect your URLs. Export a list of YouTube video URLs into a plain-text file or a JSON array. Channel scrapers can produce this if you start from a channel handle.

2. Configure the actor input. The minimum input looks like this:

{
  "videoUrls": [
    "https://www.youtube.com/watch?v=abc123",
    "https://www.youtube.com/watch?v=def456",
    "https://www.youtube.com/watch?v=ghi789"
  ],
  "outputFormats": ["srt"],
  "language": "auto"
}

3. Run it. Each video is processed independently. A 5-minute video typically completes in 30–90 seconds depending on Whisper model size.

4. Download from the dataset. The actor writes one record per video. Each record contains the SRT content as a string and a link to the stored file. You can pull all records via the Apify CLI:

apify dataset:get <DATASET_ID> --format=json > results.json

Then loop over the records and write each srt field to disk.
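That loop can be sketched in a few lines of Python. This is a rough example, not part of the actor: it assumes results.json is a JSON array of records shaped like the ones described in this article (with videoUrl and srt fields). The helper names and the choice to name each file after the video ID are illustrative.

```python
import json
from pathlib import Path
from urllib.parse import parse_qs, urlparse


def video_id(url: str, fallback: str) -> str:
    """Return the `v` query parameter of a watch URL, or the fallback name."""
    ids = parse_qs(urlparse(url).query).get("v", [])
    return ids[0] if ids else fallback


def write_srt_files(records: list[dict], out_dir: str) -> list[Path]:
    """Write each record's `srt` string to <out_dir>/<video id>.srt."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for index, record in enumerate(records):
        srt = record.get("srt")
        if not srt:
            # Assumption: records without subtitle text (e.g. failed videos)
            # simply have an empty or missing `srt` field.
            continue
        name = video_id(record.get("videoUrl", ""), fallback=f"video_{index}")
        path = target / f"{name}.srt"
        path.write_text(srt, encoding="utf-8")
        written.append(path)
    return written


def load_records(results_path: str) -> list[dict]:
    """Load the exported dataset file produced by the CLI step."""
    return json.loads(Path(results_path).read_text(encoding="utf-8"))
```

To use it, call write_srt_files(load_records("results.json"), "subtitles") and you should end up with one .srt file per successfully transcribed video.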

How is VTT output different from SRT?

Both SRT and VTT are timestamped subtitle formats, but they differ in a few practical ways:

Feature	SRT	VTT
File extension	`.srt`	`.vtt`
Timestamp format	`00:00:01,000` (comma)	`00:00:01.000` (dot)
Header required	No	Yes (`WEBVTT`)
Styling/positioning	No	Yes (CSS-like cues)
HTML5 `<track>` support	Needs conversion	Native
Best for	Video editors (Premiere, DaVinci, Final Cut)	Web players, HLS streaming

If you're feeding subtitles into a web player, pick vtt. If you're importing into editing software or running them through a translation tool, srt is more universally supported. The actor lets you request both at once:

{ "outputFormats": ["srt", "vtt", "json"] }

You'll get all three files per video in the same dataset record — useful if you're not sure which downstream tool you'll use.
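If you only requested srt and later need vtt, the two differences that matter in the table above (the WEBVTT header and the dot in timestamps) are easy to apply yourself. Here's a minimal sketch; srt_to_vtt is a hypothetical helper, not something the actor ships. It leaves the SRT cue numbers in place, which WebVTT accepts as cue identifiers, and it does not add any styling or positioning.

```python
import re

# Matches an SRT timestamp such as 00:00:01,000 and captures both halves.
_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Convert SRT text to a minimal WebVTT document."""
    lines = []
    for line in srt_text.replace("\r\n", "\n").strip().split("\n"):
        if "-->" in line:
            # Only timing lines get the comma-to-dot swap, so commas in
            # the spoken text are left alone.
            line = _SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    # WebVTT requires the WEBVTT header followed by a blank line.
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"
```

Requesting both formats from the actor is still the simpler route; a conversion like this is mainly useful for files you already have on disk.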

How much does it cost to transcribe 100 YouTube videos?

Pricing is $0.12 per minute of audio processed. Real-world examples:

100 TikTok-length videos (avg 1 min): ~$12
100 tutorial videos (avg 8 min): ~$96
100 podcast episodes (avg 45 min): ~$540
A single 2-hour conference recording: ~$14.40

This is pay-per-event, so you only pay for what runs. If a video fails (private, deleted, age-restricted), you don't pay for it. Higher Apify subscription tiers, such as the Team plan, discount the per-minute rate further.
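To estimate your own batch before running it, the arithmetic is just videos × average minutes × rate. The sketch below assumes billing is linear in audio minutes at the $0.12 rate quoted here, with no per-video rounding and no plan discount. The function name is illustrative.

```python
from decimal import Decimal

# Per-minute rate quoted in this article; adjust if your plan is discounted.
RATE_PER_MINUTE = Decimal("0.12")


def estimate_cost(video_count: int, avg_minutes: float) -> Decimal:
    """Estimated cost in USD for a batch, before any plan discount."""
    total_minutes = Decimal(video_count) * Decimal(str(avg_minutes))
    return (total_minutes * RATE_PER_MINUTE).quantize(Decimal("0.01"))


for label, count, minutes in [
    ("short clips", 100, 1),
    ("tutorials", 100, 8),
    ("podcasts", 100, 45),
    ("conference recording", 1, 120),
]:
    print(f"{label}: ${estimate_cost(count, minutes)}")
```

Decimal is used instead of float so the totals stay exact to the cent.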

For comparison, OpenAI's Whisper API runs at $0.006 per minute but requires you to handle YouTube audio extraction, retries, chunking for audio files over 25 MB, and SRT formatting yourself. Once you factor in engineering time and proxy costs, the gap closes quickly for batches under a few thousand videos.

How do you handle non-English videos?

The actor uses Faster-Whisper, which supports 90+ languages. Two relevant input options:

language: "auto" makes Whisper detect the spoken language and output subtitles in that same language.
language: "en" (or any ISO code) forces the model to assume a specific language, which skips the language-detection step and avoids misdetection on short clips.

The actor does not currently translate — if you need English subtitles for a Spanish video, run the transcription first, then pass the SRT through a separate translation step. Keeping these stages separate keeps your raw transcripts auditable.

What does the output dataset look like?

Each record in the Apify dataset contains:

{
  "videoUrl": "https://www.youtube.com/watch?v=abc123",
  "videoTitle": "How to deploy on Vercel",
  "durationSeconds": 412,
  "language": "en",
  "transcript": "Full plain text transcript here...",
  "srt": "1\n00:00:00,000 --> 00:00:03,200\nWelcome back...",
  "vtt": "WEBVTT\n\n00:00:00.000 --> 00:00:03.200\nWelcome back...",
  "json": [
    { "start": 0.0, "end": 3.2, "text": "Welcome back" }
  ],
  "summary": "A short summary of the video content."
}

The included summary field is useful if you're processing dozens of videos and want a quick scan before deciding which ones to read in full. The json field with per-segment timestamps is what you want if you're building a search index or feeding segments into an LLM with citations.
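As a rough illustration of the citation use case, the sketch below turns each entry in the json field into a snippet with a timestamped link back to the video. It assumes records look like the example above. segment_citations is a hypothetical helper, and the t=<seconds>s link format is an assumption about how you want to deep-link into YouTube.

```python
def segment_citations(record: dict) -> list[dict]:
    """Turn each timed segment into a text snippet plus a deep link."""
    base_url = record["videoUrl"]
    # Watch URLs already carry ?v=..., so the timestamp is appended with &.
    separator = "&" if "?" in base_url else "?"
    citations = []
    for segment in record.get("json", []):
        start_seconds = int(segment["start"])  # floor to whole seconds
        citations.append({
            "text": segment["text"],
            "start": segment["start"],
            "end": segment["end"],
            "source": f"{base_url}{separator}t={start_seconds}s",
        })
    return citations
```

Each returned dict can go straight into a search index, or be pasted into an LLM prompt so the model can cite the exact moment a claim was made.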

Can you automate this on a schedule?

Yes. Two patterns work well:

Pattern 1: Channel monitoring. Use Apify Schedules to run a channel-scraper actor daily, pipe new video URLs into the transcriber actor via a webhook, and write results to S3 or a database. Total cost for a channel that uploads 1 video per day at 10 minutes average: ~$36/month.

Pattern 2: Batch jobs. Drop a CSV of URLs into a storage bucket, trigger the actor via the Apify API, and post-process results. This is what most subtitle-editing teams do — they accumulate URLs through the week and batch-run on Friday.

API trigger example:

curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/runs?token=<TOKEN>" \
  -H "Content-Type: application/json" \
  -d '{"videoUrls": [...], "outputFormats": ["srt"]}'
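If you'd rather trigger Pattern 2 from Python than from curl, something like the following could work. It's a sketch under assumptions: the CSV has a header row with a column named url, the helper names are made up for this example, and the endpoint and input fields simply mirror the curl call above. The request is built but not sent, so you can inspect it first.

```python
import csv
import io
import json
import urllib.request
from urllib.parse import quote


def urls_from_csv(csv_text: str, column: str = "url") -> list[str]:
    """Read video URLs from one column of CSV text, skipping blank cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row[column].strip()
        for row in reader
        if (row.get(column) or "").strip()
    ]


def build_run_request(
    actor_id: str, token: str, video_urls: list[str]
) -> urllib.request.Request:
    """Build (but do not send) the POST request that starts an actor run."""
    # Same input shape as the curl example: URLs plus the desired formats.
    payload = {"videoUrls": video_urls, "outputFormats": ["srt"]}
    endpoint = (
        "https://api.apify.com/v2/acts/"
        f"{quote(actor_id, safe='~')}/runs?token={quote(token)}"
    )
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually start the run, pass the returned request to urllib.request.urlopen and read the JSON response for the run and dataset IDs.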

FAQ

Q: Can I download subtitles from age-restricted or private YouTube videos? Private videos cannot be transcribed since the audio isn't accessible without authentication. Age-restricted videos sometimes fail depending on YouTube's current gating — expect a ~10-15% failure rate on heavily age-gated content. The actor will mark failed videos in the dataset rather than charging for them.

Q: How accurate is the transcription compared to YouTube's auto-captions? Faster-Whisper generally outperforms YouTube auto-captions on technical jargon, accents, and overlapping speech. On clean English audio, word error rate is typically 4–8% versus YouTube's 8–15%. For music or heavily-accented content, expect both systems to struggle.

Q: Can I get speaker labels (diarization) in the SRT output? The current actor configuration produces SRT files with no speaker labels (no Speaker 1 / Speaker 2). If you need diarization, post-process the JSON output with a separate speaker-segmentation tool like pyannote. The per-segment start and end times in the JSON output give you the ranges to match against the diarization timestamps.
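One simple way to do that matching is to give each transcript segment the speaker whose turn overlaps it the most. The sketch below assumes you have already exported the diarization result as plain (start, end, speaker) tuples. That tuple format and the function name are assumptions for illustration, not pyannote's own API.

```python
def assign_speakers(
    segments: list[dict], turns: list[tuple[float, float, str]]
) -> list[dict]:
    """Label each segment with the speaker whose turn overlaps it most."""
    labeled = []
    for segment in segments:
        best_speaker, best_overlap = None, 0.0
        for turn_start, turn_end, speaker in turns:
            # Length of the shared time range; negative means no overlap.
            overlap = min(segment["end"], turn_end) - max(segment["start"], turn_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # Segments that overlap no turn keep speaker=None.
        labeled.append({**segment, "speaker": best_speaker})
    return labeled
```

From there you can prefix each SRT cue with the speaker label when you write the file.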

Q: What's the largest batch I can run in one job? There's no hard cap from the actor side, but practical batches stay under 500 videos per run to keep dataset retrieval manageable. For 1,000+ videos, split into parallel runs or use Apify's run queue. Total wall-clock time for 500 videos averaging 5 minutes is roughly 60–90 minutes depending on concurrency.
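Splitting a long URL list into runs of at most 500 is a one-liner. Here's a small sketch; chunk_urls is a hypothetical helper, and 500 is just the practical ceiling mentioned above.

```python
def chunk_urls(urls: list[str], batch_size: int = 500) -> list[list[str]]:
    """Split a URL list into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]
```

Each batch then becomes the videoUrls array for one actor run.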

Q: Do the SRT files need post-processing before importing into Premiere or DaVinci Resolve? No. The SRT output follows the standard format, with a comma before the milliseconds and sequential cue numbering. Drag the file directly into your editor's subtitle track. The only common adjustment is splitting long subtitle lines (more than 42 characters per line) if you're targeting broadcast-style readability.
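For the line-length adjustment, Python's standard textwrap module does most of the work. This is a minimal sketch with a hypothetical helper name. It only re-wraps the text of a single cue. If a cue ends up with more lines than your style guide allows, you'd still need to split it into separate cues and re-time them yourself.

```python
import textwrap


def wrap_cue_text(text: str, max_chars: int = 42) -> list[str]:
    """Re-wrap one cue's text so no line is longer than max_chars."""
    normalized = " ".join(text.split())  # collapse existing line breaks
    return textwrap.wrap(normalized, width=max_chars)
```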