You can pull YouTube transcripts programmatically without touching the official YouTube Data API by scraping the auto-generated caption tracks directly, or by running the video's audio through a speech-to-text model like Whisper. The fastest path is the youtube-transcript-api Python package for videos that already have captions, and a Whisper-based actor for videos that don't. Neither approach requires a Google Cloud project, OAuth, or quota management.
Quick Answer
To use a YouTube transcript API without the official YouTube Data API, install the youtube-transcript-api Python package (pip install youtube-transcript-api) and call YouTubeTranscriptApi.get_transcript(video_id) — it scrapes the public caption XML endpoint directly. For videos without captions, or when you need timestamped SRT/VTT output, run the audio through Faster-Whisper via an Apify actor like YouTube Transcriber & Subtitles (JSON/SRT/VTT) at $0.12 per minute. Both skip Google API quotas, OAuth tokens, and billing setup entirely. Rate limits are the main risk with the scraping approach — proxies or a managed actor solve that.
Why avoid the official YouTube Data API?
The YouTube Data API v3 has three friction points that push developers to alternatives:
- Quotas. You get 10,000 units per day by default. A captions.download call costs 200 units, capping you at 50 downloads/day without a quota increase request.
- OAuth required for captions. Unlike most read operations, captions.download requires OAuth 2.0 with the video owner's consent. You cannot download captions for arbitrary third-party videos through the official API.
- No audio access. The official API does not expose raw audio or video streams, so you cannot generate your own transcripts from videos that lack captions.
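The 50-downloads ceiling is plain integer division of the daily quota by the per-call cost. A minimal sketch using only the two figures quoted above (the constant and function names are illustrative, not part of any Google client library):

```python
# Figures quoted in the list above; the names are illustrative only.
DAILY_QUOTA_UNITS = 10_000
CAPTIONS_DOWNLOAD_COST = 200


def max_daily_downloads(quota: int = DAILY_QUOTA_UNITS,
                        cost_per_call: int = CAPTIONS_DOWNLOAD_COST) -> int:
    """Whole number of captions.download calls that fit into one day's quota."""
    return quota // cost_per_call


print(max_daily_downloads())
```

Any other API call made on the same day draws from the same quota, so the practical ceiling is lower.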
The youtube-transcript-api package on PyPI pulls in hundreds of thousands of downloads per month precisely because it bypasses all three limits for the common case: reading publicly available captions.
How does youtube-transcript-api work under the hood?
The library hits the same internal endpoint (https://www.youtube.com/api/timedtext) that the YouTube web player uses to render captions. Here's the minimum working example:
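# Applies to pre-1.0 releases of the library. Newer releases expose an
# instance method instead: YouTubeTranscriptApi().fetch(video_id)
from youtube_transcript_api import YouTubeTranscriptApi
# The ID is the v= parameter of the watch URL
video_id = "dQw4w9WgXcQ"
# Returns a list of dicts with "text", "start", and "duration" keys
transcript = YouTubeTranscriptApi.get_transcript(video_id)
for entry in transcript:
    print(f"[{entry['start']:.2f}s] {entry['text']}")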
Output shape:
[
  {"text": "We're no strangers to love", "start": 18.8, "duration": 4.2},
  {"text": "You know the rules and so do I", "start": 23.0, "duration": 3.5}
]
You get text plus timestamps, no API key required. The library also supports language selection (languages=['en', 'de']), translated tracks, and listing all available caption tracks via list_transcripts().
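The library expects a bare video ID, while real-world input is usually a full URL. Here is a small standard-library sketch for pulling the ID out of a few common URL shapes. The helper name extract_video_id and the set of handled shapes (watch, youtu.be, embed, shorts, live) are assumptions for illustration; it is not part of youtube-transcript-api, and it expects URLs that include the scheme.

```python
from urllib.parse import parse_qs, urlparse


def extract_video_id(url_or_id: str) -> str:
    """Return the video ID from a YouTube URL, or the input if it is already an ID."""
    candidate = url_or_id.strip()
    if "/" not in candidate and "?" not in candidate:
        return candidate  # already looks like a bare ID

    parsed = urlparse(candidate)
    host = (parsed.hostname or "").lower()

    # Short links: https://youtu.be/<id>
    if host == "youtu.be":
        return parsed.path.lstrip("/").split("/")[0]

    if host == "youtube.com" or host.endswith(".youtube.com"):
        # Standard watch links: https://www.youtube.com/watch?v=<id>
        if parsed.path == "/watch":
            ids = parse_qs(parsed.query).get("v", [])
            if ids:
                return ids[0]
        # Path-based links: /embed/<id>, /shorts/<id>, /live/<id>
        for prefix in ("/embed/", "/shorts/", "/live/"):
            if parsed.path.startswith(prefix):
                return parsed.path[len(prefix):].split("/")[0]

    raise ValueError(f"Could not find a video ID in {url_or_id!r}")
```

Raising on unrecognized input, rather than returning an empty string, keeps a malformed URL from turning into a confusing "no transcript found" error further down the pipeline.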
Limits:
- Works only on videos with captions (auto-generated or manual).
- YouTube rate-limits the endpoint per IP — expect 429s after a few hundred rapid requests.
- Breaks whenever YouTube changes its internal endpoint (happens once or twice a year).
What if the video has no captions?
Roughly 10–15% of YouTube videos lack caption tracks entirely — older uploads, music videos, non-English content in smaller languages, and videos where the uploader disabled captions. For those, you need speech-to-text.
The two-step pipeline:
- Download the audio. yt-dlp extracts audio in seconds: yt-dlp -x --audio-format mp3 <url>.
- Transcribe it. Run it through OpenAI Whisper, Faster-Whisper, AssemblyAI, Deepgram, or similar.
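The two steps above can be sketched in Python roughly as follows. This assumes yt-dlp is on the PATH and that the faster-whisper package and a CUDA GPU are available. The function names, the audio output directory, and the float16 compute type are illustrative choices, and the return shape simply mirrors the text/start/duration entries shown earlier.

```python
import subprocess
from pathlib import Path


def build_ytdlp_command(url: str, out_dir: str = "audio") -> list[str]:
    """Build the yt-dlp call from step one, plus an output filename template."""
    return [
        "yt-dlp", "-x", "--audio-format", "mp3",
        "-o", str(Path(out_dir) / "%(id)s.%(ext)s"),  # e.g. audio/<video-id>.mp3
        url,
    ]


def download_audio(url: str, out_dir: str = "audio") -> None:
    """Run yt-dlp; raises CalledProcessError if the download fails."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(build_ytdlp_command(url, out_dir), check=True)


def transcribe(audio_path: str, model_size: str = "large-v3") -> list[dict]:
    """Transcribe one audio file with Faster-Whisper into caption-style dicts."""
    # Imported lazily so the download helpers work without the package installed.
    from faster_whisper import WhisperModel

    # Assumption: a CUDA GPU is present; use device="cpu" otherwise.
    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, _info = model.transcribe(audio_path)
    return [
        {"text": seg.text.strip(), "start": seg.start, "duration": seg.end - seg.start}
        for seg in segments
    ]
```

Keeping the command builder separate from the subprocess call makes the yt-dlp arguments easy to inspect or log before anything is downloaded.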
Self-hosted cost example for a 10-minute video:
- GPU time on a consumer RTX 3090 with Faster-Whisper large-v3: ~40 seconds.
- Electricity + amortized hardware: roughly $0.01.
- Engineering time to maintain the pipeline: significant.
Cloud STT cost for the same 10 minutes:
- AssemblyAI: $0.37 ($0.037/min)
- Deepgram Nova-2: $0.043 ($0.0043/min × 10 minutes)
- OpenAI Whisper API: $0.06 ($0.006/min)
None of these include the yt-dlp download step or SRT/VTT formatting — you still have to wire that up yourself.
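To give a sense of what that formatting step involves, here is a minimal sketch that turns caption-style entries into SRT text. It assumes the same text/start/duration entry shape shown earlier in this article. The function names are illustrative, and a production version would also need to handle overlapping cues and line wrapping.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp used in SRT files."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(entries: list[dict]) -> str:
    """Render entries as numbered SRT cues separated by blank lines."""
    blocks = []
    for index, entry in enumerate(entries, start=1):
        start = entry["start"]
        end = start + entry["duration"]
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(start)} --> {srt_timestamp(end)}\n"
            f"{entry['text']}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    sample = [
        {"text": "We're no strangers to love", "start": 18.8, "duration": 4.2},
        {"text": "You know the rules and so do I", "start": 23.0, "duration": 3.5},
    ]
    print(to_srt(sample))
```

WebVTT output is close to this: it adds a WEBVTT header line and uses a period instead of a comma before the milliseconds.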
How do I get YouTube transcripts at scale without maintaining a pipeline?
If you're transcribing more than a few videos a week, or you need SRT/VTT output with proper timestamp formatting, running a managed actor is cheaper than building it.
YouTube Transcriber & Subtitles (JSON/SRT/VTT) handles the whole chain — URL in, transcript out — using Faster-Whisper. You pass a list of YouTube URLs and get back:
- transcript.txt: plain text
- transcript.json: word-level timestamps
- subtitles.srt: standard subtitle file
- subtitles.vtt: WebVTT for HTML5 <video> tags
- A summary entry in the Dataset
Pricing: $0.12 per minute of audio ($120 per 1,000 minutes), pay-per-event. A 10-minute video costs $1.20. A 60-minute podcast costs $7.20. Higher Apify subscription tiers discount this further.
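For budgeting a batch, the per-minute rate can be wrapped in a small helper. This sketch uses only the $0.12 rate quoted above. The names are illustrative, subscription-tier discounts are not modeled, and Decimal is used so the cents do not pick up floating-point noise.

```python
from decimal import Decimal

ACTOR_RATE_PER_MINUTE = Decimal("0.12")  # rate quoted above, before any tier discount


def transcription_cost(minutes: int, rate: Decimal = ACTOR_RATE_PER_MINUTE) -> Decimal:
    """Cost in dollars for a given number of audio minutes, rounded to cents."""
    return (Decimal(minutes) * rate).quantize(Decimal("0.01"))


for minutes in (10, 60, 1000):
    print(f"{minutes:>5} min -> ${transcription_cost(minutes)}")
```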
Minimal call via the Apify API:
curl -X POST "https://api.apify.com/v2/acts/<actor-id>/runs?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "videoUrls": ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"],
    "outputFormats": ["json", "srt", "vtt"]
  }'
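The same request can be built from Python with only the standard library. This is a sketch of the curl call above, not an official client: the function name is illustrative, the actor ID remains a placeholder you must supply, and the videoUrls and outputFormats fields are taken from the example payload as-is.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_run_request(actor_id: str, token: str, video_urls: list[str],
                      output_formats: list[str]) -> urllib.request.Request:
    """Build the POST request that starts an actor run (mirrors the curl call)."""
    url = f"https://api.apify.com/v2/acts/{actor_id}/runs?{urlencode({'token': token})}"
    payload = {"videoUrls": video_urls, "outputFormats": output_formats}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending is one more call; it is left commented out so the sketch has no side effects.
# request = build_run_request("<actor-id>", "<your-token>",
#                             ["https://www.youtube.com/watch?v=dQw4w9WgXcQ"],
#                             ["json", "srt", "vtt"])
# with urllib.request.urlopen(request) as response:
#     run = json.load(response)
```

Passing the token through urlencode avoids a broken URL if the token ever contains reserved characters.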
When the scraping approach with youtube-transcript-api breaks (YouTube changes the endpoint, or you hit rate limits on 5,000 videos), the actor keeps working because it processes the audio stream, not the caption API.
How do I avoid getting rate-limited by YouTube?
If you stick with youtube-transcript-api at scale, YouTube will IP-ban your scraper once you exceed ~200–500 requests from a single IP in a short window. Three mitigations:
- Residential proxy rotation. Pass a proxy to the library: YouTubeTranscriptApi.get_transcript(video_id, proxies={'https': 'http://user:pass@proxy:8080'}). Cost: $3–15 per GB from Bright Data, Smartproxy, or Oxylabs.
- Backoff and caching. Cache transcripts aggressively, since they don't change. Use time.sleep(random.uniform(1, 3)) between requests.
- Delegate to a managed actor. Apify actors run on rotating infrastructure and handle proxies for you, which is the main reason the pay-per-minute model works out cheaper than building proxy rotation yourself.
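The backoff-and-caching mitigation can be sketched as a thin wrapper around whatever fetch function you use. Everything here is an assumption for illustration: the function name, the on-disk JSON cache layout, and the exponential delay schedule (the fixed 1–3 second sleep mentioned above works as well). The fetch argument is any callable that takes a video ID and returns JSON-serializable transcript data.

```python
import json
import random
import time
from pathlib import Path
from typing import Callable


def fetch_with_cache(video_id: str,
                     fetch: Callable[[str], list],
                     cache_dir: str = "transcript_cache",
                     max_attempts: int = 4,
                     base_delay: float = 1.0,
                     sleep: Callable[[float], None] = time.sleep) -> list:
    """Return a cached transcript if present; otherwise fetch with retries and cache it."""
    cache_file = Path(cache_dir) / f"{video_id}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text(encoding="utf-8"))

    for attempt in range(max_attempts):
        try:
            transcript = fetch(video_id)
        except Exception:
            # Broad catch for brevity; real code should catch only the
            # rate-limit or network errors raised by the fetcher in use.
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff with jitter: about 1s, 2s, 4s, ...
            sleep(base_delay * 2 ** attempt + random.uniform(0, 1))
        else:
            cache_file.parent.mkdir(parents=True, exist_ok=True)
            cache_file.write_text(json.dumps(transcript), encoding="utf-8")
            return transcript

    raise ValueError("max_attempts must be at least 1")
```

Checking the cache before any network call means a rerun of the same batch sends requests only for videos that failed or were never fetched.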
For reference, a single residential proxy plan starts at $50/month. At the actor's $0.12/min rate, that's equivalent to transcribing ~7 hours of video — and you skip all the DevOps.
Is scraping YouTube transcripts legal?
Gray area. YouTube's Terms of Service prohibit accessing the service "through any means other than our publicly supported interface." However:
- Captions are publicly served to every browser visiting a video page.
- No US court has ruled that scraping publicly displayed text violates the CFAA post-hiQ Labs v. LinkedIn (2022).
- Transcripts themselves may be copyrighted by the uploader — using them commercially (training models, republishing content) is a separate issue.
Practical guidance: personal research and internal analytics are low-risk. Republishing transcripts verbatim or building a commercial product on scraped data is higher risk. When in doubt, use videos under Creative Commons licenses (YouTube lets creators opt in) or get permission.
Which approach should I pick?
| Scenario | Best tool | Cost |
|---|---|---|
| 10 videos with captions, one-off script | youtube-transcript-api | Free |
| 1,000+ videos, captions exist, own infra | youtube-transcript-api + residential proxies | ~$50–200/mo |
| Videos without captions, self-hosted | yt-dlp + Faster-Whisper on GPU | Hardware + time |
| Videos without captions, managed | Apify YouTube Transcriber actor | $0.12/min |
| Need SRT/VTT + JSON out of the box | Apify actor | $0.12/min |
| Official API with full compliance | YouTube Data API v3 | Free but OAuth-gated |
Most teams end up mixing: youtube-transcript-api for the roughly 85–90% of videos with captions, and a Whisper-based actor for the remaining 10–15%.
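That mixed setup amounts to a try-captions-first router. A minimal sketch, assuming you supply two callables of your own: one that fetches existing captions and raises when none are available, and one that runs speech-to-text. The names below are hypothetical and not tied to any specific library.

```python
from typing import Callable

Transcript = list[dict]


def get_transcript_with_fallback(
    video_id: str,
    fetch_captions: Callable[[str], Transcript],
    transcribe_audio: Callable[[str], Transcript],
) -> tuple[str, Transcript]:
    """Try the free caption path first; fall back to speech-to-text on failure.

    Returns (source, transcript) so callers can track which path was used.
    """
    try:
        return "captions", fetch_captions(video_id)
    except Exception:
        # Broad catch for brevity; in practice, catch the specific
        # "no captions available" error raised by your caption fetcher.
        return "whisper", transcribe_audio(video_id)
```

Returning the source label alongside the transcript makes it straightforward to measure what share of your videos actually needs the paid path.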
FAQ
Q: Does youtube-transcript-api require an API key?
No. It scrapes YouTube's public caption endpoint directly, so there's no key, no OAuth, and no billing account. The only cost is the risk of rate limits if you send too many requests from the same IP.
Q: Can I get auto-generated captions in languages other than English?
Yes. Pass languages=['es', 'fr', 'de'] to get_transcript() and it returns the first available match. You can also call list_transcripts(video_id) to enumerate every available track, including translations YouTube generates on the fly.
Q: What's the accuracy difference between YouTube's auto-captions and Whisper?
On clean English speech the gap is small — both hit 90–95% word accuracy. Whisper large-v3 pulls ahead on accented speech, multiple speakers, and non-English audio, where YouTube auto-captions often drop to 70–80%. For music or heavy background noise, neither is reliable.
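To check figures like these on your own videos, you can compare a transcript against a hand-corrected reference. The sketch below assumes "word accuracy" means one minus the word error rate (WER), which is a common convention but not something this article's figures are confirmed to use. It computes WER as word-level edit distance divided by the reference length, with only lowercasing and whitespace splitting as normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous DP row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    wer = word_error_rate("you know the rules and so do I",
                          "you know the rule and so do I")
    print(f"WER: {wer:.3f}, word accuracy: {1 - wer:.1%}")
```

Punctuation is left attached to words here, so stripping it from both strings first gives a fairer comparison between systems that punctuate differently.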
Q: How much does it cost to transcribe a 1-hour video with the Apify actor?
At $0.12 per minute, a 60-minute video costs $7.20. A 10-minute tutorial costs $1.20. The YouTube Transcriber actor bills per event, so you only pay for actual audio processed, with no subscription minimum.
Q: Will youtube-transcript-api break if YouTube updates its site?
Occasionally, yes. YouTube has changed the timedtext endpoint response format a few times, which required library updates within a day or two. Check the package's GitHub issues if you suddenly see parsing errors — it's almost always an upstream change, and a pip install -U youtube-transcript-api fixes it.