YouTube Video to Text: Free & Paid Tools Compared

The fastest way to convert a YouTube video to text is to paste the URL into a free tool like Tactiq.io, YouTubeToTranscript.com, or NoteGPT — all three pull the auto-generated captions in under 10 seconds. For videos without captions, or for batch processing hundreds of videos, run an Apify actor with Faster-Whisper at $0.12 per minute. This guide compares both routes so you pick the right tool for one-off notes versus scaled content repurposing.

Quick Answer

To convert a YouTube video to text for free, paste the video URL into Tactiq.io, YouTubeToTranscript.com, or NoteGPT.cc — they extract YouTube's existing auto-captions in seconds and let you copy plain text. These tools only work when captions exist; for videos without captions or in low-resource languages, you need an actual ASR (automatic speech recognition) pipeline like Whisper. For bulk work — say, transcribing 50+ podcasts, lectures, or competitor videos — the YouTube Transcriber & Subtitles (JSON/SRT/VTT) actor runs Faster-Whisper on Apify and outputs TXT, SRT, VTT, and JSON with timestamps for $0.12/minute.

What's the best free tool to convert YouTube videos to text?

Four free tools dominate this space, and they all work the same way under the hood: they scrape YouTube's auto-generated caption track.

Tactiq.io — Chrome extension that overlays a transcript next to the video player. Best UX, includes a "copy with timestamps" button, and free tier covers 5 AI-summarized transcripts per month (unlimited basic transcripts). Works only inside Chrome.

YouTubeToTranscript.com — Pure web tool, paste URL, get plain text. No login. No timestamps unless you tick a box. Cleanest output for pasting into ChatGPT.

NoteGPT.cc — Adds AI summarization on top of the raw transcript. Free tier is generous (around 10 summaries/day), and it handles long-form video (2+ hour podcasts) better than Tactiq.

Downloader.tools / SaveSubs — Lets you download the .srt subtitle file directly, which is useful if you want timestamps for video editing.

All four share the same limitation: if the video has no captions, you get nothing. About 60% of YouTube videos have decent auto-captions, but channels under ~10K subs, livestream replays, and non-English videos frequently come back blank.

How do I get a transcript from a YouTube video that has no captions?

This is where free tools break down and you need a real speech-to-text model. Three options:

Download the audio with yt-dlp, then run Whisper locally. Free, but requires Python, FFmpeg, a GPU for reasonable speed, and about 3 GB of large-v3 model weights (OpenAI's reference implementation lists roughly 10 GB of VRAM for the large model). A 1-hour video takes 15–25 minutes on an M1 Mac with large-v3.
Use OpenAI's Whisper API. $0.006/minute ($0.36/hour). Cheap, but you still have to download the audio yourself, handle 25 MB file limits (chunk long videos), and write the orchestration code.
Use a hosted Apify actor. The YouTube Transcriber & Subtitles (JSON/SRT/VTT) actor handles download, transcription, and output formatting in one call. At $0.12/minute it costs about 20× the raw Whisper API rate, but you skip writing yt-dlp wrappers, retry logic, ffmpeg chunking, and SRT formatting.
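To get a feel for the chunking step in the Whisper API option, here is a minimal sketch of how you might plan chunk boundaries so each upload stays under the 25 MB limit. It assumes constant-bitrate audio (64 kbps by default) and a 5% safety margin; both values, and the function name plan_chunks, are illustrative choices rather than anything the API prescribes. The actual cutting (with ffmpeg, for example) is left out.

```python
import math

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # the 25 MB per-file limit mentioned above


def plan_chunks(duration_s: float, bitrate_kbps: int = 64,
                max_bytes: int = MAX_UPLOAD_BYTES,
                safety: float = 0.95) -> list[tuple[float, float]]:
    """Return (start, end) offsets in seconds so every chunk fits under max_bytes.

    Assumes constant-bitrate audio; real files vary, hence the safety margin.
    """
    bytes_per_second = bitrate_kbps * 1000 / 8
    max_chunk_s = math.floor(max_bytes * safety / bytes_per_second)
    chunks = []
    start = 0.0
    while start < duration_s:
        end = min(start + max_chunk_s, duration_s)
        chunks.append((start, end))
        start = end
    return chunks


if __name__ == "__main__":
    # A 2-hour podcast at 64 kbps needs three uploads under these assumptions.
    for start, end in plan_chunks(2 * 60 * 60):
        print(f"{start:8.0f}s -> {end:8.0f}s")
```

Cutting on fixed offsets can split a word in half, so in practice you would probably overlap chunks by a second or two or cut on silence.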

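Math for a content team: if your dev's time costs $80/hour and the pipeline takes 6 hours to build and maintain, that is $480 of fixed cost. Against a free local Whisper setup, the actor's $0.12/minute adds up to $480 at 4,000 minutes of audio; against the $0.006/minute Whisper API, the break-even is closer to 4,200 minutes. Below that volume, the actor is cheaper.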
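The break-even estimate above is simple enough to recompute with your own numbers. This sketch uses the article's figures as defaults; the function name and the idea of treating the DIY per-minute cost as a parameter are illustrative assumptions.

```python
def break_even_minutes(dev_rate_per_hour: float, build_hours: float,
                       actor_rate_per_min: float,
                       diy_rate_per_min: float = 0.0) -> float:
    """Minutes of audio at which building your own pipeline starts to pay off."""
    fixed_cost = dev_rate_per_hour * build_hours
    saving_per_min = actor_rate_per_min - diy_rate_per_min
    if saving_per_min <= 0:
        raise ValueError("the DIY route must be cheaper per minute than the actor")
    return fixed_cost / saving_per_min


if __name__ == "__main__":
    # DIY with local Whisper (no per-minute fee): about 4,000 minutes.
    print(round(break_even_minutes(80, 6, 0.12)))
    # DIY on top of the $0.006/minute Whisper API: about 4,211 minutes.
    print(round(break_even_minutes(80, 6, 0.12, diy_rate_per_min=0.006)))
```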

How to convert YouTube video to text in bulk

Tactiq and friends are one-video-at-a-time. For repurposing workflows — turning a year of podcast episodes into blog posts, scraping competitor webinars, building a searchable knowledge base — you need a pipeline.

Here's a working setup using the Apify actor:

{
  "videos": [
    "https://www.youtube.com/watch?v=VIDEO_ID_1",
    "https://www.youtube.com/watch?v=VIDEO_ID_2"
  ],
  "outputFormats": ["txt", "srt", "json"],
  "model": "faster-whisper-medium",
  "language": "auto"
}

The actor returns each transcript to the Apify Dataset, which you can pull via REST or push to Google Sheets, Notion, or Airtable through Apify's integrations. JSON output includes word-level timestamps, which is what you want if you're chapterizing video or building a "click to jump to this quote" feature.
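Pulling the results over REST can be sketched with the standard library alone. This assumes you already have the run's dataset ID (Apify reports it on the run object) and an API token; the helper names are hypothetical, and the fields inside each item depend on the actor's output schema, which this article does not spell out.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id: str, fmt: str = "json") -> str:
    """Build the URL of the dataset items endpoint for one dataset."""
    query = urllib.parse.urlencode({"format": fmt})
    return f"{API_BASE}/datasets/{urllib.parse.quote(dataset_id)}/items?{query}"


def fetch_items(dataset_id: str, token: str) -> list[dict]:
    """Download all items of a dataset as a list of dicts (network call)."""
    request = urllib.request.Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder ID; substitute the dataset ID of your own run.
    print(dataset_items_url("YOUR_DATASET_ID"))
```

From there, fetch_items("YOUR_DATASET_ID", token) would give you one dict per video to write to a file, a search index, or a spreadsheet.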

For 100 videos averaging 20 minutes each (2,000 minutes total): cost is $240 and total wall-clock time is roughly 90 minutes since Apify runs videos in parallel.

Can I use ChatGPT or Claude to transcribe a YouTube video?

Not directly. Neither ChatGPT nor Claude can ingest video URLs and produce a transcript — they don't have audio access to YouTube. What works:

Extract the transcript first (Tactiq, YouTubeToTranscript, or the Apify actor).
Paste the text into ChatGPT/Claude with a prompt like "Convert this transcript into a 1,500-word blog post with H2 headings."

This two-step process is the actual workflow behind 90% of "YouTube to blog post" SaaS tools: they are wrappers over a transcript fetcher plus an LLM. Building it yourself with a free transcript tool and Claude costs about $0.02 per video, versus $19–49/month for products like Castmagic or Opus Clip.
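If you script the second step instead of pasting by hand, the only real work is splitting long transcripts and wrapping each piece in a prompt. The sketch below shows that part only; the 3,000-word chunk size is an arbitrary assumption, the function names are made up for illustration, and the actual call to ChatGPT or Claude is left to whichever client you use.

```python
def split_transcript(text: str, max_words: int = 3000) -> list[str]:
    """Split a transcript into chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def build_prompt(chunk: str, target_words: int = 1500) -> str:
    """Wrap one transcript chunk in the blog-post prompt from the steps above."""
    return (
        f"Convert this transcript into a {target_words:,}-word blog post "
        "with H2 headings.\n\n"
        f"Transcript:\n{chunk}"
    )


if __name__ == "__main__":
    transcript = "welcome back to the show " * 4
    for chunk in split_transcript(transcript, max_words=10):
        print(build_prompt(chunk, target_words=300))
        print("-" * 40)
```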

What format should YouTube transcripts be in?

Depends on what you're doing with them:

Use case	Best format
Feed to ChatGPT/Claude for summarization	TXT: no timestamps to confuse the model
Subtitles in a video editor (Premiere, DaVinci)	SRT: the de facto universal subtitle format
Web video player captions	VTT: the format the HTML5 `<track>` element expects
Programmatic processing, search index	JSON: structured, with word-level timestamps
Blog post or show notes	TXT stripped of filler words ("um", "uh")

The Apify actor outputs all four in a single run, which is the practical reason to use it over rolling your own Whisper pipeline — you don't have to write four different formatters.
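For a sense of what those formatters involve if you do roll your own, here is a minimal sketch that turns timestamped segments into TXT, SRT, and VTT. It assumes each segment is a dict with start and end in seconds plus a text field, which is a common shape for Whisper-style output but not a documented schema of any tool named here.

```python
def _timestamp(seconds: float, decimal_mark: str) -> str:
    """Format seconds as HH:MM:SS<mark>mmm (SRT uses ',', VTT uses '.')."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}{decimal_mark}{ms:03d}"


def to_txt(segments: list[dict]) -> str:
    """Plain text with no timestamps, suitable for pasting into an LLM."""
    return " ".join(seg["text"].strip() for seg in segments)


def to_srt(segments: list[dict]) -> str:
    """Numbered SRT cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = _timestamp(seg["start"], ",")
        end = _timestamp(seg["end"], ",")
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}")
    return "\n\n".join(blocks) + "\n"


def to_vtt(segments: list[dict]) -> str:
    """WebVTT: a WEBVTT header, then cues with '.' as the millisecond mark."""
    cues = []
    for seg in segments:
        start = _timestamp(seg["start"], ".")
        end = _timestamp(seg["end"], ".")
        cues.append(f"{start} --> {end}\n{seg['text'].strip()}")
    return "WEBVTT\n\n" + "\n\n".join(cues) + "\n"


if __name__ == "__main__":
    demo = [
        {"start": 0.0, "end": 2.5, "text": " Hello there."},
        {"start": 2.5, "end": 5.0, "text": "Welcome back."},
    ]
    print(to_srt(demo))
    print(to_vtt(demo))
```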

How accurate is YouTube auto-transcription vs Whisper?

YouTube's auto-captions are roughly 85–92% accurate for clear English audio from professional channels. Accuracy drops sharply for:

Accented English (down to 70–75%)
Two or more speakers overlapping (60–70%)
Technical jargon, brand names, code (lots of guessed words)
Languages other than English, Spanish, Portuguese (often unusable)

Faster-Whisper large-v3 benchmarks at 95–97% accuracy on the same audio and handles 99 languages. For a marketing team pulling quotes from interviews, that 5–10 point difference is the gap between "publishable" and "needs a 30-minute human edit."
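If you want to check numbers like these on your own audio, the usual yardstick is word error rate (WER) against a hand-corrected reference, with accuracy read as 1 minus WER. The article does not say which metric its percentages use, so treat that mapping as an assumption. A minimal sketch using word-level edit distance:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox jumps over the lazy dog today"
    captions = "the quick brown box jumps over the lazy dog today"
    wer = word_error_rate(truth, captions)
    print(f"WER {wer:.0%}, accuracy {1 - wer:.0%}")
```

This version only lowercases the text; stripping punctuation before comparing would give a fairer score for caption tracks that punctuate differently.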

The trade-off is speed and cost. Auto-captions are free and instant. Whisper costs money and runs at a real-time factor of roughly 0.3–0.5 on a decent GPU, meaning processing takes 30–50% of the audio's length (so a 60-minute video transcribes in about 20–30 minutes). The Apify actor parallelizes this across machines so 60 videos don't take 60× longer.
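To see how the real-time figure and parallelism interact, here is a rough estimator. It assumes a fixed real-time factor and a fixed number of identical workers, and it ignores download time and queueing; how many machines a hosted run actually gets is not something this article specifies, so the worker counts are placeholders.

```python
def transcription_minutes(audio_minutes: float, rtf: float) -> float:
    """Processing time for one file at a given real-time factor."""
    return audio_minutes * rtf


def batch_wall_clock_minutes(video_minutes: list[float], rtf: float,
                             workers: int) -> float:
    """Greedy estimate: longest videos first, each to the least-loaded worker."""
    loads = [0.0] * max(1, workers)
    for minutes in sorted(video_minutes, reverse=True):
        least_loaded = loads.index(min(loads))
        loads[least_loaded] += transcription_minutes(minutes, rtf)
    return max(loads)


if __name__ == "__main__":
    videos = [60.0] * 60  # sixty 1-hour videos
    print(batch_wall_clock_minutes(videos, rtf=0.5, workers=1))   # one machine
    print(batch_wall_clock_minutes(videos, rtf=0.5, workers=60))  # one per video
```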

Common workflows people build with YouTube transcripts

A few patterns we see from people using the actor:

Podcast → SEO blog post. Transcribe weekly podcast, feed to Claude with a "create a blog post" prompt, edit lightly, publish. Drives long-tail search traffic that the audio version never could.
Competitor research. Pull transcripts of every video from a competitor's channel, search for product mentions, pricing, feature names. Beats watching 50 hours of video.
Course note generation. Transcribe a YouTube playlist (Stanford ML, MIT OCW, etc.), get markdown notes with timestamps. Searchable, skimmable.
Sermon / lecture archives. Churches, conferences, and universities transcribe back catalogs to make them searchable for site visitors.
Quote sourcing for short-form video. SRT timestamps tell you exactly when a quote happened so editors can cut clips without scrubbing.
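The competitor-research and quote-sourcing patterns both come down to searching timestamped segments for keywords. A minimal sketch, assuming each segment is a dict with a start time in seconds and a text field (an illustrative shape, not a documented schema):

```python
import re


def find_mentions(segments: list[dict], keywords: list[str]) -> list[dict]:
    """Return one hit per keyword occurrence, with the segment's start time."""
    if not keywords:
        return []
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    hits = []
    for seg in segments:
        for match in pattern.finditer(seg["text"]):
            hits.append({
                "keyword": match.group(1).lower(),
                "start": seg["start"],
                "text": seg["text"].strip(),
            })
    return hits


if __name__ == "__main__":
    demo = [
        {"start": 12.0, "text": "Our Pro plan pricing starts at $49."},
        {"start": 95.5, "text": "Nothing relevant here."},
    ]
    for hit in find_mentions(demo, ["pricing", "enterprise"]):
        print(f"{hit['start']:>7.1f}s  {hit['keyword']}: {hit['text']}")
```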

FAQ

Q: Is there a 100% free way to convert YouTube to text without limits? Yes — Tactiq.io's free tier gives unlimited basic transcripts (no AI summary) as long as the video has captions. For uncaptioned videos, running Whisper locally with yt-dlp is free but requires technical setup and a decent computer.

Q: How long does it take to transcribe a 1-hour YouTube video? Caption-based tools (Tactiq, YouTubeToTranscript) take 5–15 seconds since they just pull existing data. Whisper-based transcription takes 20–30 minutes on a local GPU, or 2–5 minutes on a hosted Apify actor that parallelizes audio chunks.

Q: Does the Apify YouTube transcriber work on private or age-restricted videos? The actor uses Faster-Whisper on downloaded audio, so it works on any video your account can access. Age-restricted and members-only videos require providing YouTube cookies in the actor input; fully private videos shared with you also need authentication.

Q: What's the difference between SRT and VTT subtitle formats? SRT is the older, simpler format used by most video editors and players (Premiere, VLC, DaVinci). VTT is the W3C standard for HTML5 video and supports styling, positioning, and metadata that SRT can't represent. If you're embedding captions on a website, use VTT; otherwise SRT.
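For plain captions the two formats are close enough that converting SRT to VTT is mostly a header and a punctuation change. A minimal sketch, assuming a well-formed SRT file with no styling tags:

```python
import re

_SRT_TIME = re.compile(r"(\d{2}:\d{2}:\d{2}),(\d{3})")


def srt_to_vtt(srt_text: str) -> str:
    """Add the WEBVTT header and switch ',' to '.' on timing lines only."""
    lines = []
    for line in srt_text.strip().splitlines():
        if "-->" in line:
            line = _SRT_TIME.sub(r"\1.\2", line)
        lines.append(line)
    return "WEBVTT\n\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = "1\n00:00:00,000 --> 00:00:02,500\nHello, world.\n"
    print(srt_to_vtt(sample))
```

The numeric cue counters from the SRT file are kept; WebVTT accepts them as optional cue identifiers.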

Q: Can I transcribe a YouTube live stream as it happens? No — the Apify actor and Whisper-based tools work on completed videos, not live streams. For live transcription, you'd need a real-time ASR service like Deepgram or AssemblyAI Streaming. Once the live stream ends and YouTube saves it as a video, you can transcribe it normally.