Tools like ChatGPT and Claude usually don't transcribe YouTube or Facebook videos, not because they can't, but because transcription isn't their core business. Processing is the other obstacle: platforms like YouTube and Facebook often require transcription pipelines, proxy IP handling, and additional infrastructure just to access and process media reliably.
What that means in practice: there's no single Transcribe button inside ChatGPT, drag-and-drop audio files don't reliably get transcribed, and the obvious assumption — paste a YouTube link, get a transcript back — doesn't work out of the box. ChatGPT can't fetch the audio off a video URL on its own.
What does work is two paths. Voice mode transcribes what you say into the mic in real time, inside the ChatGPT mobile or Desktop app. And a custom connector — like the VTS one we maintain — adds the missing infrastructure (the pipeline, the proxy handling, the engine) and exposes a transcription tool that ChatGPT can call when you paste a YouTube or Facebook link. Those two cover most real use cases. Everything else either lives on the OpenAI API side (Whisper) or means using a dedicated transcription app and bringing the text back into ChatGPT afterward.
- ChatGPT does not natively transcribe audio or video files you drag into the chat. Uploads might preview a waveform, but you won't get a reliable transcript out.
- Voice mode is the one native path — it transcribes you speaking live, on the ChatGPT mobile and Desktop apps. The full transcript saves to the chat history.
- For pre-recorded URLs, add a custom connector. The VTS connector supports YouTube and Facebook links (videos and reels) — paste the URL in chat, ChatGPT calls the tool, transcript comes back in the same conversation.
- Other sources (raw .mp3 files, Zoom recordings, podcasts you have locally): transcribe at vts.askgiya.com first, then paste the text into ChatGPT for the summary/quotes/analysis step.
- "Is it free?" Voice mode has a free daily cap. VTS gives you 300 minutes/month free; past that you top up your wallet or move to the Pro plan.
How can I transcribe my audio in ChatGPT?
The honest answer depends on what "my audio" is:
- You speaking, live, into a phone or laptop mic — turn on voice mode in the ChatGPT app. It transcribes everything you say into the chat and replies aloud. This is the only native, reliable transcription path.
- A YouTube or Facebook video — wire up a custom connector once, then paste the link in any chat. ChatGPT calls the connector's transcription tool and the transcript appears alongside the answer to your question.
- A pre-recorded file you have locally (.mp3, .m4a, Zoom .mp4, voice memos): ChatGPT isn't the right place to start. Transcribe at vts.askgiya.com (drag-and-drop, upload, or paste a public URL), then paste the resulting transcript back into ChatGPT for whatever you want to do with it: summarising, quote extraction, rewriting, translating.
That third bucket is where most people get stuck. They assume ChatGPT does what voice mode does for any audio file — it doesn't. Live mic and pre-recorded file are two completely different code paths inside the app.
Is it free to use ChatGPT for audio transcription?
Voice mode is free on chatgpt.com accounts with a daily cap on minutes. Sustained use hits the cap and bounces you back to text-only for the rest of the day. Plus ($20/month) raises the cap substantially; Team and Enterprise raise it further.
The custom-connector path is free to set up — connectors themselves don't cost anything. You pay only for the transcription work the connector does. With VTS that's 300 minutes per month on the free tier, then 6¢/min from your wallet, or a flat 25 hours/month on Pro. No markup for going through ChatGPT versus going to VTS directly.
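To make the wallet arithmetic concrete, here is a minimal sketch that uses only the two numbers quoted above (300 free minutes, then 6¢/min). It assumes usage is billed per whole minute with no rounding rules, which the pricing text doesn't spell out, and it leaves the Pro plan out because its price isn't stated here.

```python
FREE_MINUTES = 300          # free tier per month, from the pricing above
OVERAGE_CENTS_PER_MIN = 6   # wallet rate once the free tier is used up


def wallet_cost_cents(minutes_used: int) -> int:
    """Cents owed for one month on the free tier plus wallet top-ups.

    Assumption: billing is per whole minute; real rounding may differ.
    """
    billable = max(0, minutes_used - FREE_MINUTES)
    return billable * OVERAGE_CENTS_PER_MIN


if __name__ == "__main__":
    for minutes in (120, 300, 450):
        print(f"{minutes} min -> ${wallet_cost_cents(minutes) / 100:.2f}")
```

Under those assumptions, 450 minutes in a month means 150 billable minutes, or $9.00 from the wallet.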
The Whisper API (for developers writing scripts) is pay-per-use, around $0.006 (just over half a cent) per minute of audio at the time of writing. Cheap, but you have to write the integration yourself.
How to make a transcribe — the three working flows
"Make a transcribe" usually means: I have something I want turned into text — what do I actually click? Here's the shortest path for each input shape.
If it's a live conversation (you talking)
1. Open the ChatGPT mobile or Desktop app. Voice mode isn't on the web yet.
2. Tap the headphones icon in the bottom-right of any chat (mobile) or the mic icon in the composer (Desktop).
3. Start talking. ChatGPT transcribes you in real time. Tap end-call when you're done; the full text conversation stays in your chat history.
If it's a YouTube or Facebook link
1. Add the connector. Go to chatgpt.com → your avatar → Settings → Connectors → Add custom connector. Paste the server URL https://vts.askgiya.com/mcp and name it "VTS Transcription." Click Sign in and approve the OAuth prompt. Takes about 60 seconds.
2. Paste the link with your request. For example: "Transcribe this YouTube video and pull the three best quotes with timestamps: https://www.youtube.com/watch?v=..."
3. Wait for the transcript; the time depends on length. ChatGPT shows "Using VTS Transcription…" while it works, then drops the transcript card into the chat and answers your follow-up question in the same turn.
If it's a local audio file
Don't try to upload the file to ChatGPT and ask it to transcribe. It's not a supported flow — you'll get either a polite refusal or a partial result that's wrong in ways you can't see. Do this instead:
1. Go to vts.askgiya.com. Drag the file into the uploader, or paste a public URL if you have one.
2. Wait for the transcript (usually under three minutes for a 30-minute file). Pick the format you want: plain text, timestamps, or SRT.
3. Paste the transcript into ChatGPT with whatever question you want answered. For example: "Here's the transcript of yesterday's interview. Pull the five sharpest quotes about pricing and group them by topic."
Two tools, but the round trip is faster than fighting ChatGPT's file uploader.
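If you are unsure which export format to pick, it helps to see what SRT looks like next to plain timestamps. The sketch below builds SRT text from a list of (start_seconds, end_seconds, text) tuples. That tuple shape is a hypothetical stand-in for illustration, not the format VTS exports; only the SRT layout itself (index line, `HH:MM:SS,mmm --> HH:MM:SS,mmm`, text, blank line) is standard.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp SRT expects."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(segments) -> str:
    """Turn (start, end, text) tuples into SRT cue blocks.

    The tuple layout is an assumption made for this example.
    """
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 2.5, "Thanks for joining."), (2.5, 6.0, "Let's talk pricing.")]))
```

SRT is the better choice when you want ChatGPT to quote lines with timecodes; plain text is enough for summaries.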
How to automatically transcribe in ChatGPT
The phrase "automatically" usually means one of three things, and the right answer is different for each.
- "I want ChatGPT to fetch the audio itself instead of me uploading it." That's the connector. The VTS connector lets ChatGPT call transcribe_url against a YouTube or Facebook link you paste. You don't have to download anything. Note the scope: VTS handles youtube.com, youtu.be, m.youtube.com, music.youtube.com, facebook.com, m.facebook.com, web.facebook.com, and fb.watch. Other sources (Vimeo, TikTok, X, raw audio URLs) aren't in the connector's supported-host list right now; you'd transcribe those at vts.askgiya.com directly.
- "I want a hands-off pipeline: new files in, transcripts out, no chat involvement." That's the OpenAI Whisper API or the newer gpt-4o-transcribe model, called from a script you write. Not interactive; you trade the chat UI for full automation.
- "I want ChatGPT to always apply the same prompt when I bring in a transcript." Use a Project (Plus and above). Pin a custom instruction like "For every transcript I share in this project, output a 3-bullet TLDR, the five sharpest quotes with timecodes, and a 200-word executive summary." New transcripts you paste in run that prompt automatically.
The same connector pattern works on Claude — we walked through the Claude version separately. MCP is the open standard underneath; the OAuth flow is identical on both sides.
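For the hands-off pipeline option in the list above, a script can be quite small. The following is a minimal sketch, not a production setup: it assumes the official `openai` Python package is installed and `OPENAI_API_KEY` is set, and the folder layout, the extension list, and the function names are illustrative choices of ours.

```python
from pathlib import Path
from typing import Callable

# Assumption: the file types you care about; adjust as needed.
AUDIO_EXTENSIONS = {".mp3", ".m4a", ".mp4", ".wav"}


def transcribe_with_openai(path: Path) -> str:
    """Send one file to OpenAI's transcription endpoint and return the text."""
    from openai import OpenAI  # imported lazily so the rest runs without the SDK

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with path.open("rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text


def transcribe_folder(
    inbox: Path,
    outbox: Path,
    transcribe: Callable[[Path], str] = transcribe_with_openai,
) -> list[Path]:
    """Write a .txt transcript for every new audio file in inbox."""
    outbox.mkdir(parents=True, exist_ok=True)
    written = []
    for audio in sorted(inbox.iterdir()):
        if audio.suffix.lower() not in AUDIO_EXTENSIONS:
            continue
        target = outbox / (audio.stem + ".txt")
        if target.exists():  # already transcribed on an earlier run
            continue
        target.write_text(transcribe(audio), encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    for transcript in transcribe_folder(Path("inbox"), Path("transcripts")):
        print("wrote", transcript)
```

Run it on a schedule (cron, a task scheduler) and you get "new files in, transcripts out". Passing the transcription function as a parameter keeps the folder logic testable without calling the API.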
How do I start doing transcription? Your first five minutes
If you've never transcribed anything before, pick the path based on what you actually have:
- Got a phone with voice memos? Go to vts.askgiya.com, drag the .m4a in, wait two minutes. Done. That gives you a transcript you can paste into ChatGPT for the thinking part: summarising, pulling quotes, rewriting.
- Got a YouTube link? Set up the VTS connector in ChatGPT (60 seconds, one-time) and from then on every YouTube link you paste with a "transcribe this" prompt gets handled inline.
- Want to dictate something now? Open the ChatGPT mobile or Desktop app, tap the headphones, and start talking. Live transcription, no setup.
Don't try to fight ChatGPT's chat composer into doing file transcription it's not designed for. The split between live voice (native) and pre-recorded media (connector or external tool) is real, and working with that grain is much faster than against it.
How to activate Live Transcribe (voice mode) in ChatGPT
"Live Transcribe" in the ChatGPT context is voice mode — you talk, ChatGPT writes down what you said and replies. It's on the iOS, Android, and Desktop apps; not the web client.
1. Update the app to the current version. Voice mode rolled out in waves; older builds won't have it.
2. Open Settings (avatar → Settings).
3. Open the voice settings (sometimes labelled Speech). Pick a voice (Juniper, Sky, Cove, Breeze, Ember etc.) and set your Main Language to "Auto-detect" if you bounce between languages.
4. Tap the headphones icon (bottom-right on mobile, in the composer on Desktop). The screen switches to the voice-mode view with a pulsing orb.
5. Start talking. ChatGPT transcribes in real time, sends what you said as the next message, and replies aloud. The full transcribed conversation is saved as text in the chat history, so you can scroll back any time.
Common failure → fix:
- No headphones icon. You're on the web client or an older build. Update the app, or switch to mobile/Desktop.
- Voice mode says "not available in your region." OpenAI rolled voice mode out per-region. Check the OpenAI status page or wait for the rollout.
- Voice mode cuts off mid-conversation. You hit the daily voice cap. Plus raises the cap substantially.
- Background noise breaks recognition. Voice mode does best on a clean mic. A USB lavalier or a phone headset is a meaningful upgrade — same principle as in our lavalier vs handheld breakdown for interviews.
About the recordings. Voice mode keeps the text transcript in your chat history. OpenAI states the audio itself isn't retained after the session by default, but check Settings → Data Controls → "Improve the model for everyone" if you don't want it used for training.
When ChatGPT isn't enough
The honest scope of "transcription in ChatGPT" is narrower than people expect:
- Live mic only for native ChatGPT — anything pre-recorded needs an external tool or a connector.
- Connector scope is URL-based, not file-based — the VTS connector handles YouTube and Facebook links. For everything else (local files, songs, multilingual podcasts, Zoom recordings with multiple speakers, depositions, focus groups), use VTS directly and paste the transcript into ChatGPT afterward.
- No real diarisation in voice mode or basic file uploads — for accurate speaker labels, use a dedicated tool like VTS with diarisation enabled.
Where you'll feel the limits most: long-form journalism interviews, Zoom calls with three or more speakers, and any content where word-perfect accuracy matters. ChatGPT is brilliant at the next step — summarising the transcript, pulling quotes, drafting a piece around it. The transcription itself is better outsourced.
FAQ
Can I just drag an mp3 into ChatGPT and ask it to transcribe?
In practice, no. ChatGPT may accept the upload and acknowledge it, but you won't reliably get a full, accurate transcript back. The supported native path for transcription is voice mode (live mic only). For files, transcribe at vts.askgiya.com and paste the result into ChatGPT.
What sources does the VTS connector for ChatGPT support?
Public links from YouTube (youtube.com, youtu.be, m.youtube.com, music.youtube.com) and Facebook (facebook.com, m.facebook.com, web.facebook.com, fb.watch). Direct file uploads, Vimeo, TikTok, X, and arbitrary audio URLs aren't part of the connector right now — for those, use the VTS web app directly.
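If you want to know in advance whether a link falls inside that list, a quick check like the one below works. It is a sketch built from the hostnames listed above; how the connector itself matches hosts isn't documented here, and treating a leading "www." as equivalent is our assumption.

```python
from urllib.parse import urlparse

# Hostnames taken from the supported-source list above.
SUPPORTED_HOSTS = {
    "youtube.com", "youtu.be", "m.youtube.com", "music.youtube.com",
    "facebook.com", "m.facebook.com", "web.facebook.com", "fb.watch",
}


def connector_supports(url: str) -> bool:
    """Return True if the URL's host is on the supported list.

    Assumption: a leading "www." is treated as the bare domain.
    """
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host in SUPPORTED_HOSTS


if __name__ == "__main__":
    for link in ("https://youtu.be/VIDEO_ID", "https://vimeo.com/123"):
        print(link, "->", "connector" if connector_supports(link) else "VTS web app")
```

Links without a scheme (for example a bare youtu.be/... string) have no parseable host and are reported as unsupported, so paste the full https:// URL.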
What's the difference between voice mode and the Whisper API?
Voice mode is the live mic feature inside the ChatGPT app, optimised for conversation. The Whisper API (and gpt-4o-transcribe) is a developer endpoint you call from code — better for batch and unattended pipelines, but you have to write the integration yourself.
How accurate is the connector-based transcription?
For clean YouTube and Facebook content in major languages, word-error rate is typically 3–8%. Songs, heavy accents, and noisy backgrounds push that higher — see our accented-English accuracy bands and the /sla page for the per-audio-type bands plus the auto-escalation ladder VTS runs to keep noisy audio inside the band.
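For readers who haven't met the metric: word-error rate is usually defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript, so a 5% WER means roughly 5 wrong words per 100. The sketch below is one common way to compute it with a word-level edit distance; it only lowercases and splits on whitespace, and it is not necessarily how VTS measures its published bands.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing
            insertion = current[j - 1] + 1  # extra hypothesis word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox jumps over the lazy dog"
    output = "the quick brown fox jumped over lazy dog"
    print(f"WER: {word_error_rate(truth, output):.1%}")
```

In the example, one substituted word and one dropped word against a nine-word reference give a WER of 2/9.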
Can ChatGPT label speakers in a multi-person conversation?
Voice mode treats everything as one speaker. The connector can pass diarize=True to VTS for YouTube/Facebook URLs with multiple voices — say "transcribe with speaker labels" in your prompt and ChatGPT will call the tool with that option set.
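Under the hood, a connector call is an MCP `tools/call` request. The sketch below shows roughly what such a request could look like with diarisation switched on. The tool name `transcribe_url` and the `diarize` option come from this article; the `url` argument name and the placeholder link are assumptions, and the real argument schema is whatever the server reports through `tools/list`. ChatGPT builds this request for you, so this is only to show what "with that option set" means.

```python
import json


def build_tool_call(video_url: str, diarize: bool = False, request_id: int = 1) -> dict:
    """Build an MCP JSON-RPC tools/call payload for the transcription tool.

    Assumption: the tool takes its link in an argument named "url".
    """
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "transcribe_url",
            "arguments": {"url": video_url, "diarize": diarize},
        },
    }


if __name__ == "__main__":
    payload = build_tool_call("https://youtu.be/VIDEO_ID", diarize=True)
    print(json.dumps(payload, indent=2))
```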
Is the connector the same on Claude and ChatGPT?
Yes. The same MCP server (https://vts.askgiya.com/mcp) works as a custom connector in both. We walked through the Claude setup separately; the OAuth flow is identical on ChatGPT.
Once you've got voice mode on and the connector wired in for YouTube and Facebook, ChatGPT covers two of the three real transcription needs. The third — local files, longer-form media, multilingual content — is what dedicated tools like VTS exist for. Use the right one for the right input, and the round trip is fast.
Sources
- OpenAI — ChatGPT — official web client; voice mode lives in the mobile and Desktop apps.
- OpenAI — voice mode FAQ — availability, languages, and quotas.
- OpenAI status — check here when voice mode isn't working.
- Model Context Protocol — official spec — the open standard custom connectors implement.
- VTS accuracy expectations — published bands by audio type plus the auto-escalation ladder.



