Tools like ChatGPT and Claude usually don't transcribe YouTube or Facebook videos. The models could handle the task; transcription just isn't their core business. Access is the other obstacle: reliably fetching and processing media from YouTube and Facebook takes a transcription pipeline, proxy IP handling, and supporting infrastructure.

What that means in practice: there's no single Transcribe button inside ChatGPT, drag-and-drop audio files don't reliably get transcribed, and the obvious assumption — paste a YouTube link, get a transcript back — doesn't work out of the box. ChatGPT can't fetch the audio off a video URL on its own.

Two paths do work. Voice mode transcribes what you say into the mic in real time, inside the ChatGPT mobile or Desktop app. A custom connector, like the VTS one we maintain, adds the missing infrastructure (the pipeline, the proxy handling, the engine) and exposes a transcription tool that ChatGPT can call when you paste a YouTube or Facebook link. Those two cover most real use cases. Everything else either lives on the OpenAI API side (Whisper) or means using a dedicated transcription app and bringing the text back into ChatGPT afterward.

Key takeaways
  • ChatGPT does not natively transcribe audio or video files you drag into the chat. Uploads might preview a waveform, but you won't get a reliable transcript out.
  • Voice mode is the one native path — it transcribes you speaking live, on the ChatGPT mobile and Desktop apps. The full transcript saves to the chat history.
  • For pre-recorded URLs, add a custom connector. The VTS connector supports YouTube and Facebook links (videos and reels) — paste the URL in chat, ChatGPT calls the tool, transcript comes back in the same conversation.
  • Other sources (raw .mp3 files, Zoom recordings, podcasts you have locally) — transcribe at vts.askgiya.com first, then paste the text into ChatGPT for the summary/quotes/analysis step.
  • "Is it free?" Voice mode has a free daily cap. VTS gives you 300 minutes/month free; past that you top up your wallet or move to the Pro plan.


How can I transcribe my audio in ChatGPT?

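The honest answer depends on what "my audio" is:
  • You speaking live: use voice mode in the ChatGPT mobile or Desktop app.
  • A YouTube or Facebook link: add a custom connector such as VTS and paste the URL in chat.
  • A pre-recorded file on your device: transcribe it at vts.askgiya.com, then paste the text into ChatGPT.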

That third bucket is where most people get stuck. They assume ChatGPT does what voice mode does for any audio file — it doesn't. Live mic and pre-recorded file are two completely different code paths inside the app.

Is it free to use ChatGPT for audio transcription?

Voice mode is free on standard ChatGPT accounts, with a daily cap on minutes. Sustained use hits the cap and bounces you back to text-only for the rest of the day. Plus ($20/month) raises the cap substantially; Team and Enterprise raise it further.

The custom-connector path is free to set up — connectors themselves don't cost anything. You pay only for the transcription work the connector does. With VTS that's 300 minutes per month on the free tier, then 6¢/min from your wallet, or a flat 25 hours/month on Pro. No markup for going through ChatGPT versus going to VTS directly.

The Whisper API (for developers writing scripts) is pay-per-use — currently around half a cent per minute of audio. Cheap, but you have to write the integration yourself.
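To make the connector pricing above concrete, here is a minimal Python sketch of the monthly wallet charge. It assumes whole-minute billing with no rounding rules and uses only the figures quoted in this section (300 free minutes, then 6¢ per minute). The function and constant names are illustrative, and actual billing may differ in its details. The Pro plan is left out because its price isn't stated here.

```python
FREE_MINUTES_PER_MONTH = 300   # free-tier allowance quoted in this section
OVERAGE_CENTS_PER_MINUTE = 6   # wallet rate quoted in this section


def monthly_wallet_cost_cents(minutes_used: int) -> int:
    """Return the wallet charge in cents for one month of transcription.

    Assumption: usage is billed per whole minute once the free
    allowance is exhausted, with no other fees or rounding.
    """
    billable_minutes = max(0, minutes_used - FREE_MINUTES_PER_MONTH)
    return billable_minutes * OVERAGE_CENTS_PER_MINUTE


if __name__ == "__main__":
    for minutes in (120, 300, 450):
        cents = monthly_wallet_cost_cents(minutes)
        print(f"{minutes} min -> ${cents / 100:.2f}")
```

Under these assumptions, anything up to 300 minutes costs nothing, and 450 minutes comes to 150 billable minutes.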

How to make a transcribe — the three working flows

"Make a transcribe" usually means: I have something I want turned into text — what do I actually click? Here's the shortest path for each input shape.

If it's a live conversation (you talking)

1. Open the ChatGPT mobile app or Desktop app. Voice mode isn't on the web yet.

2. Tap the headphones icon in the bottom-right of any chat (mobile) or the mic icon in the composer (Desktop).

3. Talk. ChatGPT transcribes you in real time. Tap end-call when you're done; the full text conversation stays in your chat history.

If it's a YouTube or Facebook link

1. Add the VTS connector once. Go to chatgpt.com → your avatar → Settings → Connectors → Add custom connector. Paste the server URL https://vts.askgiya.com/mcp and name it "VTS Transcription." Click Sign in and approve the OAuth prompt. Takes about 60 seconds.

2. Paste the link in any chat with a clear ask. For example: "Transcribe this YouTube video and pull the three best quotes with timestamps: https://www.youtube.com/watch?v=..."

3. Wait 30 seconds to a few minutes depending on length. ChatGPT shows "Using VTS Transcription…" while it works, then drops the transcript card into the chat and answers your follow-up question in the same turn.
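For the curious: when ChatGPT calls a connector, it sends an MCP `tools/call` request, which is a JSON-RPC 2.0 message. The sketch below builds one in Python. The envelope (`jsonrpc`, `id`, `method`, `params.name`, `params.arguments`) follows the MCP specification, but the tool name `transcribe` and the argument name `url` are hypothetical placeholders, not the connector's documented schema. Only the `diarize` option is mentioned in this article.

```python
import json


def build_tool_call(video_url: str, diarize: bool = False, request_id: int = 1) -> dict:
    """Build an MCP `tools/call` JSON-RPC request for a transcription tool."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            # Hypothetical tool name; the real connector may call it something else.
            "name": "transcribe",
            # "url" is an assumed argument name; "diarize" mirrors the option
            # described in the FAQ.
            "arguments": {"url": video_url, "diarize": diarize},
        },
    }


if __name__ == "__main__":
    request = build_tool_call("https://www.youtube.com/watch?v=...", diarize=True)
    print(json.dumps(request, indent=2))
```

You never write this by hand in the ChatGPT flow; the client generates the request from your prompt. It is shown only to clarify what "ChatGPT calls the tool" means.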

If it's a local audio file

Don't try to upload the file to ChatGPT and ask it to transcribe. It's not a supported flow — you'll get either a polite refusal or a partial result that's wrong in ways you can't see. Do this instead:

1. Go to vts.askgiya.com. Drag the file into the uploader, or paste a public URL if you have one.

2. Wait for the transcript (usually under three minutes for a 30-minute file). Pick the format you want: plain text, timestamps, or SRT.

3. Copy the transcript and paste it into ChatGPT with whatever question you want answered. "Here's the transcript of yesterday's interview. Pull the five sharpest quotes about pricing and group them by topic."

Two tools, but the round trip is faster than fighting ChatGPT's file uploader.
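If you're unsure what the SRT option in step 2 gives you, the format is simple: a running index, a `start --> end` time line with comma-separated milliseconds, then the caption text, with a blank line between entries. The Python sketch below renders that layout from a list of `(start_seconds, end_seconds, text)` tuples. That input shape is an assumption made for illustration, not the VTS export schema.

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format a time offset in seconds as HH:MM:SS,mmm (the SRT convention)."""
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Render (start_seconds, end_seconds, text) tuples as an SRT document."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        time_line = f"{to_srt_timestamp(start)} --> {to_srt_timestamp(end)}"
        blocks.append(f"{index}\n{time_line}\n{text.strip()}")
    # Entries are separated by one blank line.
    return "\n\n".join(blocks) + "\n"


if __name__ == "__main__":
    print(segments_to_srt([(0.0, 2.5, "Hello"), (2.5, 4.0, "World")]))
```

Plain text is usually the better choice for pasting into ChatGPT; SRT matters when the transcript has to line up with the video as captions.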


How to automatically transcribe in ChatGPT

The phrase "automatically" usually means one of three things, and the right answer is different for each.

The same connector pattern works on Claude — we walked through the Claude version separately. MCP is the open standard underneath; the OAuth flow is identical on both sides.

How do I start doing transcription? Your first five minutes

If you've never transcribed anything before, pick the path based on what you actually have:

Don't try to fight ChatGPT's chat composer into doing file transcription it's not designed for. The split between live voice (native) and pre-recorded media (connector or external tool) is real, and working with that grain is much faster than against it.

How to activate Live Transcribe (voice mode) in ChatGPT

"Live Transcribe" in the ChatGPT context is voice mode — you talk, ChatGPT writes down what you said and replies. It's on the iOS, Android, and Desktop apps; not the web client.

1. Update the ChatGPT app to the current version. Voice mode rolled out in waves; older builds won't have it.

2. Open the app and tap Settings (avatar → Settings).

3. Go to Voice (sometimes labelled Speech). Pick a voice (Juniper, Sky, Cove, Breeze, Ember, etc.) and set your Main Language to "Auto-detect" if you bounce between languages.

4. Open any chat and tap the headphones icon (bottom-right on mobile, in the composer on Desktop). The screen switches to the voice-mode view with a pulsing orb.

5. Speak. ChatGPT transcribes in real time, sends what you said as the next message, and replies aloud. The full transcribed conversation is saved as text in the chat history, so you can scroll back any time.

Common failure → fix:

About the recordings. Voice mode keeps the text transcript in your chat history. OpenAI states the audio itself isn't retained after the session by default, but check Settings → Data Controls → "Improve the model for everyone" if you don't want it used for training.

When ChatGPT isn't enough

The honest scope of "transcription in ChatGPT" is narrower than people expect:

Where you'll feel the limits most: long-form journalism interviews, Zoom calls with three or more speakers, and any content where word-perfect accuracy matters. ChatGPT is brilliant at the next step — summarising the transcript, pulling quotes, drafting a piece around it. The transcription itself is better outsourced.

FAQ

Can I just drag an mp3 into ChatGPT and ask it to transcribe?

In practice, no. ChatGPT may accept the upload and acknowledge it, but you won't reliably get a full, accurate transcript back. The supported native path for transcription is voice mode (live mic only). For files, transcribe at vts.askgiya.com and paste the result into ChatGPT.

What sources does the VTS connector for ChatGPT support?

Public links from YouTube (youtube.com, youtu.be, m.youtube.com, music.youtube.com) and Facebook (facebook.com, m.facebook.com, web.facebook.com, fb.watch). Direct file uploads, Vimeo, TikTok, X, and arbitrary audio URLs aren't part of the connector right now — for those, use the VTS web app directly.
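If you're scripting around the connector, a quick pre-check can tell you whether a link belongs to one of the supported hosts before you paste it. The Python sketch below uses only the hostnames listed in this answer. Stripping a leading `www.` is an assumption (based on the `www.youtube.com` example earlier in this article), and the check says nothing about whether the video itself is public.

```python
from urllib.parse import urlparse

# Hostnames listed in this article as supported by the connector.
SUPPORTED_HOSTS = {
    "youtube.com", "youtu.be", "m.youtube.com", "music.youtube.com",
    "facebook.com", "m.facebook.com", "web.facebook.com", "fb.watch",
}


def is_supported_link(url: str) -> bool:
    """Return True if the URL's host is one the connector is said to accept."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    host = (parsed.hostname or "").lower()
    # Assumption: "www." variants are treated the same as the bare host.
    if host.startswith("www."):
        host = host[4:]
    return host in SUPPORTED_HOSTS


if __name__ == "__main__":
    for link in ("https://www.youtube.com/watch?v=...", "https://vimeo.com/123"):
        print(link, is_supported_link(link))
```

Anything that fails this check belongs in the VTS web app rather than the connector.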

What's the difference between voice mode and the Whisper API?

Voice mode is the live mic feature inside the ChatGPT app, optimised for conversation. The Whisper API (and gpt-4o-transcribe) is a developer endpoint you call from code — better for batch and unattended pipelines, but you have to write the integration yourself.
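For a sense of what "write the integration yourself" involves, here is a minimal Python sketch using the official `openai` package. The `client.audio.transcriptions.create` call and the `whisper-1` model name come from the OpenAI SDK. The function name and the file path in the usage comment are placeholders. Treat it as a starting point under those assumptions, not a production pipeline: it has no retries, no chunking for large files, and no error handling.

```python
from pathlib import Path


def transcribe_file(client, audio_path: str, model: str = "whisper-1") -> str:
    """Send one local audio file to the transcription endpoint and return its text.

    `client` is expected to be an `openai.OpenAI` instance (or anything
    exposing the same `audio.transcriptions.create` method).
    """
    with Path(audio_path).open("rb") as audio_file:
        result = client.audio.transcriptions.create(model=model, file=audio_file)
    return result.text


# Usage (needs `pip install openai` and OPENAI_API_KEY set in the environment):
#   from openai import OpenAI
#   print(transcribe_file(OpenAI(), "interview.mp3"))  # placeholder file name
```

Passing the client in as an argument keeps the function easy to test and lets you swap the model name (for example to `gpt-4o-transcribe`) without touching the rest of the script.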

How accurate is the connector-based transcription?

For clean YouTube and Facebook content in major languages, word-error rate is typically 3–8%. Songs, heavy accents, and noisy backgrounds push that higher. Our accented-English accuracy bands and the /sla page list the bands per audio type, along with the auto-escalation ladder VTS runs to keep noisy audio inside its band.
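Word-error rate is the number of word substitutions, deletions, and insertions needed to turn the transcript into a trusted reference, divided by the number of words in the reference. The Python sketch below computes it with a standard edit-distance table. The normalisation (lowercasing and splitting on whitespace) is a simplifying assumption; real evaluations usually also strip punctuation and normalise numbers, and this is not necessarily how VTS measures its own bands.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER: word-level edit distance divided by reference length."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must not be empty")

    # previous[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    print(word_error_rate("the quick brown fox", "the quick brown box"))
```

One wrong word in a four-word reference gives a WER of 25%, so a 3–8% rate means roughly three to eight word errors per hundred words of reference.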

Can ChatGPT label speakers in a multi-person conversation?

Voice mode treats everything as one speaker. The connector can pass diarize=True to VTS for YouTube/Facebook URLs with multiple voices — say "transcribe with speaker labels" in your prompt and ChatGPT will call the tool with that option set.

Is the connector the same on Claude and ChatGPT?

Yes. The same MCP server (https://vts.askgiya.com/mcp) works as a custom connector in both. We walked through the Claude setup separately; the OAuth flow is identical on ChatGPT.

Once you've got voice mode on and the connector wired in for YouTube and Facebook, ChatGPT covers two of the three real transcription needs. The third — local files, longer-form media, multilingual content — is what dedicated tools like VTS exist for. Use the right one for the right input, and the round trip is fast.

Sources