AI transcription costs between $0.004 and $0.36 per audio minute on API-based services, with most production-grade providers landing around $0.02 to $0.06 per minute. If you'd rather not deal with usage math at all, consumer apps charge $10 to $30 per month for a few hundred minutes, and human-graded transcription still costs roughly $1 to $1.50 per minute.
The price you actually pay depends on five things: who's doing the work (a model or a person), how accurate you need it, which language, whether you want speaker labels and timestamps, and how you buy it (per-minute, subscription, or self-hosted).
What does AI transcription actually cost per minute?
Here's a snapshot of published per-minute pricing from the providers most teams compare. All figures are USD, taken from each provider's public pricing page (linked in Sources). Where a tier has both async and streaming pricing, the async/batch number is shown — it's the cheaper of the two and the one most people are buying.
| Provider | Type | Price (USD/min) | Notes |
|---|---|---|---|
| OpenAI Whisper API | Developer API — you build the app | ~$0.006 | whisper-1; no speaker labels |
| Deepgram | Developer API | ~$0.0043 | Nova-3 batch; cheapest enterprise-grade |
| AssemblyAI | Developer API | ~$0.0062 | Universal async; diarization included |
| Google Cloud Speech-to-Text | Developer API | ~$0.024 | First 60 min/month free |
| AWS Transcribe | Developer API | ~$0.024 | Drops to ~$0.0153 at scale |
| Azure AI Speech | Developer API | ~$0.0167 | Standard batch, ~$1.00/hour |
| Rev.ai | Developer API | ~$0.02 | Machine async, ~$1.20/hour |
| Rev (human) | Done-for-you service | ~$1.50 | Per finished minute |
| Otter | Ready-to-use app | ~$20/user/mo | Subscription, ~6,000 min/mo |
| VTS | Ready-to-use app | $0.08 | Paste a link or upload → transcript in minutes, no code; no subscription; +$0.02/min speaker labels; 3 free 60s clips/mo |
Read that table by Type, not just by price. Almost every cheap row is a developer API — the per-minute number buys you raw transcription and nothing else. To actually use it you have to write the code: the upload flow, speaker labels, timestamps, SRT/VTT export, retries and error handling, and an interface to paste a link into. That's days of engineering (and a maintenance burden) before anyone gets a transcript. The "$0.0043/min" isn't the cost of a transcript — it's the cost of the API call inside an app you still have to build.
The ready-to-use apps cost more per minute because the price includes the product. With VTS you paste a link or upload a file and get a clean transcript, SRT, and timestamps back in minutes — no code, no GPU, no setup. So compare APIs to APIs and apps to apps: if you're not going to build (and maintain) a transcription app, the real choice is between the finished products, and there VTS's $0.08/min with a free tier and no subscription is the number that matters.
Self-hosting a Whisper model on your own GPU lands closer to $0.001 to $0.005 per minute of audio, depending on the GPU you rent and how full you keep it. That's pure compute — and again, only the compute. You still build everything around it. You're trading dollars for engineering time.
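To see where a range like that can come from, here is a minimal sketch of the compute-only math. The $0.70 hourly price is the midpoint of the A10 range quoted later in this article; the 10× speed factor and both utilization figures are illustrative assumptions, not benchmarks, and the function name is ours.

```python
def self_hosted_cost_per_audio_minute(gpu_usd_per_hour, speed_factor, utilization):
    """Compute-only cost (USD) of transcribing one minute of audio.

    gpu_usd_per_hour: rental price of the GPU.
    speed_factor: audio minutes processed per minute of GPU time (assumed).
    utilization: fraction of the rented time the GPU is actually busy (0-1].
    """
    if speed_factor <= 0 or not 0 < utilization <= 1:
        raise ValueError("speed_factor must be > 0 and utilization in (0, 1]")
    audio_minutes_per_gpu_hour = 60 * speed_factor * utilization
    return gpu_usd_per_hour / audio_minutes_per_gpu_hour


# Hypothetical inputs: same GPU, kept busy vs. mostly idle.
busy = self_hosted_cost_per_audio_minute(0.70, speed_factor=10, utilization=0.9)
idle = self_hosted_cost_per_audio_minute(0.70, speed_factor=10, utilization=0.25)
print(f"busy GPU: ${busy:.4f}/min, mostly idle GPU: ${idle:.4f}/min")
```

Under these assumptions both results fall inside the $0.001 to $0.005 range, and the gap between them is entirely down to how full you keep the GPU.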
Why do some providers charge 50× more than others?
A 60-minute interview costs around 26 cents on Deepgram's cheapest tier and around $90 if you send it to a human transcriptionist. That's a 350× spread for nominally the same deliverable. Five things explain almost all of the gap.
Accuracy. Word error rate on clean English audio is now in the 5–10% range for the top models and around 1–2% for human-graded work. For a clean podcast you won't see the difference. For a noisy deposition with cross-talk, you will. We dug into the realistic accuracy range in "what to expect from transcription accuracy".
Speaker diarization. Telling who said what is a separate model on top of the transcription model. AssemblyAI bundles it in. The Whisper API doesn't ship it at all, so you'd add a tool like pyannote yourself.
Language. Almost all per-minute prices above are for English. Some providers charge the same for any supported language; others tier non-English higher.
Streaming vs. batch. Real-time pricing is typically 1.5× to 3× the async price because the provider holds a connection open. If you're transcribing recorded files, always pick batch.
How you buy it. Per-minute API billing punishes spiky workloads but is honest: you pay for exactly what you use. Subscriptions look cheap until you use only a fraction of your minute cap and pay for the rest anyway. Human transcription is priced as labor, full stop.
Is the free or open-source option actually free?
Free tiers come in two flavors. Most providers give you a small monthly allowance: Google Cloud gives 60 free minutes per month, AssemblyAI offers credits when you sign up, and Otter's free plan caps you at a few hundred minutes per month with a 30-minute-per-recording limit. Useful for trying things out. Not a workflow.
The real free option is running an open-source model yourself. OpenAI's Whisper is MIT-licensed; faster-whisper is a community reimplementation that's roughly 4× quicker on the same hardware. We compared them in "Whisper vs faster-Whisper".
The catch: "free" means free of API charges. You still pay for the GPU you run it on (an A10 on the major GPU cloud providers runs around $0.60–$0.80 per hour as of early 2026), the engineer who set it up, and the time it takes to wire diarization, timestamps, and error handling around it. For most small teams, the per-minute API works out cheaper than the engineering.
What does it cost for a typical use case?
A few honest scenarios, all assuming an off-the-shelf cloud API at roughly $0.02 per minute (the middle of the market):
- One hour-long interview for a research paper: about $1.20.
- A weekly two-hour podcast with diarization: roughly $2.40 per episode, or about $125 per year.
- A small team transcribing 100 hours per month of sales calls: around $120 per month.
- A university recording 50 hours of lectures per week: about $60 per week, or roughly $260 per month if you batch everything.
- A 90-minute deposition sent to a human transcriptionist: around $135. The same file through an API: about $1.80, but you should re-read it.
If you want to estimate without a calculator: at $0.02/min, every 50 hours of audio costs about $60. Most consumer needs cost less than a coffee.
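The scenario math is just hours × 60 × rate. Here is a small sketch using the same $0.02/min assumption; the function and variable names are ours, for illustration only.

```python
RATE_USD_PER_MIN = 0.02  # mid-market API rate assumed in the scenarios above
HUMAN_USD_PER_MIN = 1.50  # human transcription rate quoted in this article


def transcription_cost(hours, rate=RATE_USD_PER_MIN):
    """Cost in USD of transcribing `hours` of audio at a per-minute rate."""
    return round(hours * 60 * rate, 2)


interview = transcription_cost(1)              # one hour-long interview
podcast_episode = transcription_cost(2)        # one two-hour episode
podcast_year = transcription_cost(2 * 52)      # 52 weekly episodes
sales_calls_month = transcription_cost(100)    # 100 hours per month
fifty_hours = transcription_cost(50)           # the rule-of-thumb figure
deposition_api = transcription_cost(1.5)       # 90 minutes through an API
deposition_human = transcription_cost(1.5, rate=HUMAN_USD_PER_MIN)

for label, usd in [
    ("interview", interview),
    ("podcast episode", podcast_episode),
    ("podcast, one year", podcast_year),
    ("sales calls, one month", sales_calls_month),
    ("50 hours of audio", fifty_hours),
    ("deposition via API", deposition_api),
    ("deposition via human", deposition_human),
]:
    print(f"{label}: ${usd:.2f}")
```

Swap in your own rate to redo the estimates for a specific provider.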
Is human transcription still worth $1.50 a minute?
Sometimes, yes. The honest test: if a single mistranscribed word would change the outcome — a deposition transcript, a court hearing, a medical record, a journalist's on-the-record quote that's about to go to print — pay for human-graded work, or pay for AI plus a human reviewing it. A clean hybrid (AI first pass, human edit) typically runs around $0.50 to $0.75 per minute and gets you within striking distance of full-human accuracy at half the price.
For everything else — internal meetings, podcast show notes, lecture review, research interviews you'll quote sparingly — modern AI is the right answer.
How does VTS price compare?
VTS is $0.08 per audio minute ($4.80/hour), with no subscription and no monthly fee — you top up a wallet (from $5, and it never expires) and pay only for what you upload. The first 3 clips up to 60 seconds each are free every month. Speaker labels add $0.02/min ($0.10/min total).
Yes, $0.08/min is more than Whisper (~$0.006) or Deepgram (~$0.0043) — but as the table shows, those are developer APIs you'd have to build an app around. VTS is the finished app: nothing to code, no GPU to rent, no diarization or export tooling to wire up. You paste or upload, and the transcript, SRT, and timestamps come back in minutes. The cheaper number is the API call; the $0.08 is the call plus the product you'd otherwise spend days building.
Among the things you can actually use without writing code, VTS is priced to win. Against subscriptions, the difference is the model: no monthly fee. A $20/month plan is dead money any month you transcribe an hour or two; at $0.08/min those two hours cost $9.60 and you owe nothing the months you don't use it. Against human transcription ($1–$1.50/min), VTS is roughly 15× cheaper for the large majority of work where modern AI accuracy is enough — keep the human for the deposition or the on-the-record quote.
So: not the cheapest number on the page, and we don't pretend otherwise. "Why we built VTS without a subscription" explains the trade-off, and "VTS vs other transcription services" is an honest feature-by-feature look.
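A rough way to find where a flat plan starts to pay off is the break-even point: the monthly fee divided by the per-minute rate. The sketch below uses the $20/month and $0.08/min figures from this section, and it assumes a single seat and a plan whose cap covers your usage. Speaker-label surcharges are ignored, and the function names are ours.

```python
def break_even_minutes(subscription_usd_per_month, per_minute_usd):
    """Monthly minutes at which pay-per-minute spend equals the flat fee."""
    if per_minute_usd <= 0:
        raise ValueError("per_minute_usd must be positive")
    return subscription_usd_per_month / per_minute_usd


def cheaper_option(minutes, subscription_usd_per_month=20.0, per_minute_usd=0.08):
    """Return which billing model costs less for a given month's usage."""
    pay_as_you_go = minutes * per_minute_usd
    if pay_as_you_go < subscription_usd_per_month:
        return "per-minute"
    return "subscription"


threshold = break_even_minutes(20.0, 0.08)
print(f"break-even: about {threshold:.0f} minutes per month")
print("2 hours/month:", cheaper_option(120))
print("10 hours/month:", cheaper_option(600))
```

With these inputs the crossover sits at about 250 minutes, a little over four hours a month. Below that, per-minute billing costs less; above it, the flat plan does, as long as its cap really covers what you use.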
Pick a pricing model based on usage, not the sticker
Three quick rules to settle the decision:
- Under 10 hours a month, occasional use. Pay per minute on a tool like VTS or a no-commit API. A monthly subscription is dead money.
- Steady volume above 50 hours a month. A subscription with a generous cap usually wins. Run the math on dollars per minute including any seats you'd pay for.
- Anything where errors are expensive. Pay for human review on top of AI, or use a full human service for the audio you can't afford to get wrong.
The cheapest provider on paper isn't the cheapest in practice if you spend an hour fixing the transcript. Cost per usable transcript is the only number that matters.
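One way to put a number on cost per usable transcript is to add cleanup time to the service fee. The sketch below is illustrative only: the cleanup times and the $30/hour editor rate are made-up inputs, not measurements of any provider, and the function name is ours.

```python
def cost_per_usable_transcript(audio_minutes, rate_usd_per_min,
                               cleanup_minutes, editor_usd_per_hour):
    """Service fee plus the cost of the time spent fixing the transcript."""
    service_fee = audio_minutes * rate_usd_per_min
    cleanup_cost = cleanup_minutes / 60 * editor_usd_per_hour
    return service_fee + cleanup_cost


# Hypothetical comparison for one 60-minute recording.
# Option A: lowest sticker price, but an hour of cleanup (assumed).
option_a = cost_per_usable_transcript(60, 0.0043, cleanup_minutes=60,
                                      editor_usd_per_hour=30.0)
# Option B: higher sticker price, 15 minutes of cleanup (assumed).
option_b = cost_per_usable_transcript(60, 0.02, cleanup_minutes=15,
                                      editor_usd_per_hour=30.0)

print(f"option A: ${option_a:.2f}, option B: ${option_b:.2f}")
```

In this made-up case the cleanup time dwarfs the per-minute fee, which is why the sticker price alone can mislead.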



