The short answer: AI transcription on accented English runs 10–20 percentage points worse than on standard American English on average, and the gap widens — sometimes to more than 2× the error rate — for Black American, West African, South Asian, and Caribbean accents. The headline number from the most cited study, by a Stanford team in 2020: 19% word error rate for white speakers vs. 35% for Black speakers across the five biggest commercial ASR systems.

That's peer-reviewed data, not a vendor's marketing page. If you transcribe interviews, lectures, or recordings where most speakers don't speak standard American or British English, this matters for both fidelity and fairness, and it's the thing the polished demo reels won't tell you.

So what's the actual accuracy gap?

Word error rate (WER) is the standard measure: substitutions plus deletions plus insertions, divided by the number of reference words. We have a whole post on what WER means and what's good; here we're using it to compare speakers, not systems.
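To make the definition concrete, here is a minimal sketch in Python that computes WER with a word-level edit distance. The function name `word_error_rate` and the plain lowercase-and-split tokenization are assumptions for illustration; real evaluations usually also normalize punctuation and number formats before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution in four reference words.
print(word_error_rate("we met on tuesday", "we met on thursday"))  # 0.25
```

Because insertions count as errors but only reference words are in the denominator, WER can exceed 100% on badly garbled audio.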

The honest range across published evaluations:

A 5-point gap sounds small until you're proofreading. A WER of 25% is roughly one error every four words; at that point you're not editing the transcript, you're rewriting it.

What did the Stanford study actually find?

Koenecke et al. (2020), published in PNAS, tested five commercial systems on recorded interviews with Black and white American English speakers matched for content. Average WER was 35% for Black speakers vs. 19% for white speakers. Roughly 1 in 5 audio snippets from Black speakers came back with WER above 50%, vs. roughly 1 in 50 from white speakers.

That study is five years old and pre-dates Whisper. It still gets cited because nobody has shown the gap fully closed, and because the methodology was clean enough that vendors don't argue with the numbers — they just stopped publishing comparable breakdowns.

Has Whisper closed the gap?

Partly. The original Whisper (OpenAI, 2022) was trained on roughly 680,000 hours of weakly supervised multilingual audio, and large-v3 (late 2023) on several times more: about 1 million hours of weakly labeled audio plus about 4 million hours of pseudo-labeled audio, per its model card. That scale helped. Its WER on standard American English is competitive with the best closed-source systems, and it handles non-native English much better than the 2020-era models the Stanford team tested.

But "better" isn't "equal." On Common Voice 15 English, Whisper large-v3 reports about 9% WER overall; segmenting by accent tag shows ~20% WER on Indian English and similar for several African Englishes. The gap shrank. It didn't disappear.
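Segmenting by accent tag is a simple group-by, and it is worth doing on your own recordings. The sketch below assumes you already have per-clip error counts and reference-word counts plus an accent label. The field names (`accent`, `errors`, `ref_words`) and the sample numbers are made up for illustration and are not Common Voice results.

```python
from collections import defaultdict


def wer_by_accent(clips):
    """Pool errors and reference words per accent, then divide.

    Pooling (rather than averaging per-clip WER) keeps short clips
    from having an outsized effect on the result.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for clip in clips:
        errors[clip["accent"]] += clip["errors"]
        words[clip["accent"]] += clip["ref_words"]
    # Skip accents with no reference words to avoid dividing by zero.
    return {a: errors[a] / words[a] for a in words if words[a] > 0}


# Hypothetical clips, not real benchmark data.
clips = [
    {"accent": "us", "errors": 4, "ref_words": 50},
    {"accent": "us", "errors": 6, "ref_words": 50},
    {"accent": "india", "errors": 9, "ref_words": 40},
    {"accent": "india", "errors": 11, "ref_words": 60},
]
for accent, wer in wer_by_accent(clips).items():
    print(f"{accent}: {wer:.0%}")
```

An overall figure computed over all four clips would sit between the two groups, which is how a single headline WER can hide the gap.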

Self-hosting Whisper doesn't fix this either. Whisper hallucinations happen more often on hard audio, and accents are harder audio for a model trained mostly on standard American speech.

Which accents struggle the most?

Roughly ordered worst-to-best on current public benchmarks:

The ranking holds across most major systems, even if the absolute numbers differ. Newer models compress the spread but don't reorder it.

Why does this happen?

Training data, mostly. The big ASR systems learn from what's easy to scrape: American podcasts, US news broadcasts, US-recorded YouTube, public Western audiobooks. The pronunciation patterns the model picks up most strongly are the ones it heard most.

Two specific failure modes show up:

The model isn't biased in a dramatic way. It's biased in the dull statistical way: more data on accent X means lower WER on accent X.

What actually fixes it in practice?

Three things, in order of effort and payoff:

  1. Improve the audio. A close mic on the speaker, low background noise, and lossless capture do more for accented-speech accuracy than swapping models. See our post on audio quality before transcribing.
  2. Add a vocabulary or hotword list. Most modern tools accept domain-specific terms — technical jargon, person names, place names. This is the single biggest free fix and almost nobody uses it.
  3. Pick a model that's been multilingually trained. Whisper large-v3 and the newer commercial systems all do better on non-American English than older releases. If you transcribe a recording and the result is poor, try a different system before assuming AI can't handle it.
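As one way to apply step 2, the open-source `openai-whisper` package accepts an `initial_prompt` string that nudges the decoder toward the words it contains. The sketch below is an assumption about how you might wire that up, not a recommended recipe: the helper names, the `max_chars` cap, and the example terms are all hypothetical, and commercial tools expose their own vocabulary settings instead.

```python
def build_vocabulary_prompt(terms, max_chars=800):
    """Join unique terms into one comma-separated prompt string.

    max_chars is an arbitrary illustrative cap; Whisper only keeps a
    limited number of prompt tokens, so very long lists get cut off.
    """
    seen = set()
    kept = []
    length = 0
    for term in terms:
        term = term.strip()
        key = term.lower()
        if not term or key in seen:
            continue
        extra = len(term) + (2 if kept else 0)  # account for the ", " separator
        if length + extra > max_chars:
            break
        seen.add(key)
        kept.append(term)
        length += extra
    return ", ".join(kept)


def transcribe_with_vocabulary(path, terms, model_name="large-v3"):
    """Bias the decoder toward the given terms via Whisper's initial_prompt."""
    import whisper  # pip install openai-whisper; imported lazily on purpose

    model = whisper.load_model(model_name)
    return model.transcribe(path, initial_prompt=build_vocabulary_prompt(terms))


print(build_vocabulary_prompt(["Oluwaseun", "Chennai", "chennai", " Koenecke "]))
# Oluwaseun, Chennai, Koenecke
```

The prompt is a hint, not a guarantee, so names that matter still deserve a check against the audio.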

If you're transcribing fully multilingual content rather than accented English, we have a separate post on transcribing multilingual content.

When should you skip AI and pay a human?

If the recording is any of:

…and the speaker has a strong non-standard accent — pay a human. AI WER above 20% means you're spending the savings, and then some, on cleanup time. Below 15%, AI plus a careful read is usually faster and cheaper.
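Those two thresholds can be written down as a small rule of thumb. The sketch below assumes you have a WER estimate for the recording, for example from hand-correcting a short sample; the function name and the "gray zone" label for the 15–20% range are assumptions, since the cutoffs above only speak to the two ends.

```python
def recommend_workflow(estimated_wer: float) -> str:
    """Map an estimated WER (0.25 means 25%) to a suggested workflow."""
    if estimated_wer < 0:
        raise ValueError("WER cannot be negative")
    if estimated_wer > 0.20:
        return "human transcription"
    if estimated_wer < 0.15:
        return "AI transcript plus a careful read"
    # Between the two cutoffs: not covered by the rule, so flag it.
    return "gray zone: decide based on stakes and deadline"


for wer in (0.09, 0.17, 0.35):
    print(f"{wer:.0%}: {recommend_workflow(wer)}")
```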

The pragmatic middle ground for most users: start with a good AI transcript, do a single pass against the audio, and use timestamps to jump to the parts that read weird. That workflow handles 90% of accented-English cases without a human transcriptionist's price tag.
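If your tool exposes per-segment confidence, you can build the "jump to the weird parts" list automatically. The sketch below assumes Whisper-style segments, which carry `start`, `end`, `text`, `avg_logprob`, and `compression_ratio` fields. The cutoffs mirror Whisper's own default decoding thresholds but are only a starting point, and the sample segments are invented.

```python
def segments_to_review(segments, min_avg_logprob=-1.0, max_compression_ratio=2.4):
    """Return (start, end, text) for segments that look unreliable."""
    flagged = []
    for seg in segments:
        low_confidence = seg["avg_logprob"] < min_avg_logprob
        repetitive = seg["compression_ratio"] > max_compression_ratio
        if low_confidence or repetitive:
            flagged.append((seg["start"], seg["end"], seg["text"].strip()))
    return flagged


def format_timestamp(seconds):
    """Format seconds as MM:SS for jumping around in a player."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Hypothetical segments shaped like Whisper's output.
segments = [
    {"start": 0.0, "end": 4.2, "text": " Thanks for joining us today.",
     "avg_logprob": -0.21, "compression_ratio": 1.3},
    {"start": 64.0, "end": 69.5, "text": " the the the the the the",
     "avg_logprob": -0.40, "compression_ratio": 3.1},
    {"start": 125.0, "end": 131.0, "text": " we moved to a bridge in nineteen",
     "avg_logprob": -1.35, "compression_ratio": 1.2},
]
for start, end, text in segments_to_review(segments):
    print(f"{format_timestamp(start)}-{format_timestamp(end)}  {text}")
```

A high compression ratio usually means repeated text, a common sign of hallucination, while a low average log-probability means the model was unsure of its own words.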


Sources