Open the transcript, scan for the moment that matters, and there it is — [INAUDIBLE]. Then another one. Three paragraphs later, a sentence that's half brackets. The interview you can't replay. The deposition you can't redo.
Nine times out of ten, this isn't the AI failing. It's the audio that reached the AI. Once you can recognize the situation that produced each [INAUDIBLE], you can fix most of them — and prevent the rest in the next recording.
Here are the seven situations we see most often, ordered by how often they show up in real recordings, with the fix for each.
- [INAUDIBLE] is the model's "I tried and I'm not confident" signal, not silence.
- Background noise and far-field mics cause the majority of cases — both are preventable.
- Cross-talk and clipping cannot be fixed in software after the fact.
- Jargon and accents are fixable with custom vocabularies and the right service.
- If more than 15–20% of segments come back as [INAUDIBLE], re-record or re-export from the master.
What [INAUDIBLE] actually is
Automatic speech recognition (ASR) systems typically produce a confidence score for every chunk of audio they decode. When that confidence drops below a threshold the provider has set, the text gets replaced with a marker: [INAUDIBLE], [inaudible], [unintelligible], or sometimes just an ellipsis (...). It's not the model saying "this is silence." It's the model saying "I tried to decode this, and I'm not confident enough in any answer to commit one to paper."
Whisper, run raw without that guardrail, will sometimes hallucinate confidently in the same spot instead — which is its own problem, covered in why Whisper hallucinates and how to fix it. Most paid services flag rather than guess.
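To make the thresholding concrete, here is a minimal sketch of how a service might turn per-word confidence into markers. The `Word` class, the `render` function, and the 0.5 threshold are all hypothetical; real providers choose their own granularity and cutoffs and rarely publish them.

```python
from dataclasses import dataclass

MARKER = "[INAUDIBLE]"


@dataclass
class Word:
    text: str
    confidence: float  # 0.0 to 1.0, as reported by a hypothetical decoder


def render(words, threshold=0.5):
    """Join words, collapsing each run of low-confidence words into one marker."""
    out = []
    for w in words:
        if w.confidence >= threshold:
            out.append(w.text)
        elif not out or out[-1] != MARKER:
            # First low-confidence word in a run: emit a single marker.
            out.append(MARKER)
    return " ".join(out)


words = [
    Word("the", 0.97),
    Word("witness", 0.91),
    Word("sad", 0.31),
    Word("tat", 0.22),
    Word("yesterday", 0.88),
]
print(render(words))  # the witness [INAUDIBLE] yesterday
```

Setting the threshold to 0.0 in this sketch prints every guess, which is roughly the "hallucinate rather than flag" behavior described above.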
1. Background noise drowning the speech
By far the most common cause. Air conditioning hum, refrigerator compressor, traffic through a cracked window, a laptop fan spinning under the recorder. The model handles a quiet room with a clear voice. It can't reliably separate speech from noise when the noise level is within about 10 dB of the speaker's level.
Fix: get the microphone close to the speaker, within 12 inches and ideally six. If you're stuck with an already-noisy recording, light denoising before transcription helps; aggressive denoising introduces artifacts the model handles worse than the original noise. The preventive checklist is in our piece on audio quality before transcribing.
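A rough way to check the 10 dB margin on your own recording is to compare the level of a stretch where someone is talking against a stretch of room noise alone. The sketch below assumes you have already pulled both excerpts out as lists of samples and that the noise excerpt is not pure digital silence; `rms` and `snr_db` are illustrative names, not part of any transcription tool.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def snr_db(speech_samples, noise_samples):
    """Approximate speech-to-noise ratio in dB from two excerpts of one recording."""
    return 20 * math.log10(rms(speech_samples) / rms(noise_samples))


# Synthetic example: speech at ten times the amplitude of the room tone.
speech = [0.2, -0.2] * 100
room_tone = [0.02, -0.02] * 100
print(round(snr_db(speech, room_tone), 1))  # 20.0
```

The speech excerpt still contains the room noise, so this slightly overstates the true ratio. It is good enough to tell a 20 dB recording from a 5 dB one.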
2. Microphone too far from the speaker
A phone on the conference table six feet from the person talking is recording the room, not the person. The reverb tail of the room blurs every consonant. The model loses words at the start and end of phrases — exactly the words that change a sentence's meaning.
Fix: a lapel mic on each speaker, or a USB mic placed within arm's reach of the loudest voice. For interview situations specifically, the trade-offs are in lavalier vs handheld for interview transcripts.
3. Overlapping speakers (cross-talk)
When two people talk at once, a single microphone doesn't capture two separable signals. It captures one summed waveform that no model can cleanly pull apart without per-speaker channel information. The model transcribes one voice, transcribes the other, or commits to neither. The last option produces [INAUDIBLE].
Fix on the recording side: set a one-at-a-time rule for interviews; for podcasts and remote meetings, record each participant on their own channel so the model sees clean audio per voice. Fix on the cleanup side: there isn't one. You'll be hand-transcribing the overlapped chunks.
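If your recorder already wrote each participant to a separate channel of one file, splitting that file into per-speaker mono files is straightforward. This is a minimal sketch using only the Python standard library; it assumes an uncompressed 16-bit PCM WAV on a little-endian machine, and `split_channels` and `split_wav` are hypothetical helper names.

```python
import os
import wave
from array import array


def split_channels(frames, n_channels):
    """De-interleave samples: [L0, R0, L1, R1, ...] -> [[L0, L1, ...], [R0, R1, ...]]."""
    return [frames[c::n_channels] for c in range(n_channels)]


def split_wav(path):
    """Write one mono WAV per channel of a 16-bit PCM file and return the new paths."""
    with wave.open(path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        n_channels = src.getnchannels()
        rate = src.getframerate()
        # array("h") uses native byte order; WAV data is little-endian.
        samples = array("h", src.readframes(src.getnframes()))

    stem = os.path.splitext(path)[0]
    paths = []
    for i, channel in enumerate(split_channels(samples, n_channels)):
        out_path = f"{stem}_ch{i + 1}.wav"
        with wave.open(out_path, "wb") as dst:
            dst.setnchannels(1)
            dst.setsampwidth(2)
            dst.setframerate(rate)
            dst.writeframes(channel.tobytes())
        paths.append(out_path)
    return paths
```

Each output file can then be transcribed on its own. This only helps when the speakers were actually recorded on separate channels; splitting a stereo file that holds the same room mix on both sides gains nothing.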
4. Accented or non-native speech
Models trained mostly on US and UK English struggle on strong regional accents and code-switching between languages. Confidence drops, [INAUDIBLE] climbs. This isn't a bug — it's a training-data distribution problem. Mozilla's Common Voice corpus, which feeds many open models, is still heavily weighted toward a handful of accent groups.
Fix: pick a service that's been benchmarked on your speakers' accent profile, and run a five-minute test before you commit to transcribing the whole file. We dig into where AI transcription actually breaks on accents in how accurate AI transcription is for accented English.
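One way to score the five-minute test, assuming you hand-transcribe the same sample as a reference, is word error rate: the number of word substitutions, insertions, and deletions divided by the number of reference words. The function below is a plain sketch of that calculation, not any service's official scoring. It does no punctuation normalization and expects a non-empty reference.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # prev[j] = edits to turn the reference words seen so far into the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (r != h)
            curr.append(min(substitution, prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1] / len(ref)


reference = "the patient was given warfarin daily"
hypothesis = "the patient was given [INAUDIBLE] daily"
print(round(word_error_rate(reference, hypothesis), 2))  # 0.17
```

Running the same sample through two or three candidate services and comparing the scores gives a like-for-like basis for choosing one.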
5. Industry jargon the model has never seen
Medical, legal, scientific, niche product names — if the model has low prior probability for the correct word, it would rather output [INAUDIBLE] than commit to a guess. This is especially true of careful, paid professional services. Whisper, by contrast, often hallucinates a phonetically similar everyday word in the same spot.
Fix: feed a custom vocabulary list. Most paid services accept one, and you can copy ours — a free custom vocabulary template that works across AssemblyAI, AWS Transcribe, and Deepgram.
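Whatever service you upload to, it helps to clean the term list first. The sketch below trims whitespace, drops blanks, and removes case-insensitive duplicates while keeping your first spelling. `build_vocabulary` is a hypothetical helper, and each service has its own upload format and character rules, so check its documentation before submitting the result.

```python
def build_vocabulary(raw_terms):
    """Trim, drop blanks, and dedupe case-insensitively, keeping first-seen spelling and order."""
    seen = set()
    terms = []
    for term in raw_terms:
        cleaned = " ".join(term.split())  # collapse runs of whitespace
        key = cleaned.casefold()
        if cleaned and key not in seen:
            seen.add(key)
            terms.append(cleaned)
    return terms


raw = ["  Warfarin ", "warfarin", "", "voir  dire", "Deepgram"]
print(build_vocabulary(raw))  # ['Warfarin', 'voir dire', 'Deepgram']
```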
6. Compressed or low-bitrate audio
WhatsApp voice notes, Discord voice channels recorded at 32 kbps, Zoom recordings on the lowest quality setting — compression strips frequency content the model uses to distinguish similar phonemes. The result is words the model can't decode and replaces with [INAUDIBLE]. You can often hear the audio "as a human" because your brain is filling in the missing detail. The model isn't.
Fix: use the highest-quality export available on the source platform. Re-recording or re-exporting from a higher-fidelity master always beats trying to upscale low-bitrate audio.
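If you are unsure how heavily a file was compressed, its average bitrate can be estimated from file size and duration. This is simple arithmetic rather than a codec inspection, and it slightly overstates the audio bitrate because container overhead is counted too; `average_bitrate_kbps` is an illustrative name.

```python
def average_bitrate_kbps(size_bytes, duration_seconds):
    """Average bitrate in kilobits per second (1 kbps = 1000 bits per second)."""
    return size_bytes * 8 / duration_seconds / 1000


# A 10-minute voice note that takes 2,400,000 bytes on disk:
print(average_bitrate_kbps(2_400_000, 600))  # 32.0
```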
7. Clipping (audio recorded too hot)
When the input level was too loud at record time, the waveform peaks flat at the top — the speech has turned into something close to a square wave. The model sees distortion, not speech, and refuses to commit.
Clipping cannot be reversed in software. Whatever gain staging happened at record time is baked into the file. If you're being handed clipped audio, the [INAUDIBLE] markers on those sections are permanent.
Fix on the recording side: if you're recording yourself, set the input level so peaks land between -12 and -6 dBFS and never go higher. If the clipping is happening on a single speaker, which is common in interviews where one voice is louder, move that speaker's mic farther away or lower their channel's gain.
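Both checks in this section can be approximated from raw samples: the peak level in dBFS, and the share of samples pinned at full scale. The sketch assumes 16-bit PCM samples as signed integers. The 0.999 margin used to count a sample as clipped is an arbitrary illustrative choice, and `peak_dbfs` and `clipped_fraction` are hypothetical helper names.

```python
import math

FULL_SCALE = 32768  # magnitude of the most negative 16-bit PCM sample (assumed format)


def peak_dbfs(samples):
    """Peak level relative to full scale; 0 dBFS is the loudest representable sample."""
    peak = max(abs(s) for s in samples)
    return 20 * math.log10(peak / FULL_SCALE) if peak else float("-inf")


def clipped_fraction(samples, margin=0.999):
    """Share of samples sitting at, or within a hair of, full scale."""
    limit = FULL_SCALE * margin
    return sum(abs(s) >= limit for s in samples) / len(samples)


print(round(peak_dbfs([0, 8192, -4096]), 1))         # -12.0
print(clipped_fraction([32767, -32768, 100, 0]))     # 0.5
```

A healthy recording shows a peak in the -12 to -6 dBFS range and a clipped fraction at or near zero. Long runs of samples at full scale are the flat-topped peaks described above.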
A workflow for salvaging a transcript
When a finished transcript comes back with [INAUDIBLE] sprinkled through it, work in this order:
1. Open three of the affected sections in your audio editor and listen. You're diagnosing which of the seven causes applies to each, because they have different fixes.
2. If a higher-quality master exists upstream (the original Zoom recording before it was re-exported, a multi-channel podcast stem, the original interview recorder file), re-export and re-transcribe from the master.
3. Apply a custom vocabulary list if jargon was the issue, then re-transcribe the affected sections.
4. For overlap, transcribe the per-speaker channels separately if you have them; otherwise hand-transcribe the overlapped chunks.
5. For everything left, manual cleanup, and if you're paying per minute, do this only on the minutes you actually need, not the whole transcript.
A better second pass is often one tool change away. Try re-transcribing a clean copy of the master file with a different service before committing to manual rescue.
When to stop fixing and re-record
If more than 15–20% of segments come back as [INAUDIBLE], the cost of salvaging the transcript is usually higher than the cost of re-recording — and the salvaged version will still be unreliable for court, for a published research paper, or for a customer interview synthesis you'll quote in a deck.
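The threshold is easy to check mechanically once you have the transcript as a list of segment strings. The sketch below counts the segments that contain a marker and compares the share against the lower end of the range. The marker pattern and the `inaudible_ratio` and `should_rerecord` names are assumptions, so adjust the pattern to whatever your service emits.

```python
import re

# Matches [INAUDIBLE], [inaudible], [unintelligible], and case variants.
MARKER = re.compile(r"\[(inaudible|unintelligible)\]", re.IGNORECASE)


def inaudible_ratio(segments):
    """Fraction of transcript segments that contain at least one marker."""
    flagged = sum(1 for text in segments if MARKER.search(text))
    return flagged / len(segments)


def should_rerecord(segments, limit=0.15):
    """True when the flagged share exceeds the chosen limit."""
    return inaudible_ratio(segments) > limit


segments = ["We met on [INAUDIBLE] street.", "[inaudible]"] + ["Clear sentence."] * 8
print(inaudible_ratio(segments))   # 0.2
print(should_rerecord(segments))   # True
```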
Re-recording isn't always possible. A deposition is one-shot. So is an oral history. For those, accept the limit, mark the [INAUDIBLE] sections in the final document so the reader knows they exist, and don't paraphrase what you couldn't actually hear. That's the difference between an honest transcript and a fabricated one.
Sources
- OpenAI Whisper paper: Robust Speech Recognition via Large-Scale Weak Supervision
- OpenAI Whisper README and model card: github.com/openai/whisper
- AWS Transcribe custom vocabulary documentation: docs.aws.amazon.com
- Mozilla Common Voice dataset overview: commonvoice.mozilla.org/en/datasets



