Open a long Whisper transcript and search for "thanks for watching." If you find one floating in the middle of a podcast where nobody actually said it, you've already met the failure mode voice activity detection exists to fix.

VAD is the step in a transcription pipeline that decides which slices of an audio file contain speech and which contain silence, breathing, music, or background noise. The model only sees the speech parts. Skip VAD on a 90-minute recording and you'll spend more, wait longer, and get a transcript with sentences hallucinated into the quiet stretches.

What does voice activity detection actually do?

VAD takes an audio file and returns a list of (start, end) timestamps where it heard speech. Everything outside those windows is dropped. Most modern systems run a small neural classifier over short audio frames, 10 to 30 milliseconds each, labeling every frame "speech" or "not speech," then merging adjacent speech frames into segments.

The output is the only thing your transcription model ever sees. The silence is thrown out, never billed, never decoded.

Why does long-form audio need VAD?

Three concrete reasons.

Hallucination. Whisper is trained on speech, not silence, and it has a learned habit of filling long quiet stretches with text it remembers from training: YouTube outros, "subscribe and like," random Wikipedia-flavored sentences. The full pattern, with examples, is in our writeup on why Whisper hallucinations happen and how to fix them. VAD is the single most effective mitigation.

Cost. Most commercial APIs bill on the duration of audio you submit. A 60-minute interview with five-minute pauses between questions might be 45 minutes of actual speech. That's a 25% bill cut for free, before you optimize anything else.

Downstream quality. Diarization and word-level alignment both behave better when they're not trying to model who's "speaking" during a dishwasher running in the next room. Speaker labels get cleaner. Word timestamps land tighter, which matters if you're publishing captions or using forced alignment to pin words to timecodes.

How does VAD decide what's speech?

Two families.

The older approach is signal-processing: measure short-term energy, zero-crossing rate, and spectral flatness in each frame and apply thresholds. WebRTC's VAD is the canonical example. It's tiny, fast, and runs in real-time on a Raspberry Pi. It also gets confused by music, fan noise, and quiet talkers.

The current default is neural: a small CNN or RNN trained on labeled speech-versus-non-speech data, scoring each frame. Silero VAD ships as a 1.8 MB ONNX model, runs at hundreds of times real-time on a CPU, and handles most of the cases that trip up signal-based detectors. Pyannote ships a slightly larger PyTorch model with tighter integration into its diarization pipeline.

Which VAD library should you use?

VAD Type Best for Trade-off
Silero VAD Neural ONNX Most pipelines. Fast, accurate, MIT-licensed Defaults can clip short utterances
Pyannote VAD Neural PyTorch When you're already running Pyannote diarization Heavier dependency stack
WebRTC VAD Signal-based Real-time on constrained hardware Less accurate on noisy or music-heavy audio
Whisper's no_speech_threshold Per-chunk gate Not a VAD. Don't rely on it Runs after inference, after you've paid the compute

Whisper's no_speech_threshold confuses a lot of new pipelines. It is not voice activity detection. It's a probability check the model runs after it has already attempted to decode a 30-second chunk; if the "no speech" probability is high enough, the chunk gets discarded. You still paid the inference cost. Real VAD runs before the transcription model and trims the input.

If you're starting from scratch, use Silero. If you're already pulling in Pyannote for diarization, use Pyannote's VAD and skip the duplication.

What are the common VAD mistakes?

Patterns we see again and again in pipeline reviews:

Speech padding set too tight. VAD outputs raw speech boundaries. Feed those directly to Whisper and you'll often clip the first 50 to 100 milliseconds of each utterance, chopping off plosive consonants like /p/, /t/, /k/. Pad each segment by 100 to 200 ms on each side before passing it to the model.

Min-silence too short. If you split aggressively at every 250 ms gap, you fragment continuous sentences across multiple Whisper calls. The model loses context and starts repeating itself at the boundaries. Use a min-silence of 500 ms or more for conversational speech.

Aggressive mode on noisy audio. WebRTC's "aggressive" mode (level 3) is tuned to reject more. Useful in low-noise calls, terrible in a café recording where it'll drop quiet speakers entirely. Test on a sample of your real audio before locking in a global setting.

Re-chunking after VAD. If you segment with VAD first and then re-chunk to 30-second windows for Whisper, you can split a single speech segment across a window boundary mid-word. Either chunk first then VAD inside each chunk, or pad-and-merge VAD segments up to 30 seconds before passing them to the model.

When can you skip VAD entirely?

Short files under 30 seconds of continuous speech don't really need it. Whisper will handle them fine. Pre-trimmed clips out of a video editor that already cut the silence are fine. Voicemails are fine.

For anything else, podcasts, interviews, depositions, meetings, lectures, long-form recordings, VAD pays for itself in cost and accuracy in the first minute of audio.

This is the kind of plumbing most people building a transcription tool reinvent at least twice. If you'd rather skip that and just transcribe a recording with the silence handled for you, that's the version we ship.

Try it now — it's free
Transcribe your video with VTS

Paste any public link or upload a file and get a clean transcript in minutes. First 3 clips every month are on us — no card required.

Start transcribing No subscription · 8¢/min after free clips

The bottom line

VAD is not glamorous infrastructure. Nobody ever picked a transcription tool because of its silence detection. But the gap between a pipeline that does a proper VAD pass and one that doesn't is the gap between a clean 100-page deposition transcript and one with three phantom "thanks for watching" lines that the lawyer has to explain to the judge.

Pick a VAD. Pad your segments. Test on your real audio. Then transcribe.

Sources