Diarization error rate (DER) is the share of speech time in a recording where the speaker labels are wrong, expressed as a percentage of the total reference speech time. It is to speaker diarization what word error rate is to transcription: the single number a system gets judged on.

It's a different metric from WER, and the two can move independently. A transcript can be word-perfect and still have a 30% DER, which means errors equal to nearly a third of the speech time: speech that was missed, falsely detected, or attributed to the wrong person. If you've ever read a Zoom transcript where Speaker 1 keeps "saying" things Speaker 2 actually said, you've lived inside high DER.

How DER is actually calculated

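The score has three buckets, each measured as a duration against a ground-truth annotation of who spoke when:

Missed speech: someone is talking in the reference, but the system assigned no speaker.
False alarm: the system assigned a speaker where the reference has no speech.
Speaker error: the system detected speech but attributed it to the wrong speaker.

DER = (missed + false alarm + speaker error) / total reference speech time.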
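To make the arithmetic concrete, here is a minimal sketch of the formula above. The function name and the example durations are hypothetical, chosen only for illustration; they do not come from any benchmark.

```python
def diarization_error_rate(missed: float, false_alarm: float,
                           speaker_error: float, total_speech: float) -> float:
    """Return DER as a fraction of total reference speech time.

    All four arguments are durations in seconds.
    """
    if total_speech <= 0:
        raise ValueError("total reference speech time must be positive")
    return (missed + false_alarm + speaker_error) / total_speech


# Hypothetical meeting with 540 s of reference speech.
der = diarization_error_rate(missed=27.0, false_alarm=13.5,
                             speaker_error=40.5, total_speech=540.0)
print(f"DER = {der:.1%}")  # DER = 15.0%
```

Note that the denominator contains only reference speech, while false alarms come from outside it, so a very noisy system can score above 100%.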

The NIST Rich Transcription evaluations, where this metric became standard, also apply a small forgiveness collar around segment boundaries (usually 250 ms) so the score isn't dominated by millisecond-level boundary disagreements.
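As a rough illustration of the collar, the sketch below computes the time regions that would be left unscored. It assumes the collar is applied on each side of every reference segment boundary, which is the common convention. The function name and the segment format are assumptions made for this example, not the interface of any scoring tool.

```python
def collar_regions(reference, collar=0.25):
    """Return merged (start, end) intervals excluded from scoring.

    reference: list of (start, end, speaker) tuples in seconds.
    collar: seconds ignored on each side of every reference boundary.
    """
    raw = []
    for start, end, _speaker in reference:
        for boundary in (start, end):
            raw.append((max(0.0, boundary - collar), boundary + collar))
    raw.sort()

    # Merge intervals that touch or overlap so no time is counted twice.
    merged = []
    for start, end in raw:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical two-turn reference annotation.
reference = [(0.0, 4.0, "Alice"), (4.25, 9.0, "Bob")]
print(collar_regions(reference))
# [(0.0, 0.25), (3.75, 4.5), (8.75, 9.25)]
```

Any error falling inside those intervals is ignored, which is why the same system output can get noticeably different scores with and without a collar.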

Two consequences follow from how it's built. First, overlap matters: if two people talk at once, the reference says both spoke, and a system that picks only one is penalized. Second, the labels themselves don't need to match. DER finds the best mapping between your "Speaker 1" and the reference "Alice". What counts is whether audio from the same person consistently gets the same label.
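The label mapping can be sketched as follows. This version tries every possible pairing by brute force, which keeps it dependency-free and is fine for a handful of speakers; scoring tools typically use an assignment algorithm such as the Hungarian method instead. The function names and the segment data are hypothetical.

```python
from itertools import permutations


def overlap(a_start, a_end, b_start, b_end):
    """Seconds shared by two intervals (0.0 if they do not intersect)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def best_mapping(reference, hypothesis):
    """Map hypothesis labels to reference labels to maximize matched time.

    Both inputs are lists of (start, end, label) tuples in seconds.
    Returns (mapping, matched_seconds).
    """
    ref_labels = sorted({label for _, _, label in reference})
    hyp_labels = sorted({label for _, _, label in hypothesis})

    # Seconds of shared speech for every (hypothesis, reference) label pair.
    shared = {(h, r): 0.0 for h in hyp_labels for r in ref_labels}
    for h_start, h_end, h in hypothesis:
        for r_start, r_end, r in reference:
            shared[(h, r)] += overlap(h_start, h_end, r_start, r_end)

    # Pad the shorter side with None so extra labels can stay unmatched.
    size = max(len(ref_labels), len(hyp_labels))
    padded_refs = ref_labels + [None] * (size - len(ref_labels))
    padded_hyps = hyp_labels + [None] * (size - len(hyp_labels))

    best, best_time = {}, -1.0
    for perm in permutations(padded_refs):
        pairs = [(h, r) for h, r in zip(padded_hyps, perm)
                 if h is not None and r is not None]
        matched = sum(shared[pair] for pair in pairs)
        if matched > best_time:
            best, best_time = dict(pairs), matched
    return best, best_time


# Hypothetical 12-second exchange; the system's labels are arbitrary names.
reference = [(0.0, 5.0, "Alice"), (5.0, 8.0, "Bob"), (8.0, 12.0, "Alice")]
hypothesis = [(0.0, 5.5, "Speaker 2"), (5.5, 8.0, "Speaker 1"),
              (8.0, 12.0, "Speaker 2")]

mapping, matched = best_mapping(reference, hypothesis)
print(mapping)  # {'Speaker 1': 'Bob', 'Speaker 2': 'Alice'}
print(matched)  # 11.5
```

In this example the only error is the half second of Bob's turn that was given to "Speaker 2": 0.5 s of speaker error out of 12 s of reference speech, or a DER of about 4.2%.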

How is DER different from WER?

WER is about what was said. DER is about who said it. You can be perfect on one and terrible on the other.

Metric    Measures               Hurt by
WER       Word recognition       Noise, accents, jargon, model limits
DER       Speaker attribution    Overlap, similar voices, short turns, far microphones

A clean dictation from one person has near-zero speaker error by default: with one speaker, no confusion is possible. A four-way Zoom with people interrupting can have a low WER and a brutal DER: every word transcribed, half of them assigned to the wrong head.

For the underlying mechanic, the plain-English explainer on speaker diarization covers how a system tries to separate voices in the first place. DER is just how we score that separation.

What counts as a good DER score?

There's no universal threshold, only the one for the kind of audio you have. Some honest reference points from published benchmarks:

If a vendor quotes "95% diarization accuracy," ask what corpus and what collar. Those numbers are not directly comparable to published DER unless they say so. For the broader picture of how transcription quality is reported, see what to expect from accuracy.

What drives DER up

In practice, four things explain most of the score:

How do you lower DER on your own recordings?

You won't change the model, but you can change the input. The cheapest wins:

When you upload audio to transcribe a meeting, the diarization quality you get out is bounded by the audio quality you put in.

Should you compute DER on your own files?

Usually no. DER requires a hand-labeled reference, which is more work than just fixing the speaker labels in your transcript by hand. Compute it if you're evaluating a vendor or training a model. Otherwise, the practical equivalent is a quick scan: pick a few minutes, check whether the labels match reality, fix the obvious mis-assignments, and move on.
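If you want a rough number out of that quick scan, a sketch like the one below tallies how much of the checked speech needed a label fix. This is not DER: it ignores missed speech, false alarms, and any collar, and it assumes you have already decided which system label corresponds to which real person. The function name and the sample data are hypothetical.

```python
def spot_check(segments):
    """Fraction of checked speech time whose speaker label had to be corrected.

    segments: list of (duration_seconds, system_label, corrected_label).
    """
    total = sum(duration for duration, _, _ in segments)
    if total == 0:
        return 0.0
    wrong = sum(duration for duration, system, corrected in segments
                if system != corrected)
    return wrong / total


# Hypothetical notes from reviewing 40 seconds of a transcript.
checked = [
    (12.0, "Speaker 1", "Speaker 1"),
    (3.0, "Speaker 1", "Speaker 2"),
    (20.0, "Speaker 2", "Speaker 2"),
    (5.0, "Speaker 2", "Speaker 1"),
]
print(f"{spot_check(checked):.0%} of checked speech was mislabeled")
# 20% of checked speech was mislabeled
```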

The metric is most useful as a vocabulary. Once you know that "who said what" has its own number, you stop blaming the transcriber for a problem that lives upstream in your microphone setup.

