Accuracy expectations & service standards
What we promise on transcript quality, how we measure
it, what we do automatically when a result comes back below its
expected band, and what we'll do if something genuinely goes wrong on
our end.
Last updated 2026-05-27 · Applies to all transcriptions
processed by VTS
1. Accuracy bands
These are ranges we observe across real audio — not contractual
guarantees. Audio quality, accents, overlapping speakers, and
music all limit the accuracy a transcript can reach. We auto-retry with
stronger settings to try to push every transcript into its expected band,
but no model produces a reliable transcript on heavy music or fully
degraded source audio.
Speech audio
| Audio type | Expected accuracy |
| --- | --- |
| Clear audio, one speaker, minimal noise | 90% to 98% |
| Clear meeting audio, multiple speakers | 80% to 95% |
| Background noise, accents, overlapping speakers | 60% to 85% |
| Poor audio, music, low volume, heavy noise | No guaranteed accuracy |
Songs & music
| Song type | Expected accuracy |
| --- | --- |
| Clear vocals, slow song, minimal effects | 70% to 92% |
| Pop song with music and effects | 50% to 80% |
| Rap, fast vocals, ad-libs, overlapping voices | 40% to 75% |
| Heavy music, distorted vocals, live concert audio | No guaranteed accuracy |
No guaranteed accuracy means we'll
still transcribe — and run the full escalation ladder below — but the
result reflects what the source audio actually contains. For heavy
music, distorted live-concert audio, or recordings dominated by
background noise, no current model produces a reliable transcript at
any price point.
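As a rough illustration, the two tables above can be held as a small lookup so that a result can be checked against the floor of its band. The key names and the `below_band` helper are hypothetical; they are not taken from quality.py.

```python
from typing import Dict, Optional, Tuple

# Expected accuracy bands in percent, copied from the tables above.
# None stands for "No guaranteed accuracy". Key names are hypothetical.
Band = Optional[Tuple[int, int]]

SPEECH_BANDS: Dict[str, Band] = {
    "clear_single_speaker": (90, 98),
    "clear_meeting": (80, 95),
    "noisy_or_overlapping": (60, 85),
    "poor_audio": None,
}

SONG_BANDS: Dict[str, Band] = {
    "clear_slow_vocals": (70, 92),
    "pop_with_effects": (50, 80),
    "rap_fast_vocals": (40, 75),
    "heavy_or_live": None,
}


def below_band(bands: Dict[str, Band], audio_type: str, accuracy_pct: float) -> bool:
    """Return True when a result sits under the lower edge of its band."""
    band = bands[audio_type]
    if band is None:
        # "No guaranteed accuracy": there is no floor to fall below.
        return False
    return accuracy_pct < band[0]
```

For example, `below_band(SPEECH_BANDS, "clear_meeting", 72.0)` is true because 72 is under the 80% lower edge, while any value for `"poor_audio"` returns false.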
2. How accuracy is measured
The percentage shown on each transcript is an estimated
accuracy, not a measured word-error-rate. We compute it from
Whisper's per-segment avg_logprob values combined with
the detected-language probability — both signals that strongly
correlate with actual transcription quality on real audio.
It's directional:
- 90%+ — model is highly confident in every
segment. In practice, near-verbatim accuracy on clean speech.
- 70-90% — usable as-is for notes, captions, or
quotation after a light proofread. Most vocals-over-music audio lands here.
- 50-70% — directionally correct but needs human
cleanup before publication. Common for noisy field recordings.
- Below 50% — best-effort output. Read with a
grain of salt; consider a higher-quality recording.
The confidence number drives the auto-escalation ladder below: if
a tier returns below its floor, the next tier picks up automatically.
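The exact formula is not given on this page, so the following is only a sketch of one plausible way to combine the two signals. It assumes a duration-weighted mean of `exp(avg_logprob)` per segment, scaled by the detected-language probability; the function names and the weighting are assumptions, not the implementation in quality.py.

```python
import math
from typing import Sequence, Tuple


def estimate_accuracy(
    segments: Sequence[Tuple[float, float]],
    language_probability: float,
) -> float:
    """Estimate confidence in percent.

    segments: (duration_seconds, avg_logprob) pairs, one per Whisper segment.
    The weighting below is an assumed combination, not a measured WER.
    """
    total = sum(duration for duration, _ in segments)
    if total <= 0:
        return 0.0
    # exp(avg_logprob) is the geometric-mean token probability of a segment.
    mean_prob = sum(d * math.exp(lp) for d, lp in segments) / total
    lang = min(max(language_probability, 0.0), 1.0)
    score = min(max(mean_prob * lang, 0.0), 1.0)
    return round(100.0 * score, 1)


def confidence_label(pct: float) -> str:
    """Map an estimate onto the four directional bands listed above."""
    if pct >= 90:
        return "near-verbatim"
    if pct >= 70:
        return "usable with light proofread"
    if pct >= 50:
        return "needs human cleanup"
    return "best-effort"
```

Weighting by duration keeps a few very short low-confidence segments from dominating the estimate; an unweighted mean would be an equally defensible choice.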
3. What affects accuracy
In rough order of impact:
- Audio quality — bitrate, signal-to-noise ratio,
dynamic range. A clean voice recording at 64 kbps or higher outperforms a
32 kbps phone recording across every band.
- Background noise — fans, traffic, crowd hum,
keyboard clatter, instrumental tracks. The denoise pre-pass in
Attempt 2 handles most of this, but cannot recover speech that's
drowned out at the source.
- Accents & dialects — Whisper handles most
major accents reliably; very heavy regional dialects and
code-switching can drop quality by 10-20 percentage points.
- Overlapping speakers — Whisper transcribes the
dominant speaker; simultaneous speech is often dropped or merged.
For interviews / panels, recordings with separate mics per speaker
transcribe ~15 pts higher than a single room mic.
- Music presence — instrumentals mixed under
vocals are the most disruptive single factor. Songs with melodic
repetition (chorus, bridge) trigger the model's anti-hallucination
guardrails. The denoise pre-pass + song-mode params on Attempt 2
push accuracy back up but cannot make a heavy-mix EDM track
transcribable.
- Language switching mid-clip — Whisper detects
language at the start and sticks with it. In a clip that opens in
English and switches to Spanish, the Spanish portion is typically
mistranscribed or dropped.
4. Auto-escalation ladder
Every transcription runs up to three tiers transparently. If a
tier returns below its floor, the next tier picks up automatically —
same transcript row, no extra charge, the user sees
one transcript that progressively improves. The cap is three attempts so
we don't loop forever on genuinely no-guarantee content.
- Attempt 1 —
small Whisper model
with voice-activity detection. Tight anti-hallucination
thresholds (compression_ratio=2.0,
no_speech=0.6). Optimised for clean speech (the
dominant case). Floor for escalation: 75%.
- Attempt 2 — denoise pre-pass strips music
and background noise via FFmpeg's
arnndn/afftdn filters,
then re-runs with the medium Whisper model in
song-mode (vad_filter=False,
compression_ratio=2.4, no_speech=0.45).
Best for songs, noisy meetings, and field recordings. Floor for
escalation: 55%.
- Attempt 3 — the cleaned audio is sent to
OpenAI's hosted Whisper (whisper-1, large model).
The heaviest pull-up we have; used only when both local engines
can't lift the result above the floor. We absorb the API cost;
you are never charged for escalation.
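The denoise pre-pass in Attempt 2 can be pictured as an FFmpeg invocation. The sketch below only builds the command line and uses the `afftdn` filter; the noise-floor value, the 16 kHz mono output, and the function name are illustrative assumptions, since the filter options VTS actually uses are not stated here. `arnndn` is left out because it needs a trained model file passed through its `m=` option.

```python
from typing import List


def build_denoise_command(src: str, dst: str, noise_floor_db: int = -25) -> List[str]:
    """Build an ffmpeg command that denoises src and writes 16 kHz mono audio to dst.

    noise_floor_db is an illustrative value; afftdn accepts -80 to -20 dB.
    """
    return [
        "ffmpeg",
        "-y",                                   # overwrite dst if it exists
        "-i", src,
        "-af", f"afftdn=nf={noise_floor_db}",   # FFT-based denoiser
        "-ar", "16000",                         # resample to 16 kHz
        "-ac", "1",                             # downmix to mono
        dst,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)` on a machine with FFmpeg installed.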
During escalation, the transcript row's status reads
"Auto-improving transcription — Attempt 2 of 3…" so you can
see the system working. The final attempt's confidence is what shows on
the transcript card.
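The control flow of the ladder can be sketched as a short loop. The `Tier` type, the `run` callables, and the status callback are hypothetical stand-ins for the real engines; only the floors (75% and 55%) and the three-attempt cap come from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple


@dataclass
class Tier:
    name: str
    floor: Optional[float]                    # escalate below this; None on the last tier
    run: Callable[[str], Tuple[str, float]]   # hypothetical engine: path -> (text, confidence %)


def run_ladder(
    audio_path: str,
    tiers: Sequence[Tier],
    on_status: Callable[[int, int], None] = lambda attempt, total: None,
) -> Tuple[str, float, int]:
    """Run tiers in order and stop at the first result that meets its floor.

    Returns (text, confidence, attempts_used). The last attempt's result is
    returned even if it is still below the previous floor.
    """
    if not tiers:
        raise ValueError("at least one tier is required")
    text, confidence, attempt = "", 0.0, 0
    for attempt, tier in enumerate(tiers, start=1):
        if attempt > 1:
            on_status(attempt, len(tiers))    # e.g. show "Attempt 2 of 3"
        text, confidence = tier.run(audio_path)
        if tier.floor is None or confidence >= tier.floor:
            break
    return text, confidence, attempt
```

With three tiers configured as `Tier("small", 75.0, ...)`, `Tier("medium+denoise", 55.0, ...)` and `Tier("hosted", None, ...)`, the loop cannot run more than three times, which matches the cap described above.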
5. What we transcribe
Supported sources
YouTube videos & shorts, Facebook videos &
reels, and direct audio/video uploads. Other URL sources (TikTok,
Loom, Vimeo, Instagram) are not officially supported; uploading the
file directly always works.
Upload formats
MP3, M4A, WAV, FLAC, OGG, OPUS, WMA, AIFF, AMR,
MP4, M4V, MOV, MKV, WEBM, AVI, WMV, FLV, MPEG, MPG, 3GP, TS.
Up to 2048 MB per file.
Per-file duration
20 minutes on the
Free tier; up to 4 hours on paid
tiers. Longer files should be split before upload.
What we don't transcribe
Private / geo-restricted / age-gated videos
(we can only access publicly reachable sources). Live streams
that haven't ended. DRM-protected content. Anything you don't
have the right to transcribe.
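The upload limits above lend themselves to a simple pre-flight check. This is a minimal sketch: the function name, the tier labels `"free"` and `"paid"`, and the reading of 2048 MB as 2048 × 1024 × 1024 bytes are assumptions.

```python
from pathlib import Path
from typing import List

ALLOWED_EXTENSIONS = {
    "mp3", "m4a", "wav", "flac", "ogg", "opus", "wma", "aiff", "amr",
    "mp4", "m4v", "mov", "mkv", "webm", "avi", "wmv", "flv", "mpeg",
    "mpg", "3gp", "ts",
}
MAX_BYTES = 2048 * 1024 * 1024                            # assumes 1 MB = 1024 * 1024 bytes
MAX_SECONDS = {"free": 20 * 60, "paid": 4 * 60 * 60}      # hypothetical tier labels


def validate_upload(
    filename: str, size_bytes: int, duration_seconds: float, tier: str = "free"
) -> List[str]:
    """Return a list of human-readable problems; an empty list means the file passes."""
    problems: List[str] = []
    ext = Path(filename).suffix.lower().lstrip(".")
    if ext not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: .{ext or '?'}")
    if size_bytes > MAX_BYTES:
        problems.append("file is larger than 2048 MB")
    if duration_seconds > MAX_SECONDS[tier]:
        limit_min = MAX_SECONDS[tier] // 60
        problems.append(f"longer than {limit_min} minutes; split before upload")
    return problems
```

A 25-minute MP3 passes on a paid tier but is flagged on the Free tier.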
6. Language support
VTS transcribes any language Whisper supports — about 100
languages including English, Spanish, French, German, Italian,
Portuguese, Dutch, Polish, Russian, Ukrainian, Japanese, Korean,
Chinese, Hindi, Arabic, Turkish, Vietnamese, Indonesian, Filipino,
and many more.
Accuracy is highest on the languages Whisper was trained on most
heavily (English, Spanish, French, German). Lower-resource languages
may sit ~10-20 percentage points below the band ceilings even on
clean audio. Language is auto-detected at the start of the clip and
does not switch mid-stream.
7. Service availability
VTS is a best-effort service. We target high availability but do
not publish a formal uptime SLA.
- Target uptime — 99% measured monthly,
excluding scheduled maintenance and upstream provider outages
(Stripe, YouTube, Facebook, HuggingFace, OpenAI).
- Scheduled maintenance — usually announced at
least 24 hours ahead via a banner on the homepage. Most deploys
are zero-downtime and don't require maintenance windows.
- Incident communication — for outages affecting
transcription, we email signed-in users with active credit balances
or active subscriptions once the incident is identified.
- Data durability — completed transcripts are
stored in our managed Postgres with automated daily backups. We do
not guarantee permanent retention; export your transcripts (.txt /
.srt) if you need long-term archival.
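To put the 99% monthly target in concrete terms, the arithmetic below converts it into a downtime budget. The helper name is hypothetical, and the figure does not include the excluded categories (scheduled maintenance and upstream provider outages).

```python
def allowed_downtime_minutes(target: float, days_in_month: int) -> float:
    """Minutes of downtime per month that still meet the uptime target."""
    minutes_in_month = days_in_month * 24 * 60
    return round((1.0 - target) * minutes_in_month, 2)


# A 30-day month has 43,200 minutes; 1% of that is 432 minutes (7.2 hours).
budget = allowed_downtime_minutes(0.99, 30)
```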
8. Refund policy
Failed transcriptions are made whole — as in-system
credit, not as a refund to your card. If our system fails
to produce a usable transcript (the worker errors out, the upstream
download is blocked, the file produces an empty transcript despite
containing speech), the make-good is automatic. It follows one of
three tracks:
- PAYG → wallet credit. Failed transcriptions
paid from your wallet have the exact cents returned to your wallet
balance within a few minutes. The credit is idempotent (a retry
can't double-credit) and the original charge appears in your
transaction history alongside the reversal so the audit trail is
intact.
- Pro subscription → minute credit. A failed
transcription for a Pro user doesn't count against the 25-hour
included pool — the minutes that were consumed by the failed run
are added back to the current period's included quota. The
transcription effectively "didn't happen" against your monthly
pool. (If you were already past the soft cap and the failed run
fell through to wallet PAYG, the credit goes back to the wallet
instead — same as the PAYG track.)
- Quality issues → re-transcribe at no charge.
If a completed transcript wasn't accurate enough, the
auto-escalation ladder usually fixes it. If not, click
Re-transcribe + denoise on the transcript row to
force-run the higher tiers (denoise + medium / large model). No
new charge, no new transcript row — the existing one is refreshed
in place.
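The idempotent wallet credit on the PAYG track can be illustrated with an in-memory ledger. This is a sketch under stated assumptions: the `Wallet` class, its method names, and the use of the transcription ID as the idempotency key are all hypothetical, and a real system would enforce the same rule in the database.

```python
from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class Wallet:
    balance_cents: int = 0
    # Each entry is (kind, transcription_id, signed_cents).
    history: List[Tuple[str, str, int]] = field(default_factory=list)
    credited: Set[str] = field(default_factory=set)

    def charge(self, transcription_id: str, cents: int) -> None:
        self.balance_cents -= cents
        self.history.append(("charge", transcription_id, -cents))

    def credit_failed(self, transcription_id: str) -> bool:
        """Return the exact cents charged for a failed run, at most once."""
        if transcription_id in self.credited:
            return False  # a retry must not double-credit
        charged = sum(
            -cents
            for kind, tid, cents in self.history
            if kind == "charge" and tid == transcription_id
        )
        if charged <= 0:
            return False  # nothing was charged for this transcription
        self.balance_cents += charged
        # The original charge stays in the history next to the reversal.
        self.history.append(("reversal", transcription_id, charged))
        self.credited.add(transcription_id)
        return True
```

Calling `credit_failed` twice for the same transcription changes the balance only the first time, and both the charge and the reversal remain visible in `history`.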
What we do not refund:
- Subscription charges to your card. Pro is a
flat $15/month plan; we do not prorate, refund, or partially
reverse subscription charges via Stripe. Cancel anytime in the
Customer Portal and you keep Pro through the end of the current
billing period; no further charges. This policy is fixed as of
2026-05-27.
- Successfully-completed transcripts you're unhappy
with for stylistic reasons. The auto-ladder + the
Re-transcribe button cover quality issues; we don't reverse
charges for "I didn't like the formatting" or "the speakers
weren't labeled how I expected".
Why no Stripe-side refunds? Money flowing back
through Stripe creates accounting friction (Stripe fees aren't
refunded, dispute risk goes up, your statement gets messy). In-system
credits — wallet cents for PAYG, included minutes for Pro — are
faster, reversible, and don't cost you anything in fees or
cross-currency conversion. If Stripe themselves issue a refund (rare:
a chargeback or a duplicate-charge dispute that you win), our books
simply reflect the reversed charge: the corresponding wallet balance
is debited to keep the audit trail straight.
9. Improving your audio
The single biggest accuracy lever is the audio you upload. Two
minutes of attention upstream is worth more than any number of
re-transcribes:
- Record close to the source. A lavalier mic
six inches from the speaker beats a phone across the room by
20+ percentage points.
- Mute the music. If you're recording a
podcast / interview / vlog with a soundtrack, record the speech
track separately and add music in post. The denoise pre-pass
helps but it's not a miracle.
- One mic per speaker for panels and
multi-person interviews. Single-mic recordings of group
conversations lose ~15 pts of accuracy versus per-speaker mics.
- Pick a quiet room. Tile bathrooms, parking
garages, and reflective spaces add room echo that no denoiser
fully removes. Soft furnishings + carpet = better transcripts.
- Use 16 kHz mono or higher. Whisper resamples
to 16 kHz internally; under-sampled or lossy inputs lose detail
that can't be recovered.
- Keep clips under 20 minutes when possible.
Shorter clips are less prone to cascading hallucinations. For
long content, splitting at natural breaks (chapters, scene
changes) gives consistently higher accuracy than a single 2-hour
file.
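Splitting a long recording at natural breaks can be planned before any audio is cut. The sketch below is a simple greedy planner, assuming you already have a list of candidate break times in seconds (chapter marks, scene changes); the function name and the fallback to a hard cut when no break is available are illustrative choices.

```python
from typing import List, Sequence, Tuple


def plan_chunks(
    duration_s: float,
    breaks_s: Sequence[float],
    max_chunk_s: float = 20 * 60,
) -> List[Tuple[float, float]]:
    """Return (start, end) pairs, each at most max_chunk_s long.

    Each chunk is extended to the latest natural break that still fits;
    if no break fits, the chunk is cut at exactly max_chunk_s.
    """
    candidates = sorted(b for b in breaks_s if 0 < b < duration_s)
    cuts: List[float] = []
    start = 0.0
    while duration_s - start > max_chunk_s:
        limit = start + max_chunk_s
        usable = [b for b in candidates if start < b <= limit]
        cut = usable[-1] if usable else limit
        cuts.append(cut)
        start = cut
    bounds = [0.0] + cuts + [duration_s]
    return list(zip(bounds[:-1], bounds[1:]))
```

The resulting boundaries can then be handed to whatever tool does the cutting, so that no chunk exceeds 20 minutes.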
10. Reporting an issue
If a transcript came back wrong in a way the auto-escalation
ladder didn't catch, or you think you were charged for something we
shouldn't have charged for, email
support@askgiya.com with
the transcript ID (visible at the top of the transcript card) and a
short description. We read every message.
For all other issues, the FAQ at
/support is the fastest first stop.
This document is the
source of truth for VTS accuracy expectations. The bands and the
ladder are implemented in code at
quality.py; this page renders from the same module so
the numbers can't drift.