Accuracy expectations & service standards

What we promise on transcript quality, how we measure it, what we do automatically when a result comes back below its expected band, and what we'll do if something genuinely goes wrong on our end.

Last updated 2026-05-27 · Applies to all transcriptions processed by VTS

On this page
  1. Accuracy bands
  2. How accuracy is measured
  3. What affects accuracy
  4. Auto-escalation ladder
  5. What we transcribe
  6. Language support
  7. Service availability
  8. Refund policy
  9. Improving your audio
  10. Reporting an issue

1. Accuracy bands

These are ranges we observe across real audio, not contractual guarantees. Audio quality, accents, overlapping speakers, and music all cap the accuracy any transcript can reach. We auto-retry with stronger settings to push every transcript into its expected band, but no model produces a reliable transcript on heavy music or fully degraded source audio.

Speech audio

Audio type                                        Expected accuracy
Clear audio, one speaker, minimal noise           90% to 98%
Clear meeting audio, multiple speakers            80% to 95%
Background noise, accents, overlapping speakers   60% to 85%
Poor audio, music, low volume, heavy noise        No guaranteed accuracy

Songs & music

Song type                                           Expected accuracy
Clear vocals, slow song, minimal effects            70% to 92%
Pop song with music and effects                     50% to 80%
Rap, fast vocals, ad-libs, overlapping voices       40% to 75%
Heavy music, distorted vocals, live concert audio   No guaranteed accuracy

No guaranteed accuracy means we'll still transcribe — and run the full escalation ladder below — but the result reflects what the source audio actually contains. For heavy music, distorted live-concert audio, or recordings dominated by background noise, no current model produces a reliable transcript at any price point.
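As an illustration only, the bands above can be held as plain data and compared against a transcript's confidence figure. The dictionary keys and the band_status helper are hypothetical names chosen for this sketch, not identifiers taken from quality.py; only the percentages come from the tables.

```python
from typing import Dict, Optional, Tuple

# A band is (low, high) in percent; None stands for "No guaranteed accuracy".
Band = Optional[Tuple[int, int]]

# Percentages copied from the tables above. The keys are hypothetical labels.
SPEECH_BANDS: Dict[str, Band] = {
    "clear_single_speaker": (90, 98),
    "clear_meeting": (80, 95),
    "noisy_or_overlapping": (60, 85),
    "poor_audio": None,
}
SONG_BANDS: Dict[str, Band] = {
    "clear_slow_vocals": (70, 92),
    "pop_with_effects": (50, 80),
    "rap_or_fast_vocals": (40, 75),
    "heavy_or_live": None,
}


def band_status(confidence_pct: float, band: Band) -> str:
    """Say where a confidence figure sits relative to an expected band."""
    if band is None:
        return "no guarantee"
    low, high = band
    if confidence_pct < low:
        return "below band"
    if confidence_pct > high:
        return "above band"
    return "within band"


print(band_status(92.0, SPEECH_BANDS["clear_single_speaker"]))
print(band_status(45.0, SONG_BANDS["pop_with_effects"]))
```

Keeping "no guarantee" as an explicit None, rather than a band of 0% to 100%, makes it harder to mistake the bottom row of each table for a promise.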

2. How accuracy is measured

The percentage shown on each transcript is an estimated accuracy, not a measured word-error-rate. We compute it from Whisper's per-segment avg_logprob values combined with the detected-language probability — both signals that strongly correlate with actual transcription quality on real audio.

It's directional:

The confidence number drives the auto-escalation ladder below: if a tier returns below its floor, the next tier picks up automatically.
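One plausible way to turn those two signals into a single percentage is sketched below. This is an assumption for illustration, not the formula VTS uses: it converts each segment's avg_logprob to a probability with exp, averages the results weighted by segment duration, and scales by the detected-language probability. The Segment class and estimate_confidence function are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Segment:
    start: float        # seconds
    end: float          # seconds
    avg_logprob: float  # mean token log-probability reported for the segment


def estimate_confidence(segments: Sequence[Segment],
                        language_probability: float) -> float:
    """Return an estimated accuracy in percent (0 to 100).

    Assumed formula: duration-weighted mean of exp(avg_logprob),
    multiplied by the detected-language probability.
    """
    durations = [max(s.end - s.start, 0.0) for s in segments]
    total = sum(durations)
    if total <= 0:
        return 0.0
    weighted = sum(
        math.exp(min(s.avg_logprob, 0.0)) * d  # clamp so exp() stays <= 1
        for s, d in zip(segments, durations)
    ) / total
    lang = min(max(language_probability, 0.0), 1.0)
    return round(100.0 * weighted * lang, 1)


segments = [Segment(0.0, 4.0, -0.10), Segment(4.0, 6.0, -0.45)]
print(estimate_confidence(segments, language_probability=0.97))
```

Weighting by duration keeps a one-second low-confidence fragment from dragging down an otherwise clean hour of speech. A score built this way is still only a proxy for word error rate.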

3. What affects accuracy

In rough order of impact:

4. Auto-escalation ladder

Every transcription runs up to three tiers transparently. If a tier returns below its floor, the next tier picks up automatically — same transcript row, no extra charge, the user sees one transcript that progressively improves. Cap is three attempts so we don't loop forever on genuinely no-guarantee content.

  1. Attempt 1: small Whisper model with voice-activity detection and tight anti-hallucination thresholds (compression_ratio=2.0, no_speech=0.6). Optimised for clean speech, the dominant case. Floor for escalation: 75%.
  2. Attempt 2: a denoise pre-pass strips music and background noise via arnndn/afftdn, then the audio is re-run through the medium Whisper model in song mode (vad_filter=False, compression_ratio=2.4, no_speech=0.45). Best for songs, noisy meetings, and field recordings. Floor for escalation: 55%.
  3. Attempt 3: the cleaned audio is sent to OpenAI's hosted Whisper (whisper-1, large model). This is the strongest option we have, used only when neither local engine can lift the result above its floor. We absorb the API cost; you are never charged for escalation.
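For readers curious what a denoise pre-pass of the kind used in Attempt 2 might look like, the sketch below builds an FFmpeg command around the afftdn and arnndn audio filters. The filter names come from the text above; the noise-floor value, the mono 16 kHz output, the optional model path, and the build_denoise_command helper are all assumptions for illustration, not VTS's actual settings.

```python
from typing import List, Optional


def build_denoise_command(src: str, dst: str,
                          rnn_model: Optional[str] = None) -> List[str]:
    """Build an ffmpeg argument list that denoises src into dst.

    Parameter values here are illustrative assumptions. arnndn needs a
    trained model file, so it is only added when a path is supplied.
    """
    filters = []
    if rnn_model:
        # Simple paths only: special characters would need filter escaping.
        filters.append(f"arnndn=m={rnn_model}")
    filters.append("afftdn=nf=-25")  # FFT denoiser, noise floor in dB
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-af", ",".join(filters),
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz sample rate
        dst,
    ]


print(" ".join(build_denoise_command("input.mp4", "cleaned.wav")))
```

The function returns an argument list rather than a shell string, so it can be passed directly to subprocess.run without shell quoting concerns.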

During escalation, the transcript row's status reads "Auto-improving transcription — Attempt 2 of 3…" so you can see the system working. Final attempt's confidence is what shows on the transcript card.
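The control flow of the ladder in this section can be sketched in a few lines. The floors (75% and 55%) and the three-attempt cap come from the text above; the run_ladder function, the Transcriber type, and the stand-in tier functions are hypothetical and only show the escalation logic, not the real engines.

```python
from typing import Callable, Optional, Sequence, Tuple

# A tier takes an audio path and returns (transcript text, confidence percent).
Transcriber = Callable[[str], Tuple[str, float]]

# Floors from the ladder above; the last attempt has no floor because
# there is nothing left to escalate to.
LADDER_FLOORS: Sequence[Optional[float]] = (75.0, 55.0, None)


def run_ladder(audio_path: str,
               tiers: Sequence[Transcriber],
               floors: Sequence[Optional[float]] = LADDER_FLOORS,
               ) -> Tuple[str, float, int]:
    """Run tiers in order and stop at the first result at or above its floor.

    Returns (text, confidence, attempt number) from the last attempt run.
    """
    text, confidence, attempt = "", 0.0, 0
    for attempt, (tier, floor) in enumerate(zip(tiers, floors), start=1):
        text, confidence = tier(audio_path)
        if floor is None or confidence >= floor:
            break  # good enough, or no further tier available
    return text, confidence, attempt


# Stand-in tiers with fixed outputs, purely for demonstration.
def small_model(path: str) -> Tuple[str, float]:
    return "draft", 60.0


def medium_model(path: str) -> Tuple[str, float]:
    return "better", 70.0


def hosted_model(path: str) -> Tuple[str, float]:
    return "best", 90.0


print(run_ladder("clip.wav", [small_model, medium_model, hosted_model]))
```

Because zip stops at the shorter of the two sequences, the loop can never run more attempts than there are floors, which is what enforces the three-attempt cap in this sketch.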

5. What we transcribe

Supported sources
YouTube videos & shorts, Facebook videos & reels, and direct audio/video uploads. Other URL sources (TikTok, Loom, Vimeo, Instagram) are not officially supported — uploads always work.
Upload formats
MP3, M4A, WAV, FLAC, OGG, OPUS, WMA, AIFF, AMR, MP4, M4V, MOV, MKV, WEBM, AVI, WMV, FLV, MPEG, MPG, 3GP, TS. Up to 2048 MB per file.
Per-file duration
20 minutes on the Free tier; up to 4 hours on paid tiers. Longer files should be split before upload.
What we don't transcribe
Private / geo-restricted / age-gated videos (we can only access publicly reachable sources). Live streams that haven't ended. DRM-protected content. Anything you don't have the right to transcribe.
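If you want to pre-check files on your side before uploading, the limits above translate into a short validation routine. This is a sketch under stated assumptions: the check_upload function and tier labels are hypothetical, "2048 MB" is read as 2048 × 1024 × 1024 bytes, and the duration is assumed to be known already (for example from a media probe tool).

```python
from pathlib import Path
from typing import List

# Extensions taken from the upload formats listed above.
ALLOWED_EXTENSIONS = {
    "mp3", "m4a", "wav", "flac", "ogg", "opus", "wma", "aiff", "amr",
    "mp4", "m4v", "mov", "mkv", "webm", "avi", "wmv", "flv", "mpeg",
    "mpg", "3gp", "ts",
}
MAX_FILE_BYTES = 2048 * 1024 * 1024  # assumption: MB means 1024 * 1024 bytes
MAX_DURATION_SECONDS = {"free": 20 * 60, "paid": 4 * 60 * 60}


def check_upload(filename: str, size_bytes: int,
                 duration_seconds: float, tier: str) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    if tier not in MAX_DURATION_SECONDS:
        raise ValueError(f"unknown tier: {tier!r}")
    problems = []
    extension = Path(filename).suffix.lower().lstrip(".")
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: .{extension or '?'}")
    if size_bytes > MAX_FILE_BYTES:
        problems.append("file is larger than 2048 MB")
    if duration_seconds > MAX_DURATION_SECONDS[tier]:
        problems.append(f"too long for the {tier} tier; split before upload")
    return problems


print(check_upload("interview.MP3", 50_000_000, 25 * 60, "free"))
```

Returning every problem at once, rather than stopping at the first, lets you fix format, size, and duration issues in a single pass.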

6. Language support

VTS transcribes any language Whisper supports — about 100 languages including English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Japanese, Korean, Chinese, Hindi, Arabic, Turkish, Vietnamese, Indonesian, Filipino, and many more.

Accuracy is highest on the languages Whisper was trained on most heavily (English, Spanish, French, German). Lower-resource languages may sit roughly 10 to 20 percentage points below the band ceilings even on clean audio; against the 98% ceiling for clear single-speaker audio, that would mean roughly 78% to 88%. Language is auto-detected at the start of the clip and does not switch mid-stream.

7. Service availability

VTS is a best-effort service. We target high availability but do not publish a formal uptime SLA.

8. Refund policy

Failed transcriptions are made whole — as in-system credit, not as a refund to your card. If our system fails to produce a usable transcript (the worker errors out, the upstream download is blocked, the file produces an empty transcript despite containing speech), the make-good is automatic and depends on how you paid:

Three tracks:

What we do not refund:

Why no Stripe-side refunds? Money flowing back through Stripe creates accounting friction (Stripe fees aren't refunded, dispute risk goes up, your statement gets messy). In-system credits — wallet cents for PAYG, included minutes for Pro — are faster, reversible, and don't cost you anything in fees or cross-currency conversion. If Stripe themselves issue a refund (rare: a chargeback or a duplicate-charge dispute that you win), our books simply reflect the reversed charge: the corresponding wallet balance is debited to keep the audit trail straight.

9. Improving your audio

The single biggest accuracy lever is the audio you upload. Two minutes of attention upstream is worth more than any number of re-transcribes:

10. Reporting an issue

If a transcript came back wrong in a way the auto-escalation ladder didn't catch, or you think you were charged for something we shouldn't have charged for, email support@askgiya.com with the transcript ID (visible at the top of the transcript card) and a short description. We read every message.

For all other issues, the FAQ at /support is the fastest first stop.

This document is the source of truth for VTS accuracy expectations. The bands and the ladder are implemented in code at quality.py; this page renders from the same module so the numbers can't drift.