Accuracy expectations & service standards

What we promise on transcript quality, how we measure it, what we do automatically when a result comes back below its expected band, and what we'll do if something genuinely goes wrong on our end.

Last updated 2026-05-27 · Applies to all transcriptions processed by Ask Giya

On this page

Accuracy bands
How accuracy is measured
What affects accuracy
Auto-escalation ladder
What we transcribe
Language support
Service availability
Refund policy
Improving your audio
Reporting an issue

1. Accuracy bands

These are ranges we observe across real audio, not contractual guarantees. Audio quality, accents, overlapping speakers, and music all cap the accuracy a transcript can reach. We auto-retry with stronger settings to push every transcript into its expected band, but no model produces a reliable transcript on heavy music or fully degraded source audio.

Speech audio

Audio type	Expected accuracy
Clear audio, one speaker, minimal noise	90% to 98%
Clear meeting audio, multiple speakers	80% to 95%
Background noise, accents, overlapping speakers	60% to 85%
Poor audio, music, low volume, heavy noise	No guaranteed accuracy

Songs & music

Song type	Expected accuracy
Clear vocals, slow song, minimal effects	70% to 92%
Pop song with music and effects	50% to 80%
Rap, fast vocals, ad-libs, overlapping voices	40% to 75%
Heavy music, distorted vocals, live concert audio	No guaranteed accuracy

No guaranteed accuracy means we'll still transcribe — and run the full escalation ladder below — but the result reflects what the source audio actually contains. For heavy music, distorted live-concert audio, or recordings dominated by background noise, no current model produces a reliable transcript at any price point.
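For readers who want to check a result against these tables programmatically, the bands can be held in a small lookup. This is a minimal sketch: the category keys and the `meets_band` helper are hypothetical names chosen for illustration, not identifiers from our codebase. Only the percentages come from the tables above.

```python
from typing import Dict, Optional, Tuple

# (floor, ceiling) in percent, copied from the tables above.
# None stands for "No guaranteed accuracy".
# The keys are hypothetical labels used only in this example.
Band = Optional[Tuple[int, int]]

EXPECTED_BANDS: Dict[str, Band] = {
    "speech_clear_single": (90, 98),
    "speech_clear_meeting": (80, 95),
    "speech_noisy": (60, 85),
    "speech_poor": None,
    "song_clear_slow": (70, 92),
    "song_pop": (50, 80),
    "song_rap_fast": (40, 75),
    "song_heavy_live": None,
}


def meets_band(category: str, accuracy_pct: float) -> Optional[bool]:
    """Return True/False for banded categories, None when no band applies."""
    band = EXPECTED_BANDS[category]
    if band is None:
        return None  # no guaranteed accuracy, so there is nothing to compare against
    floor, _ceiling = band
    # A result above the ceiling still counts as meeting the band.
    return accuracy_pct >= floor
```

For example, `meets_band("song_pop", 62)` is `True`, while `meets_band("speech_poor", 30)` is `None` because that category has no band.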

2. How accuracy is measured

The percentage shown on each transcript is an estimated accuracy, not a measured word-error-rate. We compute it from Whisper's per-segment avg_logprob values combined with the detected-language probability — both signals that strongly correlate with actual transcription quality on real audio.
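One plausible way to turn those two signals into a single percentage is sketched below. The exact formula we use is not given on this page, so treat everything here as an assumption: the duration weighting, the linear blend, and the `language_weight` default are all illustrative. The only facts taken from the text are the inputs, namely per-segment `avg_logprob` and the detected-language probability.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Segment:
    """Minimal stand-in for a Whisper segment (only the fields used here)."""
    start: float        # seconds
    end: float          # seconds
    avg_logprob: float  # mean token log-probability, always <= 0


def estimate_accuracy(
    segments: Sequence[Segment],
    language_probability: float,
    language_weight: float = 0.2,  # hypothetical blend weight
) -> float:
    """Blend segment confidence with language confidence into a 0-100 score."""
    total = sum(max(s.end - s.start, 0.0) for s in segments)
    if total <= 0:
        return 0.0
    # exp(avg_logprob) maps a log-probability back to a 0..1 probability;
    # longer segments contribute proportionally more to the mean.
    mean_prob = sum(
        math.exp(s.avg_logprob) * max(s.end - s.start, 0.0) for s in segments
    ) / total
    lang = min(max(language_probability, 0.0), 1.0)
    blended = (1.0 - language_weight) * mean_prob + language_weight * lang
    return round(100.0 * min(max(blended, 0.0), 1.0), 1)
```

Because the score is derived from model confidence rather than a comparison with a reference transcript, it can be high on fluent but wrong output, which is why the page calls it an estimate.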

It's directional:

90%+ — model is highly confident in every segment. In practice, near-verbatim accuracy on clean speech.
70-90% — usable as-is for notes, captions, or quotation with a light proofread. Most vocals-over-music recordings land here.
50-70% — directionally correct but needs human cleanup before publication. Common for noisy field recordings.
Below 50% — best-effort output. Read with a grain of salt; consider a higher-quality recording.

The confidence number drives the auto-escalation ladder below: if a tier returns below its floor, the next tier picks up automatically.
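The four ranges above can be expressed as a simple lookup. This sketch assumes that a boundary value belongs to the higher range (so exactly 90 counts as "90%+"), since the ranges as written share their endpoints. The function name and the returned labels are illustrative.

```python
def describe_confidence(pct: float) -> str:
    """Map an estimated accuracy percentage to the guidance for its range."""
    if pct >= 90:
        return "near-verbatim on clean speech"
    if pct >= 70:
        return "usable as-is with a light proofread"
    if pct >= 50:
        return "needs human cleanup before publication"
    return "best-effort output"
```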

3. What affects accuracy

In rough order of impact:

Audio quality — bitrate, signal-to-noise ratio, dynamic range. A clean voice recording at 64 kbps or higher outperforms a 32 kbps phone recording across every band.
Background noise — fans, traffic, crowd hum, keyboard clatter, instrumental tracks. The denoise pre-pass in Attempt 2 handles most of this, but cannot recover speech that's drowned out at the source.
Accents & dialects — Whisper handles most major accents reliably; very heavy regional dialects and code-switching can drop quality by 10-20 percentage points.
Overlapping speakers — Whisper transcribes the dominant speaker; simultaneous speech is often dropped or merged. For interviews / panels, recordings with separate mics per speaker transcribe ~15 pts higher than a single room mic.
Music presence — instrumentals mixed under vocals are the most disruptive single factor. Songs with melodic repetition (chorus, bridge) trigger the model's anti-hallucination guardrails. The denoise pre-pass + song-mode params on Attempt 2 push accuracy back up but cannot make a heavy-mix EDM track transcribable.
Language switching mid-clip — Whisper detects language at the start and sticks with it. A clip that opens in English and switches to Spanish will lose the Spanish portion.

4. Auto-escalation ladder

Every transcription runs up to three tiers transparently. If a tier returns below its floor, the next tier picks up automatically — same transcript row, no extra charge, the user sees one transcript that progressively improves. Cap is three attempts so we don't loop forever on genuinely no-guarantee content.

Attempt 1 — small Whisper model with voice-activity detection. Tight anti-hallucination thresholds (compression_ratio=2.0, no_speech=0.6). Optimised for clean speech (the dominant case). Floor for escalation: 75%.
Attempt 2 — denoise pre-pass strips music and background noise via arnndn/afftdn, then re-runs with the medium Whisper model in song-mode (vad_filter=False, compression_ratio=2.4, no_speech=0.45). Best for songs, noisy meetings, and field recordings. Floor for escalation: 55%.
Attempt 3 — the cleaned audio is sent to OpenAI's hosted Whisper (whisper-1, large model). The heaviest pull-up we have; used only when both local engines can't lift the result above the floor. We absorb the API cost; you are never charged for escalation.
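To give a feel for the denoise pre-pass in Attempt 2, here is a sketch that builds an ffmpeg command using the two filters named above. `afftdn` and `arnndn` are real ffmpeg audio filters, and `arnndn` needs a model file passed as `m=`. How we chain them, the filter options, and the 16 kHz mono output are assumptions made for this example, not our production settings.

```python
from typing import List, Optional


def denoise_command(
    src: str, dst: str, rnn_model: Optional[str] = None
) -> List[str]:
    """Build an ffmpeg argv that denoises src and writes 16 kHz mono to dst."""
    filters = []
    if rnn_model:
        # arnndn: RNN-based denoiser; requires a trained model file.
        # Paths with commas or colons would need filtergraph escaping.
        filters.append(f"arnndn=m={rnn_model}")
    # afftdn: FFT-based denoiser, used here with its default options.
    filters.append("afftdn")
    return [
        "ffmpeg", "-y",
        "-i", src,
        "-af", ",".join(filters),
        "-ac", "1",       # mono
        "-ar", "16000",   # 16 kHz sample rate
        dst,
    ]
```

The resulting list can be passed to `subprocess.run(..., check=True)` on a machine with ffmpeg installed.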

During escalation, the transcript row's status reads "Auto-improving transcription — Attempt 2 of 3…" so you can see the system working. The final attempt's confidence is what shows on the transcript card.
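The control flow of the ladder can be sketched in a few lines. The floors (75% and 55%) and the three-attempt cap come from the description above; the `Tier` class, the tier names, and the `transcribe` callback signature are hypothetical and stand in for the real engines.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    floor: Optional[float]  # escalate when confidence < floor; None = last tier


# Floors as documented above; the names are illustrative.
LADDER: List[Tier] = [
    Tier("small", 75.0),    # Attempt 1: small model with VAD
    Tier("medium", 55.0),   # Attempt 2: denoise + medium model, song mode
    Tier("hosted", None),   # Attempt 3: hosted whisper-1
]

# A transcribe callback takes (tier, audio_path) and returns (text, confidence).
Transcribe = Callable[[Tier, str], Tuple[str, float]]


def run_ladder(audio_path: str, transcribe: Transcribe) -> Tuple[str, float, int]:
    """Run tiers in order; stop at the first one that clears its floor."""
    text, confidence = "", 0.0
    for attempt, tier in enumerate(LADDER, start=1):
        text, confidence = transcribe(tier, audio_path)
        if tier.floor is None or confidence >= tier.floor:
            return text, confidence, attempt
    # Unreachable while the last tier has floor=None; kept as a safety net.
    return text, confidence, len(LADDER)
```

Note that the loop returns the last attempt's result, matching the statement that the final attempt's confidence is the one displayed.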

5. What we transcribe

Supported sources

YouTube videos & shorts, Facebook videos & reels, and direct audio/video uploads. Other URL sources (TikTok, Loom, Vimeo, Instagram) are not officially supported; for those, upload the file directly, which always works.

Upload formats

MP3, M4A, WAV, FLAC, OGG, OPUS, WMA, AIFF, AMR, MP4, M4V, MOV, MKV, WEBM, AVI, WMV, FLV, MPEG, MPG, 3GP, TS. Up to 2048 MB per file.

Per-file duration

20 minutes on the Free tier; up to 4 hours on paid tiers. Longer files should be split before upload.
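If you batch-upload files, a pre-flight check against the limits above can save a failed attempt. The sketch below is illustrative: the function and constant names are hypothetical, and it assumes "2048 MB" means 2048 × 1024 × 1024 bytes, which this page does not specify.

```python
import os
from typing import List

# Formats, size cap, and duration caps as listed above.
ALLOWED_EXTENSIONS = {
    "mp3", "m4a", "wav", "flac", "ogg", "opus", "wma", "aiff", "amr",
    "mp4", "m4v", "mov", "mkv", "webm", "avi", "wmv", "flv", "mpeg",
    "mpg", "3gp", "ts",
}
MAX_UPLOAD_BYTES = 2048 * 1024 * 1024  # assumption: MB interpreted as MiB
MAX_DURATION_SECONDS = {"free": 20 * 60, "paid": 4 * 60 * 60}


def validate_upload(
    filename: str, size_bytes: int, duration_seconds: float, plan: str
) -> List[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems: List[str] = []
    extension = os.path.splitext(filename)[1].lower().lstrip(".")
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported format: {extension or 'none'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append("file exceeds 2048 MB")
    if duration_seconds > MAX_DURATION_SECONDS[plan]:
        problems.append("file exceeds the duration limit; split it before upload")
    return problems
```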

What we don't transcribe

Private / geo-restricted / age-gated videos (we can only access publicly reachable sources). Live streams that haven't ended. DRM-protected content. Anything you don't have the right to transcribe.

6. Language support

Ask Giya transcribes any language Whisper supports — about 100 languages including English, Spanish, French, German, Italian, Portuguese, Dutch, Polish, Russian, Ukrainian, Japanese, Korean, Chinese, Hindi, Arabic, Turkish, Vietnamese, Indonesian, Filipino, and many more.

Accuracy is highest on the languages Whisper was trained on most heavily (English, Spanish, French, German). Lower-resource languages may sit ~10-20 percentage points below the band ceilings even on clean audio. Language is auto-detected at the start of the clip and does not switch mid-stream.

7. Service availability

Ask Giya is a best-effort service. We target high availability but do not publish a formal uptime SLA.

Target uptime — 99% measured monthly, excluding scheduled maintenance and upstream provider outages (Stripe, YouTube, Facebook, HuggingFace, OpenAI).
Scheduled maintenance — usually announced at least 24 hours ahead via a banner on the homepage. Most deploys are zero-downtime and don't require maintenance windows.
Incident communication — for outages affecting transcription, we email signed-in users with active credit balances or active subscriptions once the incident is identified.
Data durability — completed transcripts are stored in our managed Postgres with automated daily backups. We do not guarantee permanent retention; export your transcripts (.txt / .srt) if you need long-term archival.
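To put the 99% monthly target in concrete terms, the arithmetic below converts it into an hours-per-month downtime budget. This is a back-of-the-envelope illustration, not a commitment: the function name is hypothetical, and scheduled maintenance and upstream provider outages are excluded from the target as stated above.

```python
def downtime_budget_hours(days_in_month: int, target_pct: float = 99.0) -> float:
    """Hours of downtime per month that still meet the uptime target."""
    total_hours = days_in_month * 24
    return total_hours * (100.0 - target_pct) / 100.0
```

For a 30-day month this works out to about 7.2 hours (720 hours × 1%).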

8. Refund policy

Failed transcriptions are made whole as in-system credit, not as a refund to your card. If our system fails to produce a usable transcript (the worker errors out, the upstream download is blocked, or the file produces an empty transcript despite containing speech), the make-good is automatic.

There are three tracks. The first two depend on how you paid; the third covers quality issues:

PAYG → wallet credit. Failed transcriptions paid from your wallet have the exact cents returned to your wallet balance within a few minutes. The credit is idempotent (a retry can't double-credit) and the original charge appears in your transaction history alongside the reversal so the audit trail is intact.
Pro subscription → minute credit. A failed transcription for a Pro user doesn't count against the 25-hour included pool — the minutes that were consumed by the failed run are added back to the current period's included quota. The transcription effectively "didn't happen" against your monthly pool. (If you were already past the soft cap and the failed run fell through to wallet PAYG, the credit goes back to the wallet instead — same as the PAYG track.)
Quality issues → re-transcribe at no charge. If a completed transcript wasn't accurate enough, the auto-escalation ladder usually fixes it. If not, click Re-transcribe + denoise on the transcript row to force-run the higher tiers (denoise + medium / large model). No new charge, no new transcript row — the existing one is refreshed in place.
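The idempotency property of the first two tracks (a retry can't double-credit) can be illustrated with a small in-memory model. This is a sketch under stated assumptions: the `Account` and `FailedJob` classes and their fields are hypothetical, and a real system would perform the duplicate check and the balance update inside a single database transaction.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Account:
    wallet_cents: int = 0
    included_minutes: float = 0.0
    credited_jobs: Set[str] = field(default_factory=set)  # jobs already made good


@dataclass(frozen=True)
class FailedJob:
    job_id: str
    wallet_cents_charged: int = 0   # PAYG portion, including Pro overflow
    pool_minutes_used: float = 0.0  # portion taken from the Pro included pool


def make_good(account: Account, job: FailedJob) -> bool:
    """Credit a failed job back exactly once; return False on a repeat call."""
    if job.job_id in account.credited_jobs:
        return False  # already credited, so a retry changes nothing
    account.wallet_cents += job.wallet_cents_charged
    account.included_minutes += job.pool_minutes_used
    account.credited_jobs.add(job.job_id)
    return True
```

Keying the credit on the job ID is what makes a retried make-good harmless: the second call finds the ID already recorded and leaves both balances untouched.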

What we do not refund:

Subscription charges to your card. Pro is a flat $15/month plan; we do not prorate, refund, or partially reverse subscription charges via Stripe. Cancel anytime in the Customer Portal and you keep Pro through the end of the current billing period; no further charges. Locked policy as of 2026-05-27.
Successfully-completed transcripts you're unhappy with for stylistic reasons. The auto-ladder + the Re-transcribe button cover quality issues; we don't reverse charges for "I didn't like the formatting" or "the speakers weren't labeled how I expected".

Why no Stripe-side refunds? Money flowing back through Stripe creates accounting friction (Stripe fees aren't refunded, dispute risk goes up, your statement gets messy). In-system credits — wallet cents for PAYG, included minutes for Pro — are faster, reversible, and don't cost you anything in fees or cross-currency conversion. If Stripe themselves issue a refund (rare: a chargeback or a duplicate-charge dispute that you win), our books simply reflect the reversed charge: the corresponding wallet balance is debited to keep the audit trail straight.

9. Improving your audio

The single biggest accuracy lever is the audio you upload. Two minutes of attention upstream is worth more than any number of re-transcribes:

Record close to the source. A lavalier mic six inches from the speaker beats a phone across the room by 20+ percentage points.
Mute the music. If you're recording a podcast / interview / vlog with a soundtrack, record the speech track separately and add music in post. The denoise pre-pass helps but it's not a miracle.
One mic per speaker for panels and multi-person interviews. Single-mic recordings of group conversations lose ~15 pts of accuracy versus per-speaker mics.
Pick a quiet room. Tile bathrooms, parking garages, and reflective spaces add room echo that no denoiser fully removes. Soft furnishings + carpet = better transcripts.
Use 16 kHz mono or higher. Whisper resamples to 16 kHz internally; under-sampled or lossy inputs lose detail that can't be recovered.
Keep clips under 20 minutes when possible. Shorter clips trigger less hallucination cascade behaviour. For long content, splitting at natural breaks (chapters, scene changes) gives consistently higher accuracy than a single 2-hour file.
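When a long recording has no obvious chapter markers, an even split is a reasonable fallback. The sketch below computes equal-length chunks that each stay within 20 minutes; the function name is hypothetical, and splitting at natural breaks, as recommended above, remains preferable because an even split can cut through a sentence.

```python
import math
from typing import List, Tuple


def split_plan(
    duration_seconds: float, max_chunk_seconds: float = 20 * 60
) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds for equal-length chunks."""
    if duration_seconds <= 0:
        return []
    # Smallest number of chunks that keeps each one within the limit.
    count = math.ceil(duration_seconds / max_chunk_seconds)
    chunk = duration_seconds / count
    return [
        (i * chunk, min((i + 1) * chunk, duration_seconds))
        for i in range(count)
    ]
```

A 2-hour file (7200 seconds) yields six 20-minute chunks, and a 25-minute file yields two chunks of 12.5 minutes each rather than one 20-minute chunk and a 5-minute remainder.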

10. Reporting an issue

If a transcript came back wrong in a way the auto-escalation ladder didn't catch, or you think you were charged for something we shouldn't have charged for, email [email protected] with the transcript ID (visible at the top of the transcript card) and a short description. We read every message.

For all other issues, the FAQ at /support is the fastest first stop.

This document is the source of truth for Ask Giya accuracy expectations. The bands and the ladder are implemented in code at quality.py; this page renders from the same module so the numbers can't drift.