How to Transcribe UX Research Interviews Step by Step

A 45-minute usability session, three follow-ups in your tab queue, and a synthesis deck due Friday. The recording sits on your laptop, and the difference between a Friday review and a weekend of replays is a clean, timestamped transcript.

UX research interviews aren't lectures. People hesitate, restart sentences, and talk over each other when there's a moderator and a notetaker in the same call. Get the transcript right and the analysis nearly writes itself. Get it wrong and you're scrubbing audio every time a stakeholder asks what P4 said about checkout.

Here's the workflow that actually holds up.

Why transcripts matter for UX research

You can take notes during the session. You can pull quotes from memory. Neither survives a stakeholder who wants the exact phrase a participant used when they couldn't find the cart icon.

A searchable transcript turns six hours of interviews into a Cmd-F problem. Search "couldn't find", "expected", "frustrating", and the themes surface themselves. Tag the quotes with the participant ID and the timestamp, and your synthesis has receipts.
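
If you'd rather script that search across a whole study, here's a minimal sketch. It assumes each transcript line looks like `[HH:MM:SS] Speaker: utterance`, which is a hypothetical format; the `find_quotes` name and the sample lines are illustrative, so adjust the pattern to whatever your tool exports.

```python
import re
from typing import NamedTuple

class Hit(NamedTuple):
    participant: str
    timestamp: str
    text: str

# Assumed line format (not from any specific tool): "[HH:MM:SS] Speaker: utterance"
LINE_RE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\]\s+([^:]+):\s+(.*)$")

def find_quotes(transcript: str, phrases: list[str]) -> list[Hit]:
    """Return every utterance containing any phrase (case-insensitive)."""
    lowered = [p.lower() for p in phrases]
    hits = []
    for line in transcript.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank lines and anything not in the assumed format
        timestamp, speaker, text = match.groups()
        if any(p in text.lower() for p in lowered):
            hits.append(Hit(speaker, timestamp, text))
    return hits

sample = """\
[00:03:10] Moderator: Where would you go to check out?
[00:03:18] P4: I expected the cart in the top right, but I couldn't find it.
[00:03:25] P4: That was frustrating.
"""

for hit in find_quotes(sample, ["couldn't find", "frustrating"]):
    print(f"{hit.participant} @ {hit.timestamp}: {hit.text}")
```

Each hit carries the participant ID and timestamp, so a quote can go straight into the deck with its receipt attached.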

Done well, the transcript is your audit trail. Proof that the insight came from the data, not the prevailing opinion in the room.

What to record before you transcribe

The cleanest transcripts start at the recording, not the transcription tool. A few things matter more than they should:

Separate audio tracks. If your platform (Zoom, Lookback, dscout, UserInterviews.com) gives you per-speaker audio, take it. Speaker labels become trivially accurate, and the moderator's voice doesn't bleed into the participant's quotes.
A real mic on the participant. Built-in laptop mics catch keyboard taps, A/C hum, and the dog. The word error rate on a noisy 45-minute interview can climb past 10%.
Quiet room, no headphone bleed. Open-back headphones leak the moderator's voice back into the participant's track and confuse diarization.
One file, not many. Hit stop once, not seven times. Stitching segments together is where timestamps drift.

For the deeper version, see best practices for audio quality before transcribing.

Before you start

You need a recording (audio or video), a transcription tool that does speaker diarization, and a coding tool you already use for synthesis (Dovetail, ATLAS.ti, NVivo, MAXQDA, or even a tagged Notion board). The end state is a speaker-labeled, timestamped transcript per participant, ready to code for themes.

Step by step: from recording to a coding-ready transcript

Export the highest-quality audio you have.

WAV if your recorder gives it, otherwise the original M4A from Zoom Cloud Recording, not a re-encoded MP3. Expected: one audio file per session. An uncompressed WAV of a 45-minute call usually runs 100 MB or more; an original M4A is much smaller, and that's fine. Common pitfall: exporting MP3 at 64 kbps strips the high frequencies diarization uses to separate voices. Re-export at the highest setting available.
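
As a rough sanity check on file size, the arithmetic can be sketched like this. The function names and the sample rates are assumptions for illustration; your recorder's actual settings may differ, and the WAV header is ignored.

```python
def pcm_wav_megabytes(minutes: float, sample_rate_hz: int = 44_100,
                      bit_depth: int = 16, channels: int = 2) -> float:
    """Approximate size of uncompressed PCM audio in MB (10^6 bytes)."""
    bytes_per_second = sample_rate_hz * (bit_depth // 8) * channels
    return bytes_per_second * minutes * 60 / 1_000_000

def compressed_megabytes(minutes: float, kbps: float) -> float:
    """Approximate size of a constant-bitrate compressed file in MB."""
    return kbps * 1000 / 8 * minutes * 60 / 1_000_000

print(round(pcm_wav_megabytes(45)))                                    # stereo, 44.1 kHz, 16-bit
print(round(pcm_wav_megabytes(45, sample_rate_hz=16_000, channels=1)))  # mono, 16 kHz, 16-bit
print(round(compressed_megabytes(45, kbps=64), 1))                      # 64 kbps export
```

If a 45-minute export lands far below these numbers, it was probably compressed harder than you intended.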

Upload to a transcription tool with speaker diarization.

You want labeled speakers (Speaker 1 / Speaker 2) and timestamps on every utterance. Drop the file into the tool and let it process. Expected: a transcript with timecodes and Speaker N labels, ready in under five minutes for a 45-minute file.

Rename the speakers and skim for accuracy.

Replace "Speaker 1" with "Moderator" and "Speaker 2" with the participant code you use (P4, P12, whatever your study calls them). Scan for proper nouns, product names, and acronyms; those are where AI transcription gets words wrong most often. Expected: every quote is attributable. Common pitfall: if labels get swapped mid-interview, listen back to a 10-second clip near the swap. The cause is almost always a stretch of crosstalk.
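
If your tool doesn't let you rename speakers in place, the swap is easy to script. This is a sketch that assumes the same hypothetical `[HH:MM:SS] Speaker N: utterance` line format; `rename_speakers` is an illustrative name, not part of any tool. It only touches the label at the start of a line, so a participant saying "Speaker 1" out loud is left alone.

```python
import re

# Assumed line format: "[HH:MM:SS] Speaker N: utterance"
LABEL_RE = re.compile(r"^(\[\d{2}:\d{2}:\d{2}\]\s+)([^:\n]+)(:)", re.MULTILINE)

def rename_speakers(transcript: str, names: dict[str, str]) -> str:
    """Replace diarization labels at the start of each utterance only."""
    def swap(match: re.Match) -> str:
        prefix, label, colon = match.groups()
        # Labels missing from the mapping are kept unchanged.
        return prefix + names.get(label, label) + colon

    return LABEL_RE.sub(swap, transcript)

raw = (
    "[00:00:05] Speaker 1: Thanks for joining.\n"
    "[00:00:09] Speaker 2: Happy to help."
)
print(rename_speakers(raw, {"Speaker 1": "Moderator", "Speaker 2": "P4"}))
```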

Decide verbatim or clean read.

For thematic analysis, clean read usually wins: filler words removed, false starts collapsed. For discourse analysis or studying hesitation as a signal, go verbatim. We unpacked the tradeoff in verbatim vs intelligent transcription.
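
To make "clean read" concrete, here's a rough sketch of what filler removal and false-start collapsing look like in code. The filler list and the repeat rule are assumptions, not how any particular transcription tool does it, and the repeat rule is a blunt heuristic, so treat the output as a draft to review.

```python
import re

# Assumed filler list; extend it to match how your participants talk.
FILLER_RE = re.compile(r"\s*,?\s*\b(?:um+|uh+|erm?)\b\s*,?\s*", re.IGNORECASE)
# Immediate repeats such as "I, I" or "the the". A heuristic: it will also
# collapse legitimate repeats like "had had", so review the result.
REPEAT_RE = re.compile(r"\b(\w+)(?:,?\s+\1\b)+", re.IGNORECASE)

def clean_read(utterance: str) -> str:
    text = FILLER_RE.sub(" ", utterance)        # drop fillers and their commas
    text = REPEAT_RE.sub(r"\1", text)           # collapse immediate repeats
    text = re.sub(r"\s+([.,!?])", r"\1", text)  # no space before punctuation
    text = re.sub(r"\s+", " ", text).strip()    # tidy leftover whitespace
    return text[:1].upper() + text[1:]

print(clean_read("Um, I, I was, uh, looking for the, the cart."))
# I was looking for the cart.
```

Keep the verbatim original alongside the cleaned version. If hesitation turns out to matter later, you can't recover it from a clean read.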

Export to your coding tool.

Most teams export plain text or DOCX into Dovetail, ATLAS.ti, NVivo, or MAXQDA. Keep the timestamps. You'll want them when a stakeholder asks where exactly the quote came from.

How do you handle multi-speaker interviews?

Two-speaker interviews (moderator + participant) are the easy case. It gets messy when a notetaker chimes in, a stakeholder observer asks a follow-up, or a co-design session has three or more participants in the room.

A few things that consistently hold up:

Diarization gets harder with each added voice. Three people sharing one mic in one room is the hardest case. If you can give each person their own channel, do.
Crosstalk drops accuracy. When two voices overlap, even the best AI hands you a guess. Moderate one-at-a-time. Both your analysis and your participants will thank you.
Background voices register as speakers. If a kid yells in the next room, some tools create a "Speaker 4" for that snippet. Delete those rows and move on.
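
One way to spot those stray speakers before you delete anything is to count how much each label actually says. A minimal sketch, assuming you already have the transcript as (speaker, text) pairs; the `stray_speakers` name and the 10-word threshold are arbitrary choices for illustration.

```python
from collections import Counter

def stray_speakers(utterances: list[tuple[str, str]], min_words: int = 10) -> list[str]:
    """Return speaker labels whose total word count falls below min_words."""
    words = Counter()
    for speaker, text in utterances:
        words[speaker] += len(text.split())
    return sorted(s for s, n in words.items() if n < min_words)

session = [
    ("Speaker 1", "Can you walk me through how you would check out from here?"),
    ("Speaker 2", "Sure, I would look for a cart icon somewhere near the top."),
    ("Speaker 4", "Mom, where are my shoes?"),
]
print(stray_speakers(session))  # ['Speaker 4']
```

A quiet notetaker will trip the same threshold, so read the flagged rows before removing them.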

If you're curious what's happening under the hood, here's a plain-English guide to speaker diarization.

How do you code a transcript for themes?

Coding is where the transcript earns its keep. The workflow most teams converge on:

Open coding. Read through and highlight anything interesting. Attach a short label ("can't find cart", "expects FAQ", "wants live chat"). Don't overthink it on the first pass.
Cluster into themes. After two or three interviews, codes start grouping. A theme is a code that shows up across multiple participants.
Tag quotes back to participant IDs. Each theme should be traceable to at least two or three participants; otherwise it's an anecdote, not a finding.
Pick the money quote per theme. This is the line you'll use in the deck. A clean transcript turns that into five minutes of work.
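
The traceability check in that workflow can be sketched in a few lines. The data shapes here are assumptions for illustration: coded quotes as (participant, code) pairs and themes as a mapping from theme name to codes. Your coding tool will have its own export format.

```python
def theme_support(coded_quotes: list[tuple[str, str]],
                  themes: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each theme to the participants with at least one quote under it."""
    code_to_theme = {code: theme for theme, codes in themes.items() for code in codes}
    support: dict[str, set[str]] = {theme: set() for theme in themes}
    for participant, code in coded_quotes:
        theme = code_to_theme.get(code)
        if theme is not None:  # codes not yet clustered into a theme are skipped
            support[theme].add(participant)
    return support

quotes = [
    ("P1", "can't find cart"),
    ("P4", "can't find cart"),
    ("P6", "cart icon unclear"),
    ("P2", "wants live chat"),
]
themes = {
    "Cart discoverability": {"can't find cart", "cart icon unclear"},
    "Live chat": {"wants live chat"},
}

for theme, participants in theme_support(quotes, themes).items():
    verdict = "finding" if len(participants) >= 2 else "anecdote"
    print(f"{theme}: {sorted(participants)} -> {verdict}")
```

The threshold of two participants mirrors the rule of thumb above; raise it for larger studies.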

Coding tools like Dovetail, ATLAS.ti, and NVivo automate the mechanics. They don't think for you.

Verbatim or intelligent transcription: which fits UX research?

For most synthesis work, intelligent (clean read) transcription is what you want. You're after the content of what the participant said, not their speech patterns. Filler words clutter the page and don't add insight.

The exception is when how something is said is the data. If you're studying confusion, hesitation, or emotional response, the "um"s and pauses matter. Same for accessibility research where pace and clarity are themes.

How long does it take to transcribe a UX interview?

AI transcription of a 45-minute interview takes 1 to 5 minutes of processing. Speaker cleanup and a skim review add another 10 to 20 minutes. Plan on roughly 30 minutes of human time per hour of audio.

That's roughly an 8 to 12x speedup over manual transcription, which historically runs 4 to 6 hours per hour of audio. For a six-participant study of 45-minute sessions, that's a little over 2 hours of review instead of 18 to 27 hours of typing. For what to expect on word-level accuracy, see transcription accuracy: what to expect.
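
To budget a specific study, plug your own numbers into the same rates. This sketch uses only the per-hour figures quoted above (about 30 minutes of human time per audio hour with AI, 4 to 6 hours manually); the `study_hours` helper is a hypothetical name.

```python
def study_hours(sessions: int, minutes_each: float, hours_per_audio_hour: float) -> float:
    """Human hours needed for a study at a given rate per hour of audio."""
    audio_hours = sessions * minutes_each / 60
    return audio_hours * hours_per_audio_hour

assisted = study_hours(6, 45, 0.5)   # AI transcript plus cleanup and skim
manual_low = study_hours(6, 45, 4)   # manual transcription, fast typist
manual_high = study_hours(6, 45, 6)  # manual transcription, slower pace

print(f"AI-assisted: {assisted} h")
print(f"Manual: {manual_low} to {manual_high} h")
```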

What about confidentiality and IRB?

If your study is IRB-approved (university research) or regulated (healthcare, financial services), check your transcription provider on four things:

They don't train models on your uploaded files.
You can delete data on demand.
They publish a retention policy you're comfortable with.
They'll sign a DPA or BAA if your participants are patients or otherwise covered.

For pure commercial research without PHI or PII, most modern AI transcription tools clear the bar. For sensitive work, default to providers who explicitly state "no training on your data", and confirm it in writing before you upload.
