A focus group looks easy on paper. Six people in a room (or on a call), a moderator with a script, an hour of conversation. Then you sit down with the recording and realize three people were talking at once, somebody coughed over the key quote, and you have no idea which voice belongs to which name.

Focus groups are some of the hardest audio you'll ever ask software to transcribe: multiple speakers, overlapping speech, side conversations, and the moderator's small talk bleeding into the actual session. Here's how to get a transcript you can code without re-listening to the whole recording.

Why focus groups are harder than interviews

A one-on-one interview is two voices and a clear hand-off. A focus group is six to ten voices with no hand-off, and the most useful moments are usually the ones where people start interrupting each other.

That breaks naive transcription in three ways. First, overlapping speech confuses diarization models that assume one speaker at a time. Second, voices recorded through a single shared microphone sound flatter and more similar than they should, so the model can't separate them by tone. Third, the participants who matter most, the ones with strong opinions, tend to talk fastest and quietest, and that is exactly the speech AI tends to drop words from.

If you've handled a Zoom recording with multiple speakers before, this is the same problem, only harder.

How do you set up the recording so it transcribes cleanly?

The biggest accuracy lever is the microphone, not the software. A good boundary mic in the middle of a six-person table will outperform an expensive AI model fed phone audio every time.

For in-person focus groups, use a USB conference mic with omnidirectional pickup (Jabra Speak, Logitech Group, or a boundary mic like the MXL AC404). Place it in the center of the table, away from laptops and HVAC vents. If your room is bigger than a four-top, add a second mic and record to a second track.

For remote focus groups, ask every participant to wear headphones and use a wired headset mic, then record in Zoom or Teams with per-participant audio turned on (Zoom labels the setting "Record a separate audio file of each participant"). That single setting changes the transcript from a hot mess into something a model can label cleanly, because each speaker is now on their own track.

More on the upstream side here: audio quality before you hit record.

Step by step: transcribe a focus group recording

1. Export each speaker as a separate track if you can.

From Zoom: Settings → Recording → "Record a separate audio file of each participant." From Teams: pull the per-participant track in the post-meeting download. From a physical recorder: one .wav per mic.

2. Loudness-normalize and trim the dead air.

Use Audacity or Adobe Audition. Drop to mono unless you have a reason to keep stereo. Cut the pre-roll where the moderator is still setting up coffee.

3. Upload the file (or files), with speaker diarization on for mixed audio.

If you have per-speaker tracks, you don't need diarization, because the tracks already separate the voices. If you only have one mixed track, diarization is what gives you "Speaker 1 / Speaker 2 / …" labels.

4. Rename Speaker 1 → "Maria (P3)" right away.

Don't wait. Scrub to each speaker's first minute of talk and assign their real participant ID. If you wait, you'll forget who's who.

5. Skim for the obvious model errors.

Acronyms, brand names, and code-switching (English → Spanish → English mid-sentence) are where AI drops words. Fix those first.

6. Export to .docx or plain text and import into your coding tool.

NVivo, ATLAS.ti, Dedoose, and Taguette are common choices. Most accept timestamped transcripts and will let you click a quote back to the audio.
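If your transcription tool gives you raw segments rather than a finished document, the labeling and export steps above can be sketched in a few lines of Python. This is a minimal illustration, not any particular tool's export format: the `Segment` structure, the `build_transcript` helper, and the sample data are all hypothetical. It sorts segments by start time (useful when they come from separate per-speaker tracks), swaps generic labels for participant IDs, and writes timestamped plain text.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    # Hypothetical shape for one transcribed utterance.
    start: float   # seconds from the start of the recording
    speaker: str   # generic label such as "Speaker 1"
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS, dropping fractions of a second."""
    total = int(seconds)
    hours, rem = divmod(total, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def build_transcript(segments, names):
    """Sort segments by start time and swap generic labels for participant IDs."""
    lines = []
    for seg in sorted(segments, key=lambda s: s.start):
        label = names.get(seg.speaker, seg.speaker)  # unknown labels stay as-is
        lines.append(f"[{format_timestamp(seg.start)}] {label}: {seg.text}")
    return "\n".join(lines)


# Made-up sample data, deliberately out of order.
segments = [
    Segment(75.2, "Speaker 2", "I stopped using it after the price change."),
    Segment(12.0, "Speaker 1", "Thanks for joining, let's get started."),
    Segment(3605.0, "Speaker 2", "That's all from me."),
]
names = {"Speaker 1": "Moderator", "Speaker 2": "Maria (P3)"}

transcript = build_transcript(segments, names)
print(transcript)
```

Keeping the name mapping in one dictionary means a mislabeled speaker is a one-line fix, and any label you have not mapped yet stays visible as "Speaker N" instead of silently disappearing.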

How accurate is AI on overlapping voices?

Honest answer: AI handles two overlapping speakers reasonably well and three-or-more overlapping speakers poorly. You'll get the gist right but lose words at the overlaps.

If you separated tracks at the source (per-participant recording in Zoom, or one mic per person in the room), accuracy on each individual track is roughly the same as a clean interview: 92-96% word accuracy on a good microphone. If you only have a single mixed track from a room mic, expect 85-92%, and budget time to fix the overlaps by hand.

What that means for budget: a one-hour focus group with per-speaker tracks needs ~15-20 minutes of cleanup. A one-hour focus group recorded on one room mic with five participants can need a full hour of cleanup. Plan the higher number.
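To see where your own recording falls in those accuracy ranges, one option is to hand-correct a short excerpt and compare it against the raw AI output. The sketch below assumes the common definition of word accuracy as 1 minus word error rate, where the error count is the word-level edit distance (substitutions, insertions, deletions) between the two texts. The function names and the sample sentences are illustrative only.

```python
import string


def normalize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(string.punctuation).lower() for w in text.split())
    return [w for w in words if w]


def word_edit_distance(reference, hypothesis):
    """Minimum number of word substitutions, insertions, and deletions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def word_accuracy(reference_text, hypothesis_text):
    """Return 1 - WER; can go below zero if the AI inserts many extra words."""
    reference = normalize(reference_text)
    hypothesis = normalize(hypothesis_text)
    if not reference:
        raise ValueError("reference transcript is empty")
    errors = word_edit_distance(reference, hypothesis)
    return 1 - errors / len(reference)


# Hand-corrected excerpt vs. raw AI output (made-up example, one dropped word).
corrected = "I stopped using it after the price change, honestly."
ai_output = "I stopped using after the price change honestly"
print(f"word accuracy: {word_accuracy(corrected, ai_output):.1%}")
```

A two- or three-minute excerpt taken from a stretch with heavy overlap gives a more honest number than the calm opening of the session.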

How long does a one-hour focus group actually take to transcribe?

The recording is one hour. The transcript is not.

AI transcription itself takes 3-10 minutes for a one-hour file. Diarization adds another couple of minutes. The work is what comes after: speaker labeling (5-10 minutes), an accuracy pass (15-30 minutes), and a final pass where you fix names of products, people, and places the model didn't know (5-10 minutes).

So a one-hour focus group usually takes 25-50 minutes of human time, or roughly 30-60 minutes end to end once the AI run is included. That is still about 4-12x faster than manual transcription, which runs 4-6 hours per recorded hour for a focus group with overlap.
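The arithmetic behind that estimate is simple enough to rerun with your own numbers. The sketch below adds up the minute ranges quoted above. The 2-minute figure for diarization is an assumption standing in for "another couple of minutes", and the helper name `add_ranges` is made up for this example. Treat the output as a sanity check on rounded figures, not a precise forecast.

```python
def add_ranges(*ranges):
    """Add (low, high) minute ranges element-wise."""
    low = sum(r[0] for r in ranges)
    high = sum(r[1] for r in ranges)
    return (low, high)


# All values in minutes, for a one-hour recording.
ai_run = (3, 10)
diarization = (2, 2)          # assumed value for "another couple of minutes"
speaker_labeling = (5, 10)
accuracy_pass = (15, 30)
name_fixes = (5, 10)
manual = (4 * 60, 6 * 60)     # manual transcription: 4-6 hours

human = add_ranges(speaker_labeling, accuracy_pass, name_fixes)
end_to_end = add_ranges(ai_run, diarization, human)

# Worst case pairs the fastest manual time with the slowest AI workflow;
# best case pairs the slowest manual time with the fastest AI workflow.
speedup = (manual[0] / end_to_end[1], manual[1] / end_to_end[0])

print(f"human time: {human[0]}-{human[1]} min")
print(f"end to end: {end_to_end[0]}-{end_to_end[1]} min")
print(f"speedup vs manual: {speedup[0]:.1f}x to {speedup[1]:.1f}x")
```

If your sessions are recorded on a single room mic, widen the accuracy pass range before trusting the total, since overlap cleanup is where the time goes.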

Is paying per minute worth it for a 90-minute session?

Subscription tools like Otter charge $8.33-$20/month with monthly hour caps. Rev charges $1.50/min for human transcription and around $0.25/min for AI. For a research team running ten 90-minute focus groups a year, that's 15 hours of audio, which is where the math gets interesting.

Pay-per-minute with diarization runs cheaper than the equivalent subscription if you're under ~20 hours/month and you don't want to commit to a recurring bill. If you want to transcribe a focus group recording without a subscription, and without locking participant audio into a tool you can't leave, that's exactly the use case we built for.
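For the ten-session example, the yearly comparison can be worked out directly. The sketch below uses the prices quoted in this article: the Rev and Otter figures above and the 8¢/min rate from the pricing note on this page. It is a simplification: it ignores monthly hour caps, free clips, and any plan differences, and the function names are illustrative.

```python
def per_minute_cost(minutes, rate_per_minute):
    """Total cost in dollars for pay-per-minute pricing."""
    return round(minutes * rate_per_minute, 2)


def subscription_cost(months, monthly_fee):
    """Total cost in dollars for a flat monthly subscription."""
    return round(months * monthly_fee, 2)


sessions = 10
minutes_per_session = 90
total_minutes = sessions * minutes_per_session  # 900 minutes, i.e. 15 hours

annual = {
    "per-minute AI at $0.08": per_minute_cost(total_minutes, 0.08),
    "per-minute AI at $0.25": per_minute_cost(total_minutes, 0.25),
    "per-minute human at $1.50": per_minute_cost(total_minutes, 1.50),
    "subscription at $8.33/month": subscription_cost(12, 8.33),
    "subscription at $20/month": subscription_cost(12, 20.00),
}

for label, cost in annual.items():
    print(f"{label}: ${cost:,.2f}")
```

At 900 minutes a year, the 8¢/min rate comes to $72, against roughly $100 to $240 for a year of the subscription prices quoted above. Swap in your own volume and current prices before deciding, since vendors change both.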

Transcribe your video with VTS: paste any public link or upload a file and get a clean transcript in minutes. The first 3 clips every month are free with no card required, and there is no subscription; after the free clips the rate is 8¢/min.

What do you do with the transcript once you have it?

Read it twice before you code anything. The first read is to remember what happened. The second is to mark the quotes that surprised you. Those are usually the ones that actually move the analysis.

Then import into your coding software, build a codebook, and start tagging. The transcript is the source of truth; the codes are how you summarize. Saldaña's The Coding Manual for Qualitative Researchers is still the standard reference for the workflow if you've never coded qualitative data before.

If you're doing a study from scratch and want the wider arc, from recruitment through write-up, there's more here: speeding up qualitative research with transcription.

Sources