
Mismatch vs Non-Verbal — choosing the right rule

VoiceQC ships with two Auto-rules: Mismatch and Non-Verbal. They look superficially similar — both flag Session items and apply a Property automatically — but they answer different questions. This page is about when each is the right tool, and (just as importantly) what neither catches.

| | Mismatch rule | Non-Verbal rule |
|---|---|---|
| Question it answers | “Did the actor say the right words?” | “Did the actor say words at all?” |
| Runs on | The transcript (after speech recognition) | The audio (before speech recognition) |
| Needs a Script | Yes; without one, nothing to compare against | No |
| Typical Property applied | Status / Mismatch | Status / Non-verbal or Quality / No take |
| Main knob | Normalization toggles + Synonyms | Speech threshold (0.00 to 1.00) |

The order matters. Non-Verbal runs first and removes the takes that aren’t really speech. Mismatch runs second and QCs the takes that are speech against what was supposed to be said.

Use the Mismatch rule when:

  • You have a script — every line was written before it was recorded.
  • You care whether the take matches the script — wrong word, dropped phrase, paraphrase, line ad-libbed.
  • The transcription quality is good enough that “the words differ” is a real signal, not noise from the speech recognizer.

Mismatch is most useful in scripted dialogue workflows: video game voice-over, audiobook narration, animated film looping. It’s less useful for unscripted material (interviews, livestreams) where there’s no canonical text to compare against.
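
To make the "Normalization toggles + Synonyms" idea concrete, here is a minimal sketch of a script-versus-transcript comparison. It is not VoiceQC's implementation: the toggle names (`ignore_case`, `ignore_punctuation`), the `SYNONYMS` table, and both functions are hypothetical, chosen only to show why normalization decides whether "the words differ" is a real signal.

```python
import string

# Hypothetical synonym table: each key is treated as equivalent to its value.
SYNONYMS = {"ok": "okay", "gonna": "going to"}


def normalize(text, ignore_case=True, ignore_punctuation=True, synonyms=None):
    """Reduce a line to a comparable list of words (illustrative toggles only)."""
    if ignore_case:
        text = text.lower()
    if ignore_punctuation:
        text = text.translate(str.maketrans("", "", string.punctuation))
    words = text.split()
    if synonyms:
        # Map each word through the table, then re-split so multi-word
        # replacements such as "going to" become separate words.
        words = " ".join(synonyms.get(w, w) for w in words).split()
    return words


def is_mismatch(script_line, transcript, **toggles):
    """True when the normalized transcript differs from the normalized script."""
    return normalize(script_line, **toggles) != normalize(transcript, **toggles)


print(is_mismatch("Okay, we're going to move.", "ok were gonna move", synonyms=SYNONYMS))
print(is_mismatch("Hold the line.", "hold the door"))
```

With the toggles on and the synonym table supplied, the first pair compares equal and the second does not. Turning `ignore_punctuation` off makes even "Hold the line." versus "hold the line" a mismatch, which is the kind of recognizer noise the toggles exist to suppress.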

Use the Non-Verbal rule when:

  • Your batches include takes that shouldn’t be transcribed at all — slate ins, room tone, fake takes, breath samples, an actor warming up.
  • You want to filter those out automatically rather than scrolling past them in the items table.
  • You don’t necessarily have a script.

Non-Verbal is the trash filter. It catches the stuff that would otherwise pollute your transcription pile. It’s especially valuable on raw production batches that haven’t been hand-curated by an editor first.
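
The Speech threshold can be pictured as a simple cut on a per-take score. The sketch below is an assumption-heavy illustration: the only thing taken from this page is the 0.00 to 1.00 range. The `Take` class, the `speech_score` field, the 0.50 default, and the direction of the comparison (lower score means less speech) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Take:
    name: str
    # Hypothetical score: 0.00 = no speech detected, 1.00 = clearly speech.
    speech_score: float


def flag_non_verbal(takes, threshold=0.50):
    """Return the names of takes whose speech score falls below the threshold."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be between 0.00 and 1.00")
    return [t.name for t in takes if t.speech_score < threshold]


takes = [Take("slate", 0.05), Take("line_01", 0.92), Take("breath", 0.30)]
print(flag_non_verbal(takes))
print(flag_non_verbal(takes, threshold=0.20))
```

Under these assumptions, raising the threshold flags more takes as non-verbal and lowering it lets borderline takes (a breathy whisper, a short grunt) through to transcription.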

The rules compose well:

  1. Non-Verbal removes the noise floor — takes the actor never really delivered.
  2. Mismatch flags the takes where the actor delivered something, but not what was scripted.

What’s left after both run is a clean column of items that are real speech and match the script. The flagged items are the ones a human should listen to.
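
The two-stage order can be sketched as a small triage function. This is only a model of the ordering described above, not VoiceQC code: `triage`, the item fields, and the two predicate functions are hypothetical stand-ins for the real rules.

```python
def triage(items, is_speech, matches_script):
    """Split items into (non_verbal, mismatch, clean), checking speech first."""
    non_verbal, mismatch, clean = [], [], []
    for item in items:
        if not is_speech(item):
            non_verbal.append(item)  # never reaches the script comparison
        elif not matches_script(item):
            mismatch.append(item)
        else:
            clean.append(item)
    return non_verbal, mismatch, clean


# Hypothetical items; field names are illustrative only.
items = [
    {"name": "slate", "speech_score": 0.05, "script": "", "transcript": ""},
    {"name": "line_01", "speech_score": 0.95,
     "script": "hold the line", "transcript": "hold the line"},
    {"name": "line_02", "speech_score": 0.90,
     "script": "hold the line", "transcript": "hold the door"},
]

non_verbal, mismatch, clean = triage(
    items,
    is_speech=lambda it: it["speech_score"] >= 0.5,
    matches_script=lambda it: it["transcript"] == it["script"],
)
print([it["name"] for it in non_verbal])
print([it["name"] for it in mismatch])
print([it["name"] for it in clean])
```

Note the "slate" item: its empty transcript trivially equals its empty script, so a Mismatch-only pass would call it clean. Running the speech check first is what keeps it out of the clean column.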

What neither rule catches is the part most people get wrong on first read, so it’s worth being blunt:

VoiceQC checks content: whether a take contains speech, and whether the words match the script. It does not analyze waveforms for audio quality.

| Audio problem | VoiceQC catches it? |
|---|---|
| Wrong words / dropped lines | Yes (Mismatch) |
| Empty or non-speech takes | Yes (Non-Verbal) |
| Clipping (peaks above 0 dBFS) | No |
| Background noise / room tone | No |
| Mouth clicks, plosives, sibilance | No |
| Loudness / RMS / LUFS issues | No |
| Mix balance / panning / dialogue ducking | No |
| Phoneme-level mispronunciation | No (only catches the words your speech recognizer chose, not how they were pronounced) |

For DSP-level audio analysis, use a dedicated tool — iZotope RX, an offline LUFS meter, your DAW’s analyzer plugins. VoiceQC sits next to those tools, not on top of them.

The “no waveform analysis” choice is deliberate. Two reasons:

  • The transcript layer is where teams actually disagree. Catching clipping tells you the audio engineer didn’t gain-stage well. Catching a mismatch tells you the take is unusable because the actor said the wrong thing, which is a much higher-stakes finding for a voice-over pipeline.
  • Audio analysis is a crowded category. Building a worse version of iZotope RX wouldn’t help anyone. Building the only automated “did they say the script?” check has clear value.

If audio-level QC matters to your workflow, the practical pattern is: run your DSP tool first (catch the clipping, loudness, etc.), then upload the cleaned takes to VoiceQC for content QC.
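
As a rough picture of the "DSP first, content QC second" pattern, the sketch below gates takes on a single peak check before they would be uploaded. It is a toy stand-in for a real DSP tool, which does far more. The function names, the float-sample representation (full scale = 1.0), and the 0 dBFS ceiling are assumptions, and no upload call is shown because this page does not describe one.

```python
import math


def peak_dbfs(samples):
    """Peak level of float samples (full scale = 1.0), in dBFS."""
    peak = max((abs(s) for s in samples), default=0.0)
    return -math.inf if peak == 0.0 else 20.0 * math.log10(peak)


def split_for_upload(takes, ceiling_dbfs=0.0):
    """Return (ready, needs_fix) take names based on a peak check only."""
    ready, needs_fix = [], []
    for name, samples in takes.items():
        if peak_dbfs(samples) >= ceiling_dbfs:
            needs_fix.append(name)  # at or over the ceiling: fix before content QC
        else:
            ready.append(name)
    return ready, needs_fix


# Hypothetical takes as short lists of float samples.
takes = {
    "line_01": [0.10, -0.50, 0.25],
    "line_02": [0.90, 1.00, -1.00],
}
ready, needs_fix = split_for_upload(takes)
print(ready, needs_fix)
```

Only the takes in `ready` would go on to content QC. The ones in `needs_fix` go back to the audio engineer first.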