GuardScribe: Real-time speech-to-text with toxicity filtering

August 25, 2025 · 5 min read
speech · nlp · moderation · realtime · whisper · wav2vec2 · distilbert

TL;DR

I built a real-time “listen → transcribe → filter → show” pipeline. It skips silence, turns speech into text, checks whether the text is toxic, and masks only the toxic words. There are two speech-to-text (ASR) engines: wav2vec2 (fast) and Whisper (more accurate). With wav2vec2, average end-to-end latency is under 400 ms. With Whisper, it’s ~1.1–1.8 s, but the text is cleaner. The sentence-level toxicity model reached F1 0.807, Precision 0.835, Recall 0.781, PR-AUC 0.845.

This started as a Natural Language Processing (NLP) course project. Not a paper — just a practical build to learn the full workflow and measure latency.

Why I built this

Live audio is everywhere: streams, games, voice chat, assistants. Most moderation happens after the fact. I wanted to see if we can catch harmful language as it happens without making the conversation feel slow. The goal: fast enough, accurate enough, and not annoying.

What’s inside

  • Silence skipping (VAD). WebRTC VAD only lets speech frames through, so we don’t waste time on silent parts.

  • Two ASR choices.

    • wav2vec2-base → lowest latency
    • Whisper-base → higher accuracy (uses a small buffer)
  • Two-step toxicity check (DistilBERT).

    1. Sentence classifier decides if the line is toxic.
    2. If yes, a span tagger finds the exact toxic words to mask.
  • Smart masking. Only the toxic spans get masked. Everything else stays readable.

  • Parallel pipeline. Audio capture, ASR, and moderation run in parallel with small buffers to balance speed and context.

How it works (under the hood)

Capture & gate: Mic audio at 16 kHz (mono) goes through WebRTC VAD. Only frames tagged as speech enter the queue.

Transcribe: A small buffer groups speech frames. For Whisper I use a ~2 s window with a 0.5 s step to keep context while staying responsive. Each chunk is sent to the chosen ASR backend and the ASR latency is recorded.
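
As a concrete illustration of that windowing (a sketch, not the repo’s exact code; audio_q, transcript_q, asr, and running follow the pseudocode below):

Code
# Overlapping ~2 s windows advanced in 0.5 s hops, fed from the VAD-gated queue.
import numpy as np

SAMPLE_RATE = 16000
WINDOW_SAMPLES = int(2.0 * SAMPLE_RATE)   # ~2 s of context
HOP_SAMPLES = int(0.5 * SAMPLE_RATE)      # refresh every 0.5 s

window = np.zeros(0, dtype=np.float32)    # frames assumed to be float32 arrays
while running:
    hop = []
    while sum(len(f) for f in hop) < HOP_SAMPLES:
        hop.append(audio_q.dequeue())                 # speech frames only
    window = np.concatenate([window, *hop])[-WINDOW_SAMPLES:]
    text, asr_ms = asr.transcribe(window)             # Whisper sees the full window
    transcript_q.enqueue(text, asr_ms)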

Moderate: The sentence classifier scores the transcript. If the probability crosses a threshold (default 0.70), the token‑level tagger predicts BIO spans for toxic words. I mask just those spans (or the whole sentence if token masking is disabled). The final output is the cleaned text and the total latency = ASR + moderation (+ optional span tagging).

Pseudocode snapshot

Code
# AudioCapture thread: gate mic frames with VAD, push speech into a queue.
while running:
  frame = mic.read()
  if vad.is_speech(frame):
    audio_q.enqueue(frame)

# SpeechToText thread: buffer speech, transcribe a chunk, record ASR latency.
buffer = []
while running:
  frame = audio_q.dequeue()
  buffer.append(frame)
  if trailing_silence(buffer) > SILENCE_MS or duration(buffer) >= BUFFER_MS:
    audio = concat(buffer)
    text, asr_ms = asr.transcribe(audio)
    transcript_q.enqueue(text, asr_ms)
    buffer.clear()

# ModerateText thread: sentence check first, span tagging only when needed.
while running:
  text, asr_ms = transcript_q.dequeue()
  t0 = now()
  p = sentence_clf(text)            # toxicity probability
  if p >= TOX_THRESHOLD:            # default 0.70
    spans = token_tagger(text)      # BIO spans for toxic words
    clean = mask(text, spans)
  else:
    clean = text
  moderation_ms = now() - t0
  total_ms = asr_ms + moderation_ms
  out_q.enqueue(clean, total_ms)
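
To make the moderation and masking steps concrete, here’s a hedged sketch built on Hugging Face pipelines. The checkpoint paths, the toxic label name, and the mask_spans helper are illustrative assumptions, not the repo’s exact code.

Code
from transformers import pipeline

# Hypothetical fine-tuned checkpoints (placeholder paths).
sentence_clf = pipeline("text-classification", model="./distilbert-jigsaw")
span_tagger = pipeline("token-classification", model="./distilbert-toxic-spans",
                       aggregation_strategy="simple")

def mask_spans(text, spans, mask_char="*"):
    # Replace only the flagged character ranges; keep word boundaries readable.
    chars = list(text)
    for start, end in spans:
        for i in range(start, min(end, len(chars))):
            if not chars[i].isspace():
                chars[i] = mask_char
    return "".join(chars)

def moderate(text, threshold=0.70):
    result = sentence_clf(text)[0]   # e.g. {"label": "toxic", "score": 0.93}
    if result["label"] != "toxic" or result["score"] < threshold:
        return text
    spans = [(ent["start"], ent["end"]) for ent in span_tagger(text)]
    return mask_spans(text, spans)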

Datasets and models

  • ASR checks: LJSpeech for WER, CER, and latency
  • Toxicity training: Jigsaw Toxic Comments (sentence level) and Toxic Spans (character-level tagging)
  • Backbones: DistilBERT for moderation, wav2vec2 or Whisper for ASR

Training & eval setup (quick facts)

  • Fine‑tuning: DistilBERT (sentence classifier) for 3 epochs on Jigsaw; DistilBERT (token BIO head) for 3 epochs on Toxic Spans (a minimal training sketch follows this list).
  • Hardware for fine‑tuning: single T4 GPU.
  • ASR eval: 100 random LJSpeech utterances.
  • Toxicity eval: 2,000 Toxic Spans test sentences (token tagging) + held‑out Jigsaw split for sentence classification.
  • Whisper streaming config: ~2 s buffer, 0.5 s hop (overlapping windows for context).
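
For reference, a minimal sketch of what the sentence-classifier fine-tuning could look like with the Hugging Face Trainer; everything beyond DistilBERT, Jigsaw, and the 3 epochs (batch size, learning rate, dataset objects) is an assumption.

Code
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)   # toxic vs. non-toxic

args = TrainingArguments(
    output_dir="distilbert-jigsaw",
    num_train_epochs=3,                # matches the setup above
    per_device_train_batch_size=16,    # assumed; fits a single T4
    learning_rate=2e-5,                # assumed
)

# train_ds / eval_ds: tokenized Jigsaw splits (construction omitted here).
trainer = Trainer(model=model, args=args, train_dataset=train_ds, eval_dataset=eval_ds)
trainer.train()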

Results

ASR quality and latency

ASR model       WER %   CER %   Typical E2E latency   RTF
Whisper-base     6.7     1.9    ~1.2 s                0.31
wav2vec2-base   10.0     4.8    ~139 ms               0.02

Toxicity detection (sentence level)

Metric      Value
F1          0.807
Precision   0.835
Recall      0.781
PR-AUC      0.845
Threshold   0.70

Span tagging (word/char level)

Task          Metric   Value
Toxic spans   F1       0.623

End-to-end latency by mode

Path            Typical E2E   Best-fit use case
wav2vec2 path   < 400 ms      Interactive chat and games
Whisper path    1.1–1.8 s     Streams and broadcasts

Design choices that helped

  • VAD first to cut silence and reduce perceived lag

  • Short streaming buffer for Whisper to keep context without big delays

  • Two-stage moderation so the span tagger only runs when needed

  • Threshold tuning (default 0.7) to lower false positives (a tuning sketch follows this list)

    • Lower it → more recall (catches more, masks more)
    • Raise it → more conservative (masks less)
  • Span masking beats whole-sentence redaction for readability
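
One way to pick that threshold, as a sketch: assume validation labels y_true and sentence-classifier probabilities y_prob have already been collected, then sweep the precision–recall curve.

Code
import numpy as np
from sklearn.metrics import precision_recall_curve

# y_true: 0/1 labels, y_prob: classifier probabilities on a validation split.
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
f1 = 2 * precision * recall / (precision + recall + 1e-9)
best = int(np.argmax(f1[:-1]))   # thresholds has one fewer entry than precision
print(thresholds[best], precision[best], recall[best])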

Evaluation details (what I measured)

  • ASR metrics: WER, CER, RTF, and measured per‑chunk latency (see the sketch after this list).
  • Moderation metrics: sentence‑level precision/recall/F1/PR‑AUC; token‑level span F1 (BIO tagged spans → character spans).
  • Throughput & latency: total E2E time per utterance and component‑level costs (ASR dominates; token tagging is ~5 ms, sentence scoring is a few ms).

Error patterns (and what to do about them)

  • False positives: harmless idioms (like “break a leg”) and quoted slurs can be flagged. Context checks or allow‑list rules for quoted content would help reduce these.
  • False negatives: implied or multi‑sentence abuse and contextual insults can slip through. Lowering the threshold or adding short memory across turns would help.
  • ASR issues: the worst 5–10% of utterances (fast speech, rare words) hurt moderation the most. Falling back to the more accurate model, or prompting the speaker to repeat, on low‑confidence ASR chunks would help.

Latency budget & runtime notes

  • wav2vec2 path: ~139 ms ASR on CPU‑friendly settings; total < 400 ms E2E is realistic for interactive use.
  • Whisper path: ~1.2 s ASR due to larger context; total 1.1–1.8 s with span masking.
  • Token tagger: ~5 ms per utterance; negligible.
  • Bench box: Quad‑core CPU, 32 GB RAM, GTX 1050. Training/fine‑tuning done on a single T4.

Where this fits

  • Live streams & social audio: tiny delay + span masking to ease manual review
  • In-game voice chat: low-latency path with a conservative threshold
  • Assistants & contact centers: edge/on-device moderation to avoid server trips

Limitations

  • ASR errors can confuse the classifier (homophones are tricky)
  • Context is hard: sarcasm, quotes, multi-turn abuse still cause misses
  • Domain shift: typed comments ≠ spontaneous speech
  • Audio bleeping needs a small delay; text masking is instant

What’s next

  • Hybrid ASR router: start with wav2vec2 for short, clear utterances. If ASR confidence is low or the utterance is longer, switch to Whisper. Keep chunked streaming to cut perceived delay.
  • Short memory + diarization to handle ongoing abuse across turns
  • Edge optimization: quantization/distillation to aim for < 250 ms E2E on laptops
  • Reviewer UI: quick human review for tough calls, with continual learning

Conclusion

I went in thinking model choice would decide everything. It didn’t. VAD, tiny buffers, and masking strategy mattered just as much. The result isn’t perfect, but it’s useful and it runs fast enough to feel natural.

GitHub repo: https://github.com/hdprajwal/Guardscribe

Paper: https://github.com/hdprajwal/Guardscribe/blob/main/toxic_content_filtering_in_stt.pdf