Speaker Diarization
Speaker Diarization is a core component in an Automatic Speech Recognition (ASR) pipeline that identifies who spoke in an audio file. It involves two primary tasks:
- Detecting the number of speakers.
- Linking each speaker to their corresponding words.
State-of-the-art models and Speaker Diarization libraries utilise advanced deep learning models to perform both tasks, achieving near-human-level performance. This has significantly enhanced the effectiveness of Speaker Diarization in ASR APIs and elevated its importance in various applications.
Steps involved in Speaker Diarization
Speaker Diarization is further split into four major subtasks →
- Detection → Find regions of audio that contain speech, as opposed to silence or noise.
- Segmentation → Divide the audio into smaller segments.
- Representation → Compress the audio information into embeddings, a learned representation of the audio data.
- Attribution → Annotate each segment with a speaker ID based on its embedding.
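The four subtasks can be sketched end-to-end. The snippet below is a minimal toy illustration, not a production pipeline: it assumes detection and segmentation have already produced ten speech segments, fabricates one synthetic embedding per segment (standing in for a learned speaker embedding), and uses agglomerative clustering for attribution. All names and values here are made up for the example.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(42)

# Detection + Segmentation (assumed already done): a VAD found 10 speech
# segments. Here we fabricate one 32-d embedding per segment.
# Segments 0-4 come from "speaker A", segments 5-9 from "speaker B".
centroid_a = rng.normal(0, 1, 32)
centroid_b = rng.normal(0, 1, 32)
embeddings = np.vstack([
    centroid_a + rng.normal(0, 0.05, (5, 32)),  # Representation step output
    centroid_b + rng.normal(0, 0.05, (5, 32)),
])

# Attribution: cluster the embeddings. Setting distance_threshold instead of
# n_clusters lets the algorithm also *detect* the number of speakers.
clusterer = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0)
labels = clusterer.fit_predict(embeddings)

print("speakers found:", clusterer.n_clusters_)
print("segment labels:", labels.tolist())
```

Real systems replace the synthetic vectors with embeddings from a trained speaker model (e.g. x-vectors), but the clustering step works the same way.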
Challenges of building Speaker Diarization in ASR
- Errors at the Clustering stage → under-clustering can merge distinct speakers, labelling speech with too few speakers, while over-clustering can split one speaker's speech across multiple labels, assigning it to the wrong speaker.
- Getting High-Quality Data → Clean, real-world audio is hard to find and then label for training speaker diarization models.
- Multilingual ASR combined with multi-speaker diarization is an even bigger challenge due to the amount of variance present in the audio; it remains a state-of-the-art research topic that many labs are still working to solve.
- Ambient Noise / Crowd Noise → Interferes with the output even after preprocessing; it can distort the total number of speakers detected and their assignment to sentences.
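The noise problem is easy to demonstrate with a naive energy-based voice activity detector. The sketch below is purely illustrative (the `toy_vad` function and its threshold are invented for this example): adding ambient noise pushes silent frames over the energy threshold, so the detector flags them as speech and inflates the regions handed to the diarizer.

```python
import numpy as np

rng = np.random.default_rng(7)
FRAME_LEN = 400  # samples per analysis frame

def toy_vad(signal, threshold=0.05):
    """Naive energy VAD: mark a frame as speech if its mean energy > threshold."""
    frames = signal[: len(signal) // FRAME_LEN * FRAME_LEN].reshape(-1, FRAME_LEN)
    return (frames ** 2).mean(axis=1) > threshold

clean = np.concatenate([rng.normal(0, 0.5, 4000),     # 10 frames of "speech"
                        rng.normal(0, 0.005, 4000)])  # 10 frames of near-silence
noisy = clean + rng.normal(0, 0.3, clean.shape)       # add ambient noise

print("clean speech frames:", int(toy_vad(clean).sum()))  # only the speech half
print("noisy speech frames:", int(toy_vad(noisy).sum()))  # silence flagged too
```

On the clean signal only the speech half is flagged; with noise added, every frame crosses the threshold, which is exactly the kind of false-alarm speech that later corrupts the speaker count.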
Evaluating Speaker Diarization Systems
The main metric used for speaker diarization in the business world is the accuracy of identifying the individual speakers, i.e. "who spoke what". Most measures in academia instead capture "who spoke when". The best way to measure speaker diarization improvement is with the time-based confusion error rate (tCER) and the time-based diarization error rate (tDER).
Time-based Confusion Error Rate (tCER) = confusion time / (total reference and model speech time)
Time-based Diarization Error Rate (tDER) = (false alarm time + missed detection time + confusion time) / (total reference and model speech time)
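These two formulas translate directly into code. The functions below are a straightforward sketch of the definitions above (the function names and the example durations are invented for illustration); all arguments are durations in seconds.

```python
def tcer(confusion, total_speech):
    """Time-based Confusion Error Rate: wrongly attributed speech time."""
    return confusion / total_speech

def tder(false_alarm, missed, confusion, total_speech):
    """Time-based Diarization Error Rate: all error time, not just confusion."""
    return (false_alarm + missed + confusion) / total_speech

# Example: 60 s of scored speech, 3 s attributed to the wrong speaker,
# 2 s of false alarms, 1 s of missed speech.
print(f"tCER = {tcer(3, 60):.3f}")        # tCER = 0.050
print(f"tDER = {tder(2, 1, 3, 60):.3f}")  # tDER = 0.100
```

Note that tDER is always at least as large as tCER, since it adds false-alarm and missed-detection time on top of confusion time.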
Top Speaker Diarization Libraries
- Whisper Diarization
- Pyannote
- Nvidia Nemo (Speaker Diarization API)
- Kaldi
- Hitachi End-to-End Neural Diarization (EEND)
If you are looking for a managed ASR solution with features like diarization, low-latency streaming, WebSocket support, and VAD, take a look at SNR.Audio's Speech to Text solution "Scribe". We offer generous limits to indie developers and early-stage teams building robust voice AI systems.