Speaker Diarization

A brief look into the challenges we encounter when building diarization into ASR pipelines
Kaushik Tiwari
Founder @SNR.Audio
April 6, 2024

Speaker Diarization is a core component in an Automatic Speech Recognition (ASR) pipeline that identifies who spoke in an audio file. It involves two primary tasks:

  • Detecting the number of speakers.
  • Linking each speaker to their corresponding words.

State-of-the-art speaker diarization libraries use advanced deep learning models to perform both tasks, achieving near-human-level performance. This has significantly enhanced the effectiveness of speaker diarization in ASR APIs and elevated its importance across a range of applications.

Steps involved in Speaker Diarization

Speaker diarization is typically split into four major subtasks →

  • Detection → Find regions of audio that contain speech, as opposed to silence or noise.
  • Segmentation → Divide the audio into smaller segments.
  • Representation → Compress the audio information into embeddings, a learned representation of the audio data.
  • Attribution → Annotate each segment with a speaker ID based on its embedding.
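The four subtasks above can be sketched end to end. This is a minimal illustration, not a production pipeline: it assumes a VAD has already produced speech segments, and it substitutes synthetic vectors for the embeddings a real speaker-embedding network (e.g. an x-vector model) would output.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(0)

# Detection + Segmentation: pretend a VAD gave us six speech segments
# as (start, end) times in seconds; silence and noise are already dropped.
segments = [(0.0, 1.2), (1.5, 2.8), (3.0, 4.1), (4.5, 5.9), (6.2, 7.0), (7.3, 8.4)]

# Representation: stand-in embeddings for two synthetic "voices",
# one centred near +1 and one near -1 in a 16-dimensional space.
voice_a = rng.normal(loc=1.0, scale=0.1, size=(3, 16))
voice_b = rng.normal(loc=-1.0, scale=0.1, size=(3, 16))
embeddings = np.vstack(
    [voice_a[0], voice_b[0], voice_a[1], voice_b[1], voice_a[2], voice_b[2]]
)

# Attribution: cluster the embeddings and assign a speaker ID per segment.
labels = AgglomerativeClustering(n_clusters=2).fit_predict(embeddings)
for (start, end), spk in zip(segments, labels):
    print(f"{start:.1f}-{end:.1f}s -> speaker_{spk}")
```

In a real system the number of clusters is usually not known in advance, which is exactly where the clustering challenges discussed below come in.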

Challenges of building Speaker Diarization in ASR

  • Errors at the clustering stage → Under-clustering labels speech with too few speakers, merging distinct voices into one, while over-clustering splits a single speaker's speech across multiple labels.
  • Getting high-quality data → Clean, real-world audio is hard to find, and labelling it for training speaker diarization models is expensive.
  • Multilingual ASR with multi-speaker diarization → The amount of variance present in such audio makes this an even bigger challenge; it is still a state-of-the-art research topic that many labs are working to solve.
  • Ambient noise / crowd noise → Even after preprocessing, noise can interfere with the output, distorting the total number of speakers detected and their assignment to sentences.
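The clustering pitfall above is easy to demonstrate on toy data. The sketch below clusters synthetic embeddings for two true speakers with a deliberately wrong cluster count; the vectors are made up for the demo, and k-means stands in for whatever clustering method a real system uses.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)

# Synthetic 8-dim embeddings for two well-separated true speakers.
spk1 = rng.normal(loc=1.0, scale=0.1, size=(5, 8))
spk2 = rng.normal(loc=-1.0, scale=0.1, size=(5, 8))
X = np.vstack([spk1, spk2])

# Under-clustering: one cluster, so both speakers share a single label.
under = KMeans(n_clusters=1, n_init=10, random_state=0).fit_predict(X)
print("under-clustered:", len(set(under)), "speaker(s) found")  # 1

# Over-clustering: four clusters, so each true speaker is split up.
over = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print("over-clustered: ", len(set(over)), "speaker(s) found")   # 4
```

Either failure mode hurts downstream attribution: merged speakers produce wrong-speaker assignments, and split speakers inflate the speaker count.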

Evaluating Speaker Diarization Systems

In industry, the main measure of speaker diarization quality is accuracy in identifying individual speakers, i.e. "who spoke what", whereas most academic metrics measure "who spoke when". The best way to measure improvement in speaker diarization is the time-based confusion error rate (tCER) and the time-based diarization error rate (tDER).

Time-based Confusion Error Rate (tCER) = confusion time / (total reference and model speech time)

Time-based Diarization Error Rate (tDER) = (false alarm time + missed detection time + confusion time) / (total reference and model speech time)

Top Speaker Diarization Libraries

If you are looking for a managed ASR solution with features like diarization, low-latency streaming with WebSocket support, and VAD, take a look at SNR.Audio's speech-to-text solution "Scribe". We offer generous limits to indie developers and early-stage teams building robust voice AI systems.
