Automatic Speech Recognition

A primer on Automatic Speech Recognition
Kaushik Tiwari
Founder @SNR.Audio
March 25, 2024

Humans are inherently social beings. We excel in the company of our families, collaborate effectively in teams, find profound meaning and purpose in religious gatherings, and navigate economic partnerships and political alliances with ease. Our cultural values, a reflection of our collective living, have a profound impact on our behaviours and norms. For instance, when we engage in conversations, whether in person, on the phone, or through a video call, we naturally communicate better, increase productivity, generate innovative ideas, and create remarkable achievements. This sense of naturalness, this essence of being human, is precisely why we at SNR Audio firmly believe that we are on the brink of something extraordinary. We are not satisfied with keeping this knowledge to ourselves; we strongly believe in transparency and in building trust collaboratively with everyone. That is why this guide contains the essential checklist for evaluating what we consider to be the components of a production-grade voice AI system: how it feels, how it behaves, and how it responds, all while doing so without burning a hole in your pocket.

The Role of ASR in a Real Time Voice AI pipeline

This is the initial stage of interacting with a Voice AI system. Its main task is to accurately transcribe the entire conversation, capturing human elements such as pauses, verbal habits, and variations in speech speed. The system is designed to:

  • Handle delays in speech when the user pauses or is engaged in another activity.
  • Recognise and transcribe emotions expressed in the conversation.
  • Prioritise reliability, accuracy, and speed by providing real-time transcription as soon as the voice data is received.
  • Support integration with various frontends, audio codecs, and sound formats for seamless compatibility.
  • Identify multiple speakers in a conversation and attribute speech to each, a task known as speaker diarization.
  • Integrate with popular telephony platforms such as Twilio and Vonage, typically over transports like WebSockets (a minimal streaming sketch follows this list).
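To make the integration point concrete, here is a minimal sketch of streaming raw audio to an ASR service over a WebSocket and printing transcripts as they arrive. The endpoint URL, message format, and chunking parameters are assumptions for illustration, not any specific vendor's API:

```python
import asyncio
import json

import websockets  # pip install websockets

# Hypothetical endpoint and audio parameters -- replace with your provider's values.
ASR_URL = "wss://example-asr.com/v1/stream?sample_rate=16000&encoding=linear16"
CHUNK_MS = 100  # send 100 ms of 16 kHz, 16-bit mono PCM per message
CHUNK_BYTES = 16000 * 2 * CHUNK_MS // 1000

async def stream_file(path: str) -> None:
    async with websockets.connect(ASR_URL) as ws:
        async def sender():
            with open(path, "rb") as f:
                while chunk := f.read(CHUNK_BYTES):
                    await ws.send(chunk)                  # raw PCM frames
                    await asyncio.sleep(CHUNK_MS / 1000)  # pace like a live call
            await ws.send(json.dumps({"type": "end_of_stream"}))  # assumed close message

        async def receiver():
            async for message in ws:
                result = json.loads(message)  # assumed JSON transcript payload
                print(result.get("transcript", ""))

        await asyncio.gather(sender(), receiver())

asyncio.run(stream_file("call_audio.raw"))
```

Pacing the sender at real-time speed mimics a live call; in production the chunks would come from a telephony stream (for example Twilio Media Streams) rather than a file.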

Top 5 Models in Automatic Speech Recognition

Evaluating ASR Systems

Nowadays, most models in this field are assessed using variations of a classic metric known as Word Error Rate (WER). WER counts the transcription errors a model makes relative to a human transcription, which is treated as the ground truth. In general, a lower WER indicates a better model, all other conditions being equal. However, comparing and evaluating models becomes difficult when there is a discrepancy between what the metric rewards and what we, as transcribers, consider useful: a model evaluated solely on WER can look better on paper than it proves in practice.
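Concretely, WER is the word-level edit distance (substitutions + insertions + deletions) between the hypothesis and the reference, divided by the number of reference words. A minimal, self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (substitutions + insertions + deletions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion / 6 words = 0.1666...
```

Off-the-shelf packages such as jiwer implement the same computation with extra normalisation options.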


With the sanity checks below, we can make sure the evaluation is tight and robust:

  • Evaluate the nouns → proper nouns carry more information than common words (a, and, the). If two models have the same WER, but one omits proper nouns while the other omits common words, we then know which one is better.
  • Evaluate the nouns in relation to the ground truth → suppose two models A and B mistranscribe the same number of proper nouns, but model A's outputs are more similar to the ground-truth nouns than model B's; model A is then preferable.
  • Many of these problems with WER were studied, and the Jaro-Winkler distance was adopted to address WER's coarse understanding of similarity. Jaro-Winkler assigns a cost to each transcription error depending on how similar it is to the ground truth: errors close to the ground truth are assigned a lower cost, whereas errors that are completely off receive a much harsher penalty, giving models a finer-grained signal to improve against (see the sketch after this list).
  • The real-time factor (RTF) is another common metric, measuring the speed of an automatic speech recognition (ASR) system during the decoding phase, i.e. at run time. It also applies in other contexts where audio or video signals are processed at a nearly constant rate: RTF measures the latency of any audio processing system, whether a speech recognition engine, a text-to-speech engine, a transcoding engine, and so on. If it takes 8 hours of computation time to process a 2-hour recording, the real-time factor is 4; when the real-time factor is 1, the processing is done "in real time". The value depends on hardware and network bandwidth, which is important to note if the processing is done as a cloud-based service.
  • Typically, state-of-the-art cloud speech-to-text services from Google, Azure, AWS, etc. achieve RTF values between 0.2 and 0.6. These values depend on factors such as network/internet bandwidth and speech content; for an on-prem ASR, the algorithm and hardware resources (CPU/RAM) are the major factors affecting the real-time factor (a small timing helper is sketched below).
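As referenced above, here is a compact, self-contained sketch of Jaro-Winkler similarity (the "distance" is 1 minus this value). This is the textbook formulation (matches within a sliding window, a transposition count, and Winkler's common-prefix boost), not any particular library's implementation:

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1 means identical strings."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(len(s1), len(s2)) // 2 - 1
    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    # Characters match if equal and within the sliding window of each other.
    for i, c in enumerate(s1):
        lo, hi = max(0, i - window), min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0
    # Count transpositions: matched characters appearing in a different order.
    t, k = 0, 0
    for i in range(len(s1)):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                t += 1
            k += 1
    t //= 2
    return (m / len(s1) + m / len(s2) + (m - t) / m) / 3

def jaro_winkler(s1: str, s2: str, p: float = 0.1) -> float:
    """Jaro similarity boosted by a shared prefix of up to 4 characters."""
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return j + prefix * p * (1 - j)

print(jaro_winkler("paracetamol", "paracetamool"))  # near 1.0 -> low-cost error
print(jaro_winkler("paracetamol", "xylophone"))     # far from 1.0 -> harsh penalty
```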
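The RTF definition above reduces to a small timing helper; `transcribe` here is an assumed placeholder for whatever function runs your ASR:

```python
import time

def real_time_factor(transcribe, audio_path: str, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; RTF < 1 means faster than real time."""
    start = time.perf_counter()
    transcribe(audio_path)  # hypothetical: your ASR call goes here
    return (time.perf_counter() - start) / audio_seconds

# e.g. 8 hours of compute for a 2-hour recording -> RTF = 28800 / 7200 = 4.0
```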

Additional challenges in deploying a scalable ASR system:

Domain Specificity

Our objective is a model that performs at a high level. However, a model's performance is shaped by its training dataset: models that excel in certain use cases may not perform as well in others, particularly in ASR, where factors such as accent, language, and speech variation strongly affect results.

The positive aspect is that you can tailor models to specific applications. In fields like healthcare or finance, for instance, domain-specific vocabulary plays a critical role, and much of it rarely appears in the data general-purpose ASR models are trained on.

To address the need for regional language support in ASR, it is essential to have comprehensive training pipelines that facilitate easy model customisation and enable handling of different dialects.
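How customisation is exposed varies by provider, but a common mechanism is passing a domain vocabulary (sometimes called keyword boosting or phrase hints) along with the recognition request. The payload below is a hypothetical illustration of the shape such a request can take; the field names are invented, so consult your provider's actual schema:

```python
# Hypothetical recognition request illustrating domain customisation.
# All field names here are invented for illustration only.
recognition_request = {
    "language": "en-IN",    # regional language / accent variant
    "model": "healthcare",  # domain-tuned model, if the provider offers one
    "custom_vocabulary": [
        # Boost weights raise the decoder's prior for rare, domain-critical terms.
        {"phrase": "atorvastatin", "boost": 5.0},
        {"phrase": "myocardial infarction", "boost": 4.0},
        {"phrase": "HbA1c", "boost": 5.0},
    ],
}
```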

Monitoring and tracking

Real-time monitoring and tracking are extremely valuable because they offer immediate insights, alerts, and notifications, allowing timely corrective action. They also help track resource usage against incoming traffic, enabling automatic scaling of the application, and quota limits can be established to optimise infrastructure costs without hurting overall throughput. Capturing these statistics means integrating several libraries to measure performance at different stages of the ASR pipeline. For instance, if there is a prolonged pause or excessive noise during a 40-minute call, the system can automatically alert the call-monitoring team, which can then end the call or escalate it to a manager.
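As a sketch of what silence alerting could look like at the audio level (the thresholds are illustrative assumptions and would need tuning per deployment):

```python
import math

SILENCE_RMS = 500        # assumed energy threshold for 16-bit PCM; tune per deployment
MAX_SILENT_CHUNKS = 50   # e.g. 50 chunks x 200 ms = 10 s of continuous silence

def rms(samples: list[int]) -> float:
    """Root-mean-square energy of one PCM chunk."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def monitor_call(chunks, alert) -> None:
    """Scan successive audio chunks and fire `alert` (webhook, pager, log)
    once silence persists past the threshold."""
    silent = 0
    for chunk in chunks:
        silent = silent + 1 if rms(chunk) < SILENCE_RMS else 0
        if silent == MAX_SILENT_CHUNKS:
            alert("prolonged silence detected on call")
            silent = 0  # reset to avoid re-alerting on every subsequent chunk
```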

Keeping all these important features in mind, we at SNR Audio have built our streaming solution, Scribe ASR, which helps developers reduce their costs by up to 50% at scale compared to Deepgram, AssemblyAI, and AWS.
