Latency
Text-to-Speech (TTS) has become a crucial tool for applications such as virtual assistants, automated customer service, real-time translation, and inbound and outbound calls. A robust TTS service is often the difference between a good-quality Voice AI system and a poor one, because it is the service that ultimately interacts with the user. Many factors contribute to good TTS quality, but probably the most important of them is latency.
Latency in the context of TTS
Latency refers to the time delay between the input of text and the corresponding audio output. It encompasses the time taken for the system to process the input text, generate the appropriate audio response, and deliver it to the user. The main stages are:
- Input Processing: The time taken by the system to interpret and preprocess the input text, including tasks such as text normalization, language identification, and tokenization.
- Speech Synthesis: The process of generating synthetic speech from the input text, which involves linguistic analysis, acoustic modeling, and waveform generation. This is the most computationally intensive part of the TTS pipeline and can contribute significantly to the overall latency.
- Audio Playback: The time required to render and deliver the generated audio to the user, which can be affected by factors such as network conditions, device capabilities, and audio buffer size.
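The stages above can be timed separately. A minimal sketch of measuring both time-to-first-audio (what the user actually perceives) and total synthesis time, where `synthesize_stream` is a stand-in for a real streaming TTS engine, not an actual API:

```python
import time

def synthesize_stream(text):
    """Stand-in for a streaming TTS engine: yields audio chunks.
    A real engine would yield PCM/Opus frames as they are generated."""
    for word in text.split():
        time.sleep(0.01)          # simulated per-chunk synthesis cost
        yield b"\x00" * 320       # fake 20 ms of 8 kHz 16-bit audio

def measure_latency(text):
    start = time.perf_counter()
    first_chunk_at = None
    for chunk in synthesize_stream(text):
        if first_chunk_at is None:
            # time-to-first-audio: the delay the user actually notices
            first_chunk_at = time.perf_counter() - start
    total = time.perf_counter() - start
    return first_chunk_at, total

ttfa, total = measure_latency("hello world this is a latency test")
print(f"time to first audio: {ttfa*1000:.1f} ms, full synthesis: {total*1000:.1f} ms")
```

With a streaming engine, playback can begin as soon as the first chunk arrives, so time-to-first-audio matters far more than total synthesis time.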
In short, high-latency systems can lead to delayed audio responses, causing frustration and confusion among users. Low latency, on the other hand, ensures a seamless and responsive experience, enabling users to interact with the TTS system in real time.
How to reduce latency in speech synthesis applications?
The primary factors influencing latency are:
- Hardware and software efficiency
- Signal travel distance
- Network traffic
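One way to reason about these factors is as an additive latency budget. A rough back-of-the-envelope sketch, where all the numbers are illustrative assumptions rather than benchmarks:

```python
def latency_budget_ms(processing_ms, synthesis_ms, rtt_ms, playback_buffer_ms):
    """Rough end-to-end latency estimate for a server-based TTS call.

    processing_ms      -- text normalization / tokenization
    synthesis_ms       -- time until the first audio chunk is generated
    rtt_ms             -- network round trip (signal travel distance + traffic)
    playback_buffer_ms -- audio buffered on the client before playback starts
    """
    return processing_ms + synthesis_ms + rtt_ms + playback_buffer_ms

# Illustrative numbers only: a nearby region vs. a distant, congested one.
near = latency_budget_ms(5, 120, 30, 40)    # 195 ms
far  = latency_budget_ms(5, 120, 250, 40)   # 415 ms
print(near, far)
```

The split makes it clear that once the model is fast, the network round trip can dominate the budget, which motivates the placement decisions discussed below.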
Optimizing software is just as crucial as selecting robust hardware. If your solution runs on premises, choosing a low-level, low-resource text-to-speech engine becomes very important.
To minimize signal travel distance, it's essential to position voice data near compute resources, or vice versa. For large speech models, server-based cloud infrastructure is typically the only viable option, as smaller hardware often lacks the processing power to run them. Yet relying solely on server-based solutions excludes devices like mobile phones, desktop computers, and microcontrollers that generate voice data. Therefore, it's necessary to bring computing capabilities closer to the data source; this approach is referred to as on-device inference.

If building systems with low latency is crucial for your voice product, feel free to try SNR Audio's Text-to-Speech API, which we are building with the needs of a real-time Voice AI system in mind: low latency, high throughput, multiple speakers, and a cost of only about 10 percent of market leaders like ElevenLabs and PlayHT.
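The trade-off between server-based inference and on-device inference can be sketched as a simple routing rule. The threshold, backend names, and decision logic here are illustrative assumptions, not a prescribed architecture:

```python
def choose_backend(rtt_ms, device_can_run_model, max_network_overhead_ms=150):
    """Pick where to run synthesis for a latency-sensitive voice app.

    rtt_ms               -- measured network round trip to the TTS server
    device_can_run_model -- whether the device has enough compute to run
                            a small TTS model locally
    """
    if device_can_run_model and rtt_ms > max_network_overhead_ms:
        # Network cost dominates: run inference where the voice data lives.
        return "on-device"
    # Otherwise the larger server-side model is worth the round trip.
    return "server"

print(choose_backend(rtt_ms=250, device_can_run_model=True))   # on-device
print(choose_backend(rtt_ms=40, device_can_run_model=True))    # server
print(choose_backend(rtt_ms=250, device_can_run_model=False))  # server
```

A hybrid setup like this keeps the seamless real-time experience on poor networks while still using the higher-quality server model when the round trip is cheap.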