Primer on TTS
Kaushik Tiwari
Founder @SNR.Audio
April 18, 2024

In the previous two posts we looked at how the voice (ASR) reaches the brain (LLM) and how the brain responds to it: the risks of each, how to build a system on your own, what to watch out for, and how to evaluate their performance individually. Now we will try to understand how the brain can respond to a user in almost real time.

Text to Speech

Text to speech is the task of generating natural-sounding speech from a text input. Text-to-speech models can be extended so that a single model generates speech for multiple speakers and in multiple languages. To start out, it is best to begin with English, simply because it is easier to find and curate high-quality data in this language, though initiatives like Common Voice from Mozilla are lowering the bar for high-quality datasets even further.

Why is TTS so difficult to solve?

Any language, when converted to text, contains words that are spelled the same but change pronunciation when said out loud. To understand this, we will do a small activity right now: all you have to do is read these sentences out loud.

  • Do you live /l ih v/ near a zoo with live /l ay v/ animals?
  • I prefer bass /b ae s/ fishing to playing the bass /b ey s/ guitar.
  • It's no use /y uw s/ if you ask me to use /y uw z/ the telephone.

This highlights a significant challenge in building a natural text-to-speech system, i.e. text normalisation.
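
To make the homograph problem concrete, here is a toy sketch of how a TTS front end might pick a pronunciation. The tiny (word, part-of-speech) lexicon and the "guess the part of speech from the previous word" heuristic are invented for illustration; a real system would use a full POS tagger and a pronunciation dictionary such as CMUdict.

```python
# Toy homograph disambiguation: a tiny (word, part-of-speech) -> ARPAbet
# lexicon, with the part of speech guessed naively from the preceding word.
# Both the lexicon and the heuristic are illustrative, not a real G2P system.
HOMOGRAPHS = {
    ("live", "VERB"): "l ih v",
    ("live", "ADJ"):  "l ay v",
    ("use",  "NOUN"): "y uw s",
    ("use",  "VERB"): "y uw z",
}

def pronounce(sentence: str) -> list[str]:
    """Replace known homographs with a phoneme string; pass other words through."""
    words = sentence.lower().split()
    out = []
    for i, w in enumerate(words):
        prev = words[i - 1] if i > 0 else ""
        if any((w, pos) in HOMOGRAPHS for pos in ("VERB", "ADJ", "NOUN")):
            if prev in {"i", "you", "we", "they", "to"}:
                pos = "VERB"   # pronouns and "to" usually precede a verb form
            elif (w, "ADJ") in HOMOGRAPHS:
                pos = "ADJ"
            else:
                pos = "NOUN"
            out.append(HOMOGRAPHS[(w, pos)])
        else:
            out.append(w)
    return out
```

Running `pronounce("Do you live near a zoo with live animals")` picks /l ih v/ for the first "live" and /l ay v/ for the second, matching the sentences above.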

TTS systems require special preprocessing to handle non-standard words:

  • Numbers
  • Monetary Amounts
  • Abbreviations
  • Dates
  • Acronyms

Modern end-to-end TTS systems can learn to do some normalisation themselves; however, due to the limited amount of training data, a separate normalisation step is needed. There are two common approaches:

  1. Rules (e.g. regex)
  2. A seq2seq model (requires a bit more post-processing)
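
A minimal sketch of the first approach (rules via regex), covering a few of the non-standard word classes listed above. The rules and word lists here are invented and deliberately naive; a production normaliser needs many more rules for ordinals, dates, locale-aware currencies, acronyms, and so on.

```python
import re

# Minimal rule-based text normalisation: expand abbreviations, speak currency
# amounts, and read remaining digit strings digit by digit. Illustrative only.
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

ABBREVIATIONS = {"dr.": "doctor", "st.": "saint", "etc.": "et cetera"}

def spell_digits(num: str) -> str:
    """'42' -> 'four two' (digit-by-digit reading)."""
    return " ".join(ONES[int(d)] for d in num)

def normalise(text: str) -> str:
    # "$42" -> "42 dollars" (naive: no cents, no thousands grouping)
    text = re.sub(r"\$(\d+)", lambda m: m.group(1) + " dollars", text)
    # expand a few known abbreviations, case-insensitively
    text = re.sub(r"\b(?:Dr|St|etc)\.",
                  lambda m: ABBREVIATIONS[m.group(0).lower()],
                  text, flags=re.IGNORECASE)
    # any remaining digit string is read digit by digit
    text = re.sub(r"\d+", lambda m: spell_digits(m.group(0)), text)
    return text
```

For example, `normalise("Dr. Smith paid $42")` yields `"doctor Smith paid four two dollars"`, which a TTS model can pronounce directly.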

Even for a single ground-truth text, many different speech spectrograms or audio waveforms can be correct. The model has to learn how to generate the right duration and timing for each phoneme, word, or sentence, which can be challenging, especially for long and complex sentences. Sentence meaning often hinges on the context of surrounding words, particularly along the temporal dimension. For text-to-speech models to produce natural and coherent speech, capturing and retaining contextual information across extended sequences becomes crucial.
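
One common way to tackle the duration problem (used in non-autoregressive models such as FastSpeech) is a "length regulator": a duration predictor assigns each phoneme a number of spectrogram frames, and the phoneme encodings are repeated accordingly to reach frame rate. A minimal sketch, with made-up durations:

```python
import numpy as np

# FastSpeech-style length regulator: repeat each phoneme's encoding for its
# predicted number of spectrogram frames, turning a phoneme-rate sequence
# into a frame-rate one. The durations below are invented for illustration.

def length_regulate(phoneme_encodings: np.ndarray,
                    durations: np.ndarray) -> np.ndarray:
    """Repeat encoding i `durations[i]` times along the time axis."""
    return np.repeat(phoneme_encodings, durations, axis=0)

enc = np.arange(12, dtype=float).reshape(3, 4)   # 3 phonemes, 4-dim encodings
frames = length_regulate(enc, np.array([2, 3, 1]))  # predicted frame counts
# frames has shape (6, 4): 2 + 3 + 1 spectrogram frames
```

In a real model the durations come from a learned predictor (trained on forced alignments), which is exactly where the "many valid timings for one text" ambiguity is absorbed.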

Top 5 Text to Speech Models

  • XTTS by Coqui → XTTS is a voice generation model that lets you clone voices into different languages using just a quick 6-second audio clip. There is no need for an excessive amount of training data spanning countless hours.
  • Bark by Suno → Bark can generate highly realistic, multilingual speech as well as other audio, including background noise and sound effects. The model is also capable of generating emotional, non-verbal audio such as laughing, crying, gasps, and sighs.
  • MetaVoice-1B v0.1 → a text-to-speech model trained on 100k+ hours of speech. It has been built to produce emotionally expressive rhythm and tone in English, to reduce hallucinations, and to support long-form speech synthesis.
  • MyShell.ai OpenVoice → the main features of this speech synthesis model include accurate voice cloning, flexible voice-style control, and zero-shot cross-lingual voice cloning.
  • Meta Multilingual Speech Streaming → a well-known streaming speech synthesis model from Meta; it supports simultaneous TTS in over 36 languages with built-in support for streaming data.

Text-to-Speech Evaluation: Human Bias and Mitigating Strategies

One of the major challenges in evaluating, scaling, and managing Text-to-Speech (TTS) models within complex voice AI systems lies in the inherent subjectivity of human opinion. Unlike Automatic Speech Recognition (ASR) models and Large Language Models (LLMs), where metrics can be more objective, TTS evaluation heavily relies on human perception, introducing potential bias and variability.

While subjectivity poses a significant challenge, a well-designed and controlled evaluation process can significantly mitigate these risks. It is crucial to acknowledge that:

  • Human perception of speech varies: Pronunciation preferences can differ based on accent, dialect, and individual listener perception. What sounds pleasant in British English might not be well-received in other regions like Uganda, India, or France.
  • Mean Opinion Scores (MOS) serve as a useful tool for gauging perceived quality, but their subjective nature necessitates careful interpretation and analysis.
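
Because MOS ratings are noisy, it helps to report them with a confidence interval rather than a bare mean, and to treat two systems as different only when their intervals are clearly separated. A minimal sketch (the ratings below are invented):

```python
import math
import statistics

# Aggregate Mean Opinion Scores (1-5 scale) with a normal-approximation 95%
# confidence interval. With few raters a t-distribution would be more
# appropriate; 1.96 is used here for simplicity.

def mos_with_ci(ratings: list[float]) -> tuple[float, float]:
    """Return (mean MOS, half-width of the 95% confidence interval)."""
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * sem

system_a = [4, 5, 4, 4, 3, 5, 4, 4]  # invented ratings from 8 listeners
mos, ci = mos_with_ci(system_a)      # report as e.g. "MOS 4.13 ± 0.44"
```

The interval shrinks with more raters, which is one quantitative argument for the diverse, larger evaluation teams suggested below.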

Here are some strategies to enhance the objectivity and effectiveness of your TTS evaluation process:

1. Diverse Evaluation Team: Assemble a team of evaluators with varied backgrounds, accents, and demographics to capture a broader range of perspectives.

2. Standardized Criteria: Establish clear and objective criteria for evaluation, focusing on elements like naturalness, clarity, and prosody.

3. Contextualized Evaluation: Conduct evaluations in contexts relevant to the intended use case of the TTS system. For example, evaluate educational content differently from customer service interactions.

4. Iterative Approach: Continuously iterate your evaluation process based on insights gained from each round of testing.

5. Explore Objective Metrics: While human evaluation remains crucial, investigate complementary objective metrics like pitch, intonation, and speech rate to offer additional data points.
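
As a sketch of one such objective metric, here is a pitch (F0) estimate via autocorrelation. It is tested on a synthetic 220 Hz tone; real speech would first be split into frames, windowed, and gated for voiced/unvoiced regions, and production systems use more robust estimators.

```python
import numpy as np

# Estimate the fundamental frequency of a signal by finding the lag with the
# strongest autocorrelation inside a plausible pitch range. Illustrative only.

def estimate_pitch(signal: np.ndarray, sample_rate: int,
                   fmin: float = 50.0, fmax: float = 500.0) -> float:
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))  # strongest periodicity in range
    return sample_rate / lag

sr = 16000
t = np.arange(4000) / sr                  # 0.25 s of audio
tone = np.sin(2 * np.pi * 220.0 * t)      # synthetic 220 Hz "voice"
f0 = estimate_pitch(tone, sr)             # close to 220 Hz
```

Tracking such numbers per release gives you regression alerts (e.g. a sudden pitch or speech-rate drift) that are cheaper and more repeatable than a fresh MOS round.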

By implementing these strategies, you can mitigate the impact of human bias and create a more robust and effective evaluation process for your Text-to-Speech models, ultimately leading to improved voice AI systems. Or you can skip all of this by using a platform that has already done it for you, so that you can focus on what you are good at: helping your customers.
