Primer on TTS
Kaushik Tiwari
Founder @SNR.Audio
April 18, 2024

In the previous two posts we looked at how the voice (ASR) reaches the brain (LLM) and how the brain responds to it: the risks of each, how to build a system on your own, what to watch out for, and how to evaluate their performance individually. Now we will try to understand how the brain can respond to a user in almost real time.

Text to Speech

Text to speech is the task of generating natural-sounding speech from a text input. Text-to-speech models can be extended so that a single model generates speech for multiple speakers and in multiple languages. To start out, it is best to begin with English, simply because it is easier to find and curate high-quality data in this language, though initiatives like Common Voice from Mozilla are lowering the bar for high-quality datasets even further.

Why is TTS so difficult to solve?

Any language, when converted to text, contains words that are spelled the same but change pronunciation when said out loud. To understand this, we will do a small activity right now: all you have to do is read these sentences out loud.

  • Do you live /l ih v/ near a zoo with live /l ay v/ animals?
  • I prefer bass /b ae s/ fishing to playing the bass /b ey s/ guitar.
  • It's no use /y uw s/ if you ask me to use /y uw z/ the telephone.

This highlights a significant challenge in building a natural text-to-speech system, i.e. text normalisation.
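
To make the homograph problem concrete, here is a toy sketch of how a TTS front end might pick a pronunciation. The tiny (word, part-of-speech) lexicon and the "guess the part of speech from the previous word" heuristic are invented for illustration; a real system would use a full POS tagger and a pronunciation dictionary such as CMUdict.

```python
# Toy homograph disambiguation: a tiny (word, part-of-speech) -> ARPAbet
# lexicon, with the part of speech guessed naively from the preceding word.
# Both the lexicon and the heuristic are illustrative, not a real G2P system.
HOMOGRAPHS = {
    ("live", "VERB"): "l ih v",
    ("live", "ADJ"):  "l ay v",
    ("use",  "NOUN"): "y uw s",
    ("use",  "VERB"): "y uw z",
}

def pronounce(sentence: str) -> list[str]:
    """Replace known homographs with a phoneme string; pass other words through."""
    words = sentence.lower().split()
    out = []
    for i, w in enumerate(words):
        prev = words[i - 1] if i > 0 else ""
        if any((w, pos) in HOMOGRAPHS for pos in ("VERB", "ADJ", "NOUN")):
            if prev in {"i", "you", "we", "they", "to"}:
                pos = "VERB"   # pronouns and "to" usually precede a verb form
            elif (w, "ADJ") in HOMOGRAPHS:
                pos = "ADJ"
            else:
                pos = "NOUN"
            out.append(HOMOGRAPHS[(w, pos)])
        else:
            out.append(w)
    return out
```

Running `pronounce("Do you live near a zoo with live animals")` picks /l ih v/ for the first "live" and /l ay v/ for the second, matching the sentences above.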

TTS systems require special preprocessing to handle non-standard words:

  • Numbers
  • Monetary Amounts
  • Abbreviations
  • Dates
  • Acronyms

Modern end-to-end TTS systems can learn to do some normalisation themselves; however, due to the limited amount of training data, a separate normalisation step is needed. There are two common approaches:

  1. Rules (e.g. regex)
  2. A seq2seq model (requires a bit more post-processing)
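
A minimal sketch of the first approach (rules via regex), covering a few of the non-standard word classes listed above. The rules and word lists here are invented and deliberately naive; a production normaliser needs many more rules for ordinals, dates, locale-aware currencies, acronyms, and so on.

```python
import re

# Minimal rule-based text normalisation: expand abbreviations, speak currency
# amounts, and read remaining digit strings digit by digit. Illustrative only.
ONES = ["zero", "one", "two", "three", "four",
        "five", "six", "seven", "eight", "nine"]

ABBREVIATIONS = {"dr.": "doctor", "st.": "saint", "etc.": "et cetera"}

def spell_digits(num: str) -> str:
    """'42' -> 'four two' (digit-by-digit reading)."""
    return " ".join(ONES[int(d)] for d in num)

def normalise(text: str) -> str:
    # "$42" -> "42 dollars" (naive: no cents, no thousands grouping)
    text = re.sub(r"\$(\d+)", lambda m: m.group(1) + " dollars", text)
    # expand a few known abbreviations, case-insensitively
    text = re.sub(r"\b(?:Dr|St|etc)\.",
                  lambda m: ABBREVIATIONS[m.group(0).lower()],
                  text, flags=re.IGNORECASE)
    # any remaining digit string is read digit by digit
    text = re.sub(r"\d+", lambda m: spell_digits(m.group(0)), text)
    return text
```

For example, `normalise("Dr. Smith paid $42")` yields `"doctor Smith paid four two dollars"`, which a TTS model can pronounce directly.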

Even for a single ground-truth text, many different speech spectrograms or audio waveforms can be correct. The model has to learn how to generate the right duration and timing for each phoneme, word, or sentence, which can be challenging, especially for long and complex sentences. Sentence meaning often hinges on the context of surrounding words, particularly along the temporal dimension. For text-to-speech models to produce natural and coherent speech, capturing and retaining contextual information across extended sequences becomes crucial.
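
One common way to tackle the duration problem (used in non-autoregressive models such as FastSpeech) is a "length regulator": a duration predictor assigns each phoneme a number of spectrogram frames, and the phoneme encodings are repeated accordingly to reach frame rate. A minimal sketch, with made-up durations:

```python
import numpy as np

# FastSpeech-style length regulator: repeat each phoneme's encoding for its
# predicted number of spectrogram frames, turning a phoneme-rate sequence
# into a frame-rate one. The durations below are invented for illustration.

def length_regulate(phoneme_encodings: np.ndarray,
                    durations: np.ndarray) -> np.ndarray:
    """Repeat encoding i `durations[i]` times along the time axis."""
    return np.repeat(phoneme_encodings, durations, axis=0)

enc = np.arange(12, dtype=float).reshape(3, 4)   # 3 phonemes, 4-dim encodings
frames = length_regulate(enc, np.array([2, 3, 1]))  # predicted frame counts
# frames has shape (6, 4): 2 + 3 + 1 spectrogram frames
```

In a real model the durations come from a learned predictor (trained on forced alignments), which is exactly where the "many valid timings for one text" ambiguity is absorbed.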

Top 5 Text to Speech Models

  • XTTS by Coqui → XTTS is a voice generation model that lets you clone voices into different languages using just a quick 6-second audio clip. There is no need for an excessive amount of training data spanning countless hours.
  • Bark by Suno → Bark can generate highly realistic, multilingual speech as well as other audio, including background noise and sound effects. The model is also capable of generating emotional, non-verbal audio such as laughing, crying, gasps, and sighs.
  • MetaVoice-1B v0.1 → a text-to-speech model trained on 100k+ hours of speech. It has been built to produce emotionally expressive rhythm and tone in English, to reduce hallucinations, and to support long-form speech synthesis.
  • MyShell.ai OpenVoice → the main features of this speech synthesis model include accurate voice cloning, flexible voice-style control, and zero-shot cross-lingual voice cloning.
  • Meta Multilingual Speech Streaming → a well-known streaming speech synthesis model from Meta; it supports simultaneous TTS in over 36 languages with built-in support for streaming data.

Text-to-Speech Evaluation: Human Bias and Mitigating Strategies

One of the major challenges in evaluating, scaling, and managing Text-to-Speech (TTS) models within complex voice AI systems lies in the inherent subjectivity of human opinion. Unlike Automatic Speech Recognition (ASR) models and Large Language Models (LLMs), where metrics can be more objective, TTS evaluation heavily relies on human perception, introducing potential bias and variability.

While subjectivity poses a significant challenge, a well-designed and controlled evaluation process can significantly mitigate these risks. It is crucial to acknowledge that:

  • Human perception of speech varies: Pronunciation preferences can differ based on accent, dialect, and individual listener perception. What sounds pleasant in British English might not be well-received in other regions like Uganda, India, or France.
  • Mean Opinion Scores (MOS) serve as a useful tool for gauging perceived quality, but their subjective nature necessitates careful interpretation and analysis.
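
Because MOS ratings are noisy, it helps to report them with a confidence interval rather than a bare mean, and to treat two systems as different only when their intervals are clearly separated. A minimal sketch (the ratings below are invented):

```python
import math
import statistics

# Aggregate Mean Opinion Scores (1-5 scale) with a normal-approximation 95%
# confidence interval. With few raters a t-distribution would be more
# appropriate; 1.96 is used here for simplicity.

def mos_with_ci(ratings: list[float]) -> tuple[float, float]:
    """Return (mean MOS, half-width of the 95% confidence interval)."""
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, 1.96 * sem

system_a = [4, 5, 4, 4, 3, 5, 4, 4]  # invented ratings from 8 listeners
mos, ci = mos_with_ci(system_a)      # report as e.g. "MOS 4.13 ± 0.44"
```

The interval shrinks with more raters, which is one quantitative argument for the diverse, larger evaluation teams suggested below.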

Here are some strategies to enhance the objectivity and effectiveness of your TTS evaluation process:

1. Diverse Evaluation Team: Assemble a team of evaluators with varied backgrounds, accents, and demographics to capture a broader range of perspectives.

2. Standardized Criteria: Establish clear and objective criteria for evaluation, focusing on elements like naturalness, clarity, and prosody.

3. Contextualized Evaluation: Conduct evaluations in contexts relevant to the intended use case of the TTS system. For example, evaluate educational content differently from customer service interactions.

4. Iterative Approach: Continuously iterate your evaluation process based on insights gained from each round of testing.

5. Explore Objective Metrics: While human evaluation remains crucial, investigate complementary objective metrics like pitch, intonation, and speech rate to offer additional data points.
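
As a sketch of one such objective metric, here is a pitch (F0) estimate via autocorrelation. It is tested on a synthetic 220 Hz tone; real speech would first be split into frames, windowed, and gated for voiced/unvoiced regions, and production systems use more robust estimators.

```python
import numpy as np

# Estimate the fundamental frequency of a signal by finding the lag with the
# strongest autocorrelation inside a plausible pitch range. Illustrative only.

def estimate_pitch(signal: np.ndarray, sample_rate: int,
                   fmin: float = 50.0, fmax: float = 500.0) -> float:
    ac = np.correlate(signal, signal, mode="full")[len(signal) - 1:]
    lo, hi = int(sample_rate / fmax), int(sample_rate / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))  # strongest periodicity in range
    return sample_rate / lag

sr = 16000
t = np.arange(4000) / sr                  # 0.25 s of audio
tone = np.sin(2 * np.pi * 220.0 * t)      # synthetic 220 Hz "voice"
f0 = estimate_pitch(tone, sr)             # close to 220 Hz
```

Tracking such numbers per release gives you regression alerts (e.g. a sudden pitch or speech-rate drift) that are cheaper and more repeatable than a fresh MOS round.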

By implementing these strategies, you can mitigate the impact of human bias and create a more robust and effective evaluation process for your Text-to-Speech models, ultimately leading to improved voice AI systems. Or you can skip all of this by using a platform that has already done it for you, so that you can focus on what you are good at: helping your customers.
