Text is All You Need: Personalizing ASR Models using Controllable Speech
Synthesis
- URL: http://arxiv.org/abs/2303.14885v1
- Date: Mon, 27 Mar 2023 02:50:02 GMT
- Title: Text is All You Need: Personalizing ASR Models using Controllable Speech
Synthesis
- Authors: Karren Yang, Ting-Yao Hu, Jen-Hao Rick Chang, Hema Swetha Koppula,
Oncel Tuzel
- Abstract summary: Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data.
Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis.
- Score: 17.172909510518814
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Adapting generic speech recognition models to specific individuals is a
challenging problem due to the scarcity of personalized data. Recent works have
proposed boosting the amount of training data using personalized text-to-speech
synthesis. Here, we ask two fundamental questions about this strategy: when is
synthetic data effective for personalization, and why is it effective in those
cases? To address the first question, we adapt a state-of-the-art automatic
speech recognition (ASR) model to target speakers from four benchmark datasets
representative of different speaker types. We show that ASR personalization
with synthetic data is effective in all cases, but particularly when (i) the
target speaker is underrepresented in the global data, and (ii) the capacity of
the global model is limited. To address the second question of why personalized
synthetic data is effective, we use controllable speech synthesis to generate
speech with varied styles and content. Surprisingly, we find that the text
content of the synthetic data, rather than style, is important for speaker
adaptation. These results lead us to propose a data selection strategy for ASR
personalization based on speech content.
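As a rough illustration of this strategy, the sketch below selects adaptation texts by content and fine-tunes a generic ASR model on TTS-synthesized speech for those texts. The `tts.synthesize` call, the `asr_model` interface, and the word-error selection heuristic are hypothetical placeholders, one plausible reading of "selection based on speech content" rather than the paper's actual implementation.
```python
# Illustrative sketch only: `tts` and `asr_model` are hypothetical stand-ins.
import torch

def select_texts(candidates, error_words, k=100):
    """Content-based selection heuristic (assumed, not the paper's exact rule):
    prefer texts containing words the baseline ASR misrecognizes."""
    scored = sorted(
        candidates,
        key=lambda t: sum(w in error_words for w in t.lower().split()),
        reverse=True,
    )
    return scored[:k]

def personalize(asr_model, tts, texts, speaker_emb, lr=1e-5, epochs=3):
    """Fine-tune a generic ASR model on synthetic speech for selected texts."""
    opt = torch.optim.AdamW(asr_model.parameters(), lr=lr)
    asr_model.train()
    for _ in range(epochs):
        for text in texts:
            audio = tts.synthesize(text, speaker_emb)  # hypothetical TTS call
            loss = asr_model.loss(audio, text)         # e.g. CTC/seq2seq loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return asr_model
```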
Related papers
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Enhancing Synthetic Training Data for Speech Commands: From ASR-Based Filtering to Domain Adaptation in SSL Latent Space [10.875499903992782]
We conduct a set of experiments on zero-shot learning with synthetic speech data for the task of speech command classification.
Our results on the Google Speech Commands dataset show that a simple ASR-based filtering method can have a big impact on the quality of the generated data.
Despite the good quality of the generated speech data, we also show that synthetic and real speech remain easily distinguishable when using self-supervised (WavLM) features (a sketch of the filtering step follows this entry).
arXiv Detail & Related papers (2024-09-19T13:07:55Z)
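A minimal sketch of the ASR-based filtering idea above, assuming hypothetical `tts` and `asr` callables rather than the paper's exact components: synthesize several variants per command, transcribe each back with an off-the-shelf ASR model, and keep only utterances whose transcription matches the intended text.
```python
def filter_synthetic(commands, tts, asr, n_per_command=50):
    """Keep a synthetic utterance only if the ASR model transcribes it
    back to the intended command; `tts` and `asr` are placeholders."""
    kept = []
    for cmd in commands:
        for _ in range(n_per_command):
            audio = tts(cmd)              # synthesize one variant
            hypothesis = asr(audio)       # transcribe it back
            if hypothesis.strip().lower() == cmd.strip().lower():
                kept.append((audio, cmd))  # retain only faithful samples
    return kept
```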
- Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition [48.527630771422935]
We propose a synthetic data generation pipeline for multi-speaker conversational ASR.
We evaluate the pipeline by fine-tuning the Whisper ASR model for telephone and distant conversational speech settings (a fine-tuning sketch follows this entry).
arXiv Detail & Related papers (2024-08-17T14:47:05Z)
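A minimal sketch of the Whisper fine-tuning step mentioned above, using the Hugging Face transformers API; the synthetic (audio, transcript) pairs are assumed to come from an LLM-plus-TTS pipeline like the one the paper describes, and the model size and hyperparameters are illustrative.
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def train_step(audio_array, transcript):
    # audio_array: 16 kHz mono waveform produced by the synthetic pipeline
    features = processor(
        audio_array, sampling_rate=16000, return_tensors="pt"
    ).input_features
    labels = processor.tokenizer(transcript, return_tensors="pt").input_ids
    loss = model(input_features=features, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```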
- Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks [66.78640306687227]
To protect privacy and meet legal regulations, federated learning (FL) has gained significant attention for training speech-to-text (S2T) systems.
The commonly used FL approach (i.e., FedAvg) in S2T tasks typically suffers from extensive communication overhead.
We propose a personalized federated S2T framework that introduces FedLoRA, a lightweight LoRA module for client-side tuning and interaction with the server, and FedMem, a global model equipped with a $k$-nearest-neighbor ($k$NN) classifier (a generic LoRA sketch follows this entry).
arXiv Detail & Related papers (2024-01-18T15:39:38Z)
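Since the communication savings in the entry above come from exchanging only low-rank factors, here is a generic LoRA linear layer in the standard W + (alpha/r)·BA formulation; this illustrates the general technique, not the paper's FedLoRA code.
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update:
    y = Wx + (alpha / r) * B A x. Only A and B need to be communicated."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the base weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Base output plus the scaled low-rank correction
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```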
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated by experiments on the VoxCeleb and SITW datasets, with 9.56% and 8.24% average reductions in EER and minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition [42.09340937787435]
We investigated the representation ability of different speech self-supervised pre-trained models.
We employed a powerful large language model (LLM), GPT-4, and emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech.
arXiv Detail & Related papers (2023-09-19T03:52:01Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer of out-of-domain (OOD) custom voices.
GenerSpeech decomposes speech variation into style-agnostic and style-specific parts by introducing two dedicated components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
- Residual-guided Personalized Speech Synthesis based on Face Image [14.690030837311376]
Previous works derive personalized speech features by training the model on a large dataset of the target speaker's recordings.
In this work, we instead extract personalized speech features from human faces and synthesize personalized speech using a neural vocoder.
arXiv Detail & Related papers (2022-04-01T15:27:14Z)
- Data-augmented cross-lingual synthesis in a teacher-student framework [3.2548794659022398]
Cross-lingual synthesis is the task of generating fluent synthetic speech in another language in a given speaker's voice.
Previous research shows that many models appear to have insufficient generalization capabilities.
We propose to apply the teacher-student paradigm to cross-lingual synthesis (a generic distillation sketch follows this entry).
arXiv Detail & Related papers (2022-03-31T20:01:32Z)
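A generic sketch of the teacher-student paradigm for an acoustic model, assuming placeholder `student` and `teacher` modules that map text batches to mel-spectrograms; this shows the general distillation recipe, not the paper's specific pipeline.
```python
import torch
import torch.nn.functional as F

def distill_step(student, teacher, text_batch, optimizer):
    """One distillation step: the student learns to match the frozen
    teacher's mel-spectrogram prediction; both models are placeholders."""
    with torch.no_grad():
        target_mel = teacher(text_batch)    # teacher output, no gradients
    pred_mel = student(text_batch)
    loss = F.l1_loss(pred_mel, target_mel)  # match the teacher's mel output
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```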