Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis
- URL: http://arxiv.org/abs/2303.14885v1
- Date: Mon, 27 Mar 2023 02:50:02 GMT
- Title: Text is All You Need: Personalizing ASR Models using Controllable Speech Synthesis
- Authors: Karren Yang, Ting-Yao Hu, Jen-Hao Rick Chang, Hema Swetha Koppula, Oncel Tuzel
- Abstract summary: Adapting generic speech recognition models to specific individuals is a challenging problem due to the scarcity of personalized data.
Recent works have proposed boosting the amount of training data using personalized text-to-speech synthesis.
- Score: 17.172909510518814
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Adapting generic speech recognition models to specific individuals is a
challenging problem due to the scarcity of personalized data. Recent works have
proposed boosting the amount of training data using personalized text-to-speech
synthesis. Here, we ask two fundamental questions about this strategy: when is
synthetic data effective for personalization, and why is it effective in those
cases? To address the first question, we adapt a state-of-the-art automatic
speech recognition (ASR) model to target speakers from four benchmark datasets
representative of different speaker types. We show that ASR personalization
with synthetic data is effective in all cases, but particularly when (i) the
target speaker is underrepresented in the global data, and (ii) the capacity of
the global model is limited. To address the second question of why personalized
synthetic data is effective, we use controllable speech synthesis to generate
speech with varied styles and content. Surprisingly, we find that the text
content of the synthetic data, rather than style, is important for speaker
adaptation. These results lead us to propose a data selection strategy for ASR
personalization based on speech content.
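
To make this concrete, the sketch below illustrates the recipe the abstract describes: select text whose content matches the target speaker's domain, synthesize it with a personalized TTS model, and fine-tune the ASR model on the resulting pairs. It is a minimal, hypothetical sketch, not the authors' code; every name in it (select_texts, personalize_asr, tts_model.synthesize, encode_labels) is a placeholder, and the CTC objective is an assumption standing in for whatever loss the actual ASR model uses.

```python
import torch
import torch.nn.functional as F

def select_texts(candidate_texts, speaker_texts, tokenize, top_k=500):
    """Rank candidate sentences by vocabulary overlap with text from the
    target speaker's domain; a stand-in for the paper's content-based
    data selection, not its actual criterion."""
    speaker_vocab = set()
    for t in speaker_texts:
        speaker_vocab.update(tokenize(t))
    def overlap(text):
        tokens = tokenize(text)
        return sum(tok in speaker_vocab for tok in tokens) / max(len(tokens), 1)
    return sorted(candidate_texts, key=overlap, reverse=True)[:top_k]

def personalize_asr(asr_model, tts_model, texts, epochs=3, lr=1e-5):
    """Fine-tune a pre-trained ASR model on synthetic (audio, text) pairs."""
    optimizer = torch.optim.AdamW(asr_model.parameters(), lr=lr)
    for _ in range(epochs):
        for text in texts:
            audio = tts_model.synthesize(text)       # personalized TTS (placeholder API)
            log_probs = asr_model(audio)             # assume (time, vocab) log-posteriors
            targets = asr_model.encode_labels(text)  # placeholder label encoding
            loss = F.ctc_loss(
                log_probs.unsqueeze(1),              # (T, N=1, C) as ctc_loss expects
                targets.unsqueeze(0),                # (N=1, S)
                input_lengths=torch.tensor([log_probs.shape[0]]),
                target_lengths=torch.tensor([targets.shape[0]]),
            )
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return asr_model
```

Consistent with the paper's central finding, only the select_texts step depends on speech content; the speaking style is left entirely to the TTS model.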
Related papers
- Communication-Efficient Personalized Federated Learning for Speech-to-Text Tasks [66.78640306687227]
To protect privacy and meet legal regulations, federated learning (FL) has gained significant attention for training speech-to-text (S2T) systems.
The commonly used FL approach in S2T tasks (i.e., FedAvg) typically suffers from extensive communication overhead.
We propose a personalized federated S2T framework that introduces FedLoRA, a lightweight LoRA module for client-side tuning and interaction with the server, and FedMem, a global model equipped with a $k$-nearest-neighbor classifier.
arXiv Detail & Related papers (2024-01-18T15:39:38Z)
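
The entry above gives only the component names; as a hedged illustration of the underlying mechanism, the sketch below shows a generic LoRA-wrapped linear layer and why the client-to-server payload is small (only the low-rank matrices travel). It sketches LoRA in general under stated assumptions, not the paper's actual FedLoRA or FedMem design.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update,
    the generic mechanism behind LoRA-style client-side tuning."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # global weights stay frozen on the client
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # base output plus the scaled low-rank correction x @ A^T @ B^T
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

def client_payload(model):
    """Only the low-rank matrices are exchanged with the server, which is
    where the communication saving over full-model FedAvg comes from."""
    return {k: v for k, v in model.state_dict().items() if "lora_" in k}
```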
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated by experiments on the VoxCeleb and SITW datasets, with average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Leveraging Speech PTM, Text LLM, and Emotional TTS for Speech Emotion Recognition [42.09340937787435]
We investigated the representation ability of different speech self-supervised pre-trained models.
We employed a powerful large language model (LLM), GPT-4, and an emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech.
arXiv Detail & Related papers (2023-09-19T03:52:01Z)
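
As a rough sketch of this augmentation loop, with llm_generate and tts_synthesize as hypothetical placeholders for the GPT-4 and Azure TTS calls rather than real APIs:

```python
EMOTIONS = ["happy", "sad", "angry", "neutral"]  # assumed label set

def build_augmented_set(llm_generate, tts_synthesize, n_per_emotion=100):
    """Generate emotionally congruent text with an LLM, voice it with an
    emotional TTS engine, and collect (audio, text, label) triples to
    augment a speech emotion recognition training set."""
    dataset = []
    for emotion in EMOTIONS:
        prompt = (f"Write {n_per_emotion} short sentences a person might "
                  f"plausibly say while feeling {emotion}.")
        for sentence in llm_generate(prompt):                # placeholder LLM call
            audio = tts_synthesize(sentence, style=emotion)  # placeholder TTS call
            dataset.append({"audio": audio, "text": sentence, "label": emotion})
    return dataset
```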
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model for high-fidelity zero-shot style transfer to out-of-domain (OOD) custom voices.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
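
As a loose, hypothetical sketch of that two-branch decomposition (the paper's actual adaptor components are considerably more elaborate):

```python
import torch.nn as nn

class StyleDecomposedTTS(nn.Module):
    """Toy two-pathway model: one branch encodes what is shared across
    styles, the other conditions on a reference utterance. A conceptual
    sketch only, not GenerSpeech's architecture."""
    def __init__(self, content_encoder, style_encoder, decoder):
        super().__init__()
        self.content_encoder = content_encoder  # style-agnostic pathway
        self.style_encoder = style_encoder      # style-specific pathway
        self.decoder = decoder

    def forward(self, text, reference_audio):
        content = self.content_encoder(text)         # style-agnostic part
        style = self.style_encoder(reference_audio)  # style-specific part
        return self.decoder(content + style)         # assumes matching dims
```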
- Residual-guided Personalized Speech Synthesis based on Face Image [14.690030837311376]
Previous works derive personalized speech features by training a model on a large dataset of that person's speech recordings.
In this work, we innovatively extract personalized speech features from human faces to synthesize personalized speech using a neural vocoder.
arXiv Detail & Related papers (2022-04-01T15:27:14Z)
- Data-augmented cross-lingual synthesis in a teacher-student framework [3.2548794659022398]
Cross-lingual synthesis is the task of letting a speaker generate fluent synthetic speech in another language.
Previous research shows that many models appear to have insufficient generalization capabilities.
We propose to apply the teacher-student paradigm to cross-lingual synthesis.
arXiv Detail & Related papers (2022-03-31T20:01:32Z)
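
A minimal sketch of a generic teacher-student update, assuming spectrogram-level distillation; the paper's exact losses and augmentation pipeline are not reproduced here:

```python
import torch
import torch.nn.functional as F

def distillation_step(student, teacher, text, optimizer):
    """One teacher-student update: the student learns to match the
    teacher's synthetic output for the same input text."""
    with torch.no_grad():
        target_mel = teacher(text)   # teacher's synthetic mel-spectrogram
    pred_mel = student(text)
    loss = F.l1_loss(pred_mel, target_mel)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```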
- Cross-speaker style transfer for text-to-speech using data augmentation [11.686745250628247]
We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion.
We assume access to a corpus of neutral, non-expressive data from a target speaker and supporting conversational expressive data from different speakers.
We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages.
arXiv Detail & Related papers (2022-02-10T15:10:56Z)
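
A hypothetical sketch of that augmentation step, assuming a voice-conversion model with a convert method and simple corpus objects (all placeholder APIs, not the paper's code):

```python
def build_style_transfer_corpus(vc_model, target_corpus, expressive_corpus):
    """Voice-convert expressive utterances from supporting speakers into
    the target speaker's voice, then pool them with the target speaker's
    neutral recordings to train an expressive TTS voice."""
    converted = [
        {"audio": vc_model.convert(utt["audio"],
                                   target_speaker=target_corpus.speaker_id),
         "text": utt["text"],
         "style": utt["style"]}
        for utt in expressive_corpus
    ]
    return list(target_corpus.utterances) + converted
```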
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech in one language directly into speech in another.
We tackle the challenge in modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset through an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
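
A hedged sketch of that conditioning idea, with placeholder encoder and decoder modules rather than the paper's model:

```python
import torch.nn as nn

class TextEmotionTTS(nn.Module):
    """Predict an emotion label from the input text alone, then condition
    synthesis on the corresponding embedding, so no reference audio is
    needed at inference time."""
    def __init__(self, text_encoder, decoder, n_emotions=7, dim=256):
        super().__init__()
        self.text_encoder = text_encoder          # assumed to return (batch, dim)
        self.emotion_head = nn.Linear(dim, n_emotions)
        self.emotion_table = nn.Embedding(n_emotions, dim)
        self.decoder = decoder

    def forward(self, text_tokens):
        h = self.text_encoder(text_tokens)         # text summary vector
        emotion = self.emotion_head(h).argmax(-1)  # predicted emotion label
        cond = h + self.emotion_table(emotion)     # condition on embedding
        return self.decoder(cond), emotion
```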
- Noise Robust TTS for Low Resource Speakers using Pre-trained Model and Speech Enhancement [31.33429812278942]
The proposed end-to-end speech synthesis model uses both a speaker embedding and a noise representation as conditional inputs to model speaker and noise information, respectively.
Experimental results show that speech generated by the proposed approach has better subjective evaluation results than directly fine-tuning a multi-speaker speech synthesis model.
arXiv Detail & Related papers (2020-05-26T06:14:06Z)
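
A hedged sketch of that dual-conditioning scheme, with placeholder text encoder and decoder modules (not the paper's implementation):

```python
import torch.nn as nn

class NoiseConditionedTTS(nn.Module):
    """Condition synthesis on a speaker embedding and a separate noise
    representation, so the model can attribute background noise in the
    training data to the noise code rather than to the voice."""
    def __init__(self, text_encoder, decoder, n_speakers, dim=256):
        super().__init__()
        self.text_encoder = text_encoder
        self.speaker_table = nn.Embedding(n_speakers, dim)
        self.noise_encoder = nn.GRU(input_size=80, hidden_size=dim,
                                    batch_first=True)  # summarizes a mel clip
        self.decoder = decoder

    def forward(self, text_tokens, speaker_id, noise_mel):
        content = self.text_encoder(text_tokens)
        speaker = self.speaker_table(speaker_id)
        _, noise = self.noise_encoder(noise_mel)  # final hidden state
        return self.decoder(content, speaker + noise.squeeze(0))
```

At inference, a clean (near-silent) noise reference would presumably steer the model toward clean speech for the target speaker.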