Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization
- URL: http://arxiv.org/abs/2509.21718v1
- Date: Fri, 26 Sep 2025 00:28:50 GMT
- Title: Align2Speak: Improving TTS for Low Resource Languages via ASR-Guided Online Preference Optimization
- Authors: Shehzeen Hussain, Paarth Neekhara, Xuesong Yang, Edresson Casanova, Subhankar Ghosh, Roy Fejgin, Ryan Langman, Mikyas Desta, Leili Tavabi, Jason Li
- Abstract summary: We propose a framework to adapt an autoregressive, multilingual TTS model to new languages. We fine-tune this model on limited paired data of the new languages to capture the target language's prosodic features. Experiments demonstrate that this pipeline produces intelligible and speaker-consistent speech in low-resource languages.
- Score: 13.222167833914924
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing high-quality text-to-speech (TTS) systems for low-resource languages is challenging due to the scarcity of paired text and speech data. In contrast, automatic speech recognition (ASR) models for such languages are often more accessible, owing to large-scale multilingual pre-training efforts. We propose a framework based on Group Relative Policy Optimization (GRPO) to adapt an autoregressive, multilingual TTS model to new languages. Our method first establishes a language-agnostic foundation for TTS synthesis by training a multilingual baseline with International Phonetic Alphabet (IPA) tokens. Next, we fine-tune this model on limited paired data of the new languages to capture the target language's prosodic features. Finally, we apply GRPO to optimize the model using only unpaired text and speaker prompts, guided by a multi-objective reward from pretrained ASR, speaker verification, and audio quality estimation models. Experiments demonstrate that this pipeline produces intelligible and speaker-consistent speech in low-resource languages, substantially outperforming fine-tuning alone. Furthermore, our GRPO-based framework also improves TTS performance in high-resource languages, surpassing offline alignment methods such as Direct Preference Optimization (DPO) and yielding superior intelligibility, speaker similarity, and audio quality.
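The abstract describes a multi-objective reward (ASR intelligibility, speaker similarity, audio quality) feeding a GRPO update. Below is a minimal Python sketch of how such a reward and the group-relative advantage normalization at the core of GRPO could look; the reward weights, score ranges, and function names are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch: blend the three reward signals named in the abstract
# and compute GRPO-style group-relative advantages. Weights and score
# conventions (WER in [0, 1], similarity/quality in [0, 1]) are assumptions.
from statistics import mean, pstdev

def combined_reward(wer, spk_sim, audio_quality,
                    w_asr=1.0, w_spk=0.5, w_aq=0.5):
    """Scalar reward: lower WER is better, higher similarity/quality is better."""
    return w_asr * (1.0 - wer) + w_spk * spk_sim + w_aq * audio_quality

def group_relative_advantages(rewards):
    """GRPO normalizes each candidate's reward against the group mean and
    standard deviation, avoiding the need for a learned value critic."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-8) for r in rewards]

# Rewards for a group of candidate utterances sampled for one text prompt.
group = [combined_reward(0.10, 0.80, 0.90),
         combined_reward(0.35, 0.75, 0.85),
         combined_reward(0.20, 0.82, 0.88)]
advantages = group_relative_advantages(group)
```

In an actual training loop, these advantages would weight the policy-gradient term for each sampled utterance, so candidates that score above their group's mean are reinforced and below-mean candidates are suppressed.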
Related papers
- Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis [5.283520143851873]
We present a novel methodology that integrates a data-optimized framework with an advanced acoustic model to build high-quality TTS systems. We demonstrate the effectiveness of our approach using Thai as an illustrative case, where intricate phonetic rules and sparse resources are effectively addressed.
arXiv Detail & Related papers (2025-04-10T15:32:57Z) - Whisper-LM: Improving ASR Models with Language Models for Low-Resource Languages [0.43498389175652036]
This study integrates traditional and novel language models with fine-tuned Whisper models to raise their performance in less commonly studied languages. We demonstrate substantial improvements in word error rate, particularly in low-resource scenarios. While the integration reliably benefits all model sizes, the extent of improvement varies, highlighting the importance of optimized language model parameters.
arXiv Detail & Related papers (2025-03-30T18:03:52Z) - Whisper Speaker Identification: Leveraging Pre-Trained Multilingual Transformers for Robust Speaker Embeddings [0.0]
We propose WSI (Whisper Speaker Identification), a framework that repurposes the Whisper automatic speech recognition model pre-trained on extensive multilingual data. By capitalizing on Whisper's language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages.
arXiv Detail & Related papers (2025-03-13T15:11:28Z) - Enhancing Multilingual ASR for Unseen Languages via Language Embedding Modeling [50.62091603179394]
Whisper, one of the most advanced ASR models, handles 99 languages effectively. However, Whisper struggles with unseen languages, those not included in its pre-training. We propose methods that exploit these relationships to enhance ASR performance on unseen languages.
arXiv Detail & Related papers (2024-12-21T04:05:43Z) - Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. We propose an approach to enhance SER performance in languages with scarce SER resources by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z) - SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition [55.2480439325792]
Speech Meta In-Context LEarning (SMILE) is an innovative framework that combines meta-learning with speech in-context learning (SICL). We show that SMILE consistently outperforms baseline methods in training-free few-shot multilingual ASR tasks.
arXiv Detail & Related papers (2024-09-16T16:04:16Z) - An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Language-universal phonetic encoder for low-resource speech recognition [28.21805271848413]
We leverage International Phonetic Alphabet (IPA) based language-universal phonetic model to improve low-resource ASR performances.
Our approach and adaptation are effective on extremely low-resource languages, even within domain- and language-mismatched scenarios.
arXiv Detail & Related papers (2023-05-19T10:24:30Z) - ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z) - Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio [8.510792628268824]
Bootstrapping speech recognition on limited data resources has long been an area of active research.
We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer based automatic speech recognition (ASR) system in the low-resource regime.
Our experiments demonstrate that transfer learning from a multilingual model, a post-ASR text-to-text mapping, and synthetic audio deliver additive improvements.
arXiv Detail & Related papers (2020-11-25T13:11:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.