Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition
- URL: http://arxiv.org/abs/2511.20534v1
- Date: Tue, 25 Nov 2025 17:35:57 GMT
- Title: Bridging the Language Gap: Synthetic Voice Diversity via Latent Mixup for Equitable Speech Recognition
- Authors: Wesley Bian, Xiaofeng Lin, Guang Cheng
- Abstract summary: Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. We introduce a novel data augmentation technique for speech corpora designed to mitigate this gap.
- Score: 8.948233216872211
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern machine learning models for audio tasks often exhibit superior performance on English and other well-resourced languages, primarily due to the abundance of available training data. This disparity leads to an unfair performance gap for low-resource languages, where data collection is both challenging and costly. In this work, we introduce a novel data augmentation technique for speech corpora designed to mitigate this gap. Through comprehensive experiments, we demonstrate that our method significantly improves the performance of automatic speech recognition systems on low-resource languages. Furthermore, we show that our approach outperforms existing augmentation strategies, offering a practical solution for enhancing speech technology in underrepresented linguistic communities.
Related papers
- Empowering Global Voices: A Data-Efficient, Phoneme-Tone Adaptive Approach to High-Fidelity Speech Synthesis [5.283520143851873]
We present a novel methodology that integrates a data-optimized framework with an advanced acoustic model to build high-quality TTS systems. We demonstrate the effectiveness of our approach using Thai as an illustrative case, where intricate phonetic rules and sparse resources are effectively addressed.
arXiv Detail & Related papers (2025-04-10T15:32:57Z)
- Bridging The Multi-Modality Gaps of Audio, Visual and Linguistic for Speech Enhancement [36.136070412464214]
Speech enhancement (SE) aims to improve the quality and intelligibility of speech in noisy environments. Recent studies have shown that incorporating visual cues in audio signal processing can enhance SE performance. We propose a novel multi-modal learning framework, termed DLAV-SE, which leverages a diffusion-based model integrating audio, visual, and linguistic information.
arXiv Detail & Related papers (2025-01-23T04:36:29Z)
- Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection [49.27067541740956]
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. Building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. We propose an approach to enhance SER performance in low SER resource languages by leveraging data from high-resource languages.
arXiv Detail & Related papers (2024-09-17T08:36:45Z)
- SMILE: Speech Meta In-Context Learning for Low-Resource Language Automatic Speech Recognition [55.2480439325792]
Speech Meta In-Context LEarning (SMILE) is an innovative framework that combines meta-learning with speech in-context learning (SICL). We show that SMILE consistently outperforms baseline methods in training-free few-shot multilingual ASR tasks.
arXiv Detail & Related papers (2024-09-16T16:04:16Z)
- Large Language Models for Dysfluency Detection in Stuttered Speech [16.812800649507302]
Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components.
Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of non-lexical inputs, we approach the task of multi-label dysfluency detection as a language modeling problem.
We present hypothesis candidates generated with an automatic speech recognition system and acoustic representations extracted from an audio encoder model to an LLM, and finetune the system to predict dysfluency labels on three datasets containing English and German stuttered speech.
arXiv Detail & Related papers (2024-06-16T17:51:22Z)
- Reduce, Reuse, Recycle: Is Perturbed Data better than Other Language augmentation for Low Resource Self-Supervised Speech Models [48.44820587495038]
Self-supervised representation learning (SSRL) has demonstrated performance superior to that of supervised models for tasks including phoneme recognition.
Training SSRL models poses a challenge for low-resource languages where sufficient pre-training data may not be available.
We propose to use audio augmentation techniques, namely: pitch variation, noise addition, accented target language and other language speech to pre-train SSRL models in a low resource condition and evaluate phoneme recognition.
arXiv Detail & Related papers (2023-09-22T10:09:09Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- Cross-lingual Transfer for Speech Processing using Acoustic Language Similarity [81.51206991542242]
Cross-lingual transfer offers a compelling way to help bridge this digital divide.
Current cross-lingual algorithms have shown success in text-based tasks and speech-related tasks over some low-resource languages.
We propose a language similarity approach that can efficiently identify acoustic cross-lingual transfer pairs across hundreds of languages.
arXiv Detail & Related papers (2021-11-02T01:55:17Z)
- Self-Supervised Representations Improve End-to-End Speech Translation [57.641761472372814]
We show that self-supervised pre-trained features can consistently improve the translation performance.
Cross-lingual transfer allows the approach to extend to a variety of languages with little or no tuning.
arXiv Detail & Related papers (2020-06-22T10:28:38Z)
- Meta-Transfer Learning for Code-Switched Speech Recognition [72.84247387728999]
We propose a new learning method, meta-transfer learning, to transfer learn on a code-switched speech recognition system in a low-resource setting.
Our model learns to recognize individual languages, and transfer them so as to better recognize mixed-language speech by conditioning the optimization on the code-switching data.
arXiv Detail & Related papers (2020-04-29T14:27:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.