SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution
- URL: http://arxiv.org/abs/2510.25178v1
- Date: Mon, 27 Oct 2025 21:39:07 GMT
- Title: SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution
- Authors: Dharma Teja Donepudi
- Abstract summary: Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Intra-sentence multilingual speech synthesis (code-switching TTS) remains a major challenge due to abrupt language shifts, varied scripts, and mismatched prosody between languages. Conventional TTS systems are typically monolingual and fail to produce natural, intelligible speech in mixed-language contexts. We introduce Script-First Multilingual Synthesis with Adaptive Locale Resolution (SFMS-ALR), an engine-agnostic framework for fluent, real-time code-switched speech generation. SFMS-ALR segments input text by Unicode script, applies adaptive language identification to determine each segment's language and locale, and normalizes prosody using sentiment-aware adjustments to preserve expressive continuity across languages. The algorithm generates a unified SSML representation with appropriate "lang" or "voice" spans and synthesizes the utterance in a single TTS request. Unlike end-to-end multilingual models, SFMS-ALR requires no retraining and integrates seamlessly with existing voices from Google, Apple, Amazon, and other providers. Comparative analysis with data-driven pipelines such as Unicom and Mask LID demonstrates SFMS-ALR's flexibility, interpretability, and immediate deployability. The framework establishes a modular baseline for high-quality, engine-independent multilingual TTS and outlines evaluation strategies for intelligibility, naturalness, and user preference.
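The pipeline the abstract describes (segment by Unicode script, resolve each segment to a locale, wrap segments in SSML "lang" spans, synthesize in one request) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the fixed script-to-locale table, the `en-US` fallback, and the Unicode-name-prefix heuristic for script detection are all assumptions of this sketch — the paper's Adaptive Locale Resolution infers language and locale adaptively rather than from a static map.

```python
import unicodedata

# Hypothetical script-to-locale table (an assumption of this sketch);
# SFMS-ALR's Adaptive Locale Resolution would choose locales from context.
SCRIPT_TO_LOCALE = {
    "LATIN": "en-US",
    "DEVANAGARI": "hi-IN",
    "CJK": "zh-CN",        # Han ideographs ("CJK UNIFIED IDEOGRAPH-...")
    "HIRAGANA": "ja-JP",
    "KATAKANA": "ja-JP",
}
DEFAULT_LOCALE = "en-US"   # fallback for unmapped scripts (assumption)

def script_of(ch: str) -> str:
    """Coarse script tag from the first word of the character's Unicode name.
    Spaces, digits, punctuation, and combining marks return COMMON so they
    inherit the locale of the surrounding run."""
    if not ch.isalpha():
        return "COMMON"
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def segment(text: str):
    """Split text into (locale, chunk) runs of uniform script."""
    runs = []
    for ch in text:
        loc = SCRIPT_TO_LOCALE.get(script_of(ch))
        if runs and (loc is None or loc == runs[-1][0]):
            runs[-1][1] += ch          # extend current run
        else:
            runs.append([loc or DEFAULT_LOCALE, ch])
    return runs

def to_ssml(text: str, base: str = DEFAULT_LOCALE) -> str:
    """Emit one SSML document with <lang> spans for non-base-locale runs."""
    spans = "".join(
        chunk if loc == base else f'<lang xml:lang="{loc}">{chunk}</lang>'
        for loc, chunk in segment(text)
    )
    return f'<speak xml:lang="{base}">{spans}</speak>'
```

For example, `to_ssml("Hello नमस्ते world")` wraps only the Devanagari run in a `hi-IN` span, so the whole mixed utterance can be sent to a provider's TTS endpoint as a single SSML request, matching the engine-agnostic, no-retraining design the abstract claims.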
Related papers
- PART: Progressive Alignment Representation Training for Multilingual Speech-To-Text with LLMs [58.2469845374385]
We introduce Progressive Alignment Representation Training (PART). PART is a multi-stage and multi-task framework that separates within-language from cross-language alignment. Experiments on CommonVoice 15, Fleurs, Wenetspeech, and CoVoST2 show that PART surpasses conventional approaches.
arXiv Detail & Related papers (2025-09-24T03:54:14Z) - SwitchLingua: The First Large-Scale Multilingual and Multi-Ethnic Code-Switching Dataset [34.40254709148148]
Code-Switching (CS) is the alternating use of two or more languages within a conversation or utterance. This linguistic phenomenon poses challenges for Automatic Speech Recognition (ASR) systems. SwitchLingua is the first large-scale multilingual and multi-ethnic code-switching dataset.
arXiv Detail & Related papers (2025-05-30T05:54:46Z) - LAMA-UT: Language Agnostic Multilingual ASR through Orthography Unification and Language-Specific Transliteration [25.693176812512196]
We introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules while matching the performance of state-of-the-art models trained on a minimal amount of data. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS.
arXiv Detail & Related papers (2024-12-19T10:39:08Z) - Enhancing Code-switched Text-to-Speech Synthesis Capability in Large Language Models with only Monolingual Corpora [13.891322931352649]
We propose a Code-Switched Large Language Model (CS-LLM) to enhance the code-switched text-to-speech synthesis (CS TTS) capability. Specifically, we begin by enhancing the multilingual speech processing ability of LLMs through multilingual speech recognition and synthesis tasks. We develop an effective code-switched (CS) data construction strategy that splits and concatenates words from different monolingual speech corpora to equip LLMs with improved CS TTS ability.
arXiv Detail & Related papers (2024-09-17T08:11:07Z) - Soft Language Clustering for Multilingual Model Pre-training [57.18058739931463]
We propose XLM-P, which contextually retrieves prompts as flexible guidance for encoding instances conditionally.
Our XLM-P enables (1) lightweight modeling of language-invariant and language-specific knowledge across languages, and (2) easy integration with other multilingual pre-training methods.
arXiv Detail & Related papers (2023-06-13T08:08:08Z) - Enhancing Cross-lingual Natural Language Inference by Soft Prompting with Multilingual Verbalizer [52.46740830977898]
Cross-lingual natural language inference is a fundamental problem in cross-lingual language understanding.
We propose a novel Soft prompt learning framework with the Multilingual Verbalizer (SoftMV) for XNLI.
arXiv Detail & Related papers (2023-05-22T06:31:29Z) - LAMASSU: Streaming Language-Agnostic Multilingual Speech Recognition and Translation Using Neural Transducers [71.76680102779765]
Automatic speech recognition (ASR) and speech translation (ST) can both use neural transducers as the model structure.
We propose LAMASSU, a streaming language-agnostic multilingual speech recognition and translation model using neural transducers.
arXiv Detail & Related papers (2022-11-05T04:03:55Z) - Towards Zero-Shot Code-Switched Speech Recognition [44.76492452463019]
We seek to build effective code-switched (CS) automatic speech recognition systems (ASR) under the zero-shot setting.
We propose to simplify each monolingual module by allowing them to transcribe all speech segments indiscriminately with a monolingual script.
We apply this transliteration-based approach in an end-to-end differentiable neural network and demonstrate its efficacy for zero-shot CS ASR on Mandarin-English SEAME test sets.
arXiv Detail & Related papers (2022-11-02T19:52:54Z) - LAE: Language-Aware Encoder for Monolingual and Multilingual ASR [87.74794847245536]
A novel language-aware encoder (LAE) architecture is proposed to handle both situations by disentangling language-specific information.
Experiments conducted on Mandarin-English code-switched speech suggest that the proposed LAE is capable of discriminating between different languages at the frame level.
arXiv Detail & Related papers (2022-06-05T04:03:12Z) - FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding [85.29270319872597]
We propose an enhanced fusion method that takes cross-lingual data as input for XLM finetuning.
During inference, the model makes predictions based on the text input in the target language and its translation in the source language.
To tackle the discrepancy between training and inference, we propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language.
arXiv Detail & Related papers (2020-09-10T22:42:15Z) - Phonological Features for 0-shot Multilingual Speech Synthesis [50.591267188664666]
We show that code-switching is possible for languages unseen during training, even within monolingual models.
We generate intelligible, code-switched speech in a new language at test time, including the approximation of sounds never seen in training.
arXiv Detail & Related papers (2020-08-06T18:25:18Z)