CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation
- URL: http://arxiv.org/abs/2511.11104v1
- Date: Fri, 14 Nov 2025 09:29:10 GMT
- Title: CLARITY: Contextual Linguistic Adaptation and Accent Retrieval for Dual-Bias Mitigation in Text-to-Speech Generation
- Authors: Crystal Min Hui Poon, Pai Chet Ng, Xiaoxiao Miao, Immanuel Jun Kai Loh, Bowen Zhang, Haoyu Song, Ian McLoughlin
- Abstract summary: Two biases persist in instruction-guided text-to-speech research: accent bias and linguistic bias. We present Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY), a backbone-agnostic framework that addresses these biases through dual-signal optimization.
- Score: 15.730246391986002
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Instruction-guided text-to-speech (TTS) research has reached a maturity level where excellent speech generation quality is possible on demand, yet two coupled biases persist: accent bias, where models default to dominant phonetic patterns, and linguistic bias, where dialect-specific lexical and cultural cues are ignored. These biases are interdependent, as authentic accent generation requires both accent fidelity and localized text. We present Contextual Linguistic Adaptation and Retrieval for Inclusive TTS sYnthesis (CLARITY), a backbone-agnostic framework that addresses these biases through dual-signal optimization: (i) contextual linguistic adaptation that localizes input text to the target dialect, and (ii) retrieval-augmented accent prompting (RAAP) that supplies accent-consistent speech prompts. Across twelve English accents, CLARITY improves accent accuracy and fairness while maintaining strong perceptual quality.
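The retrieval-augmented accent prompting (RAAP) component described above selects accent-consistent speech prompts for the synthesizer. The paper does not publish implementation details here, so the following is a minimal sketch of the general retrieval idea only: nearest-neighbor lookup over a bank of accent prompt embeddings by cosine similarity. The names `retrieve_accent_prompt` and `prompt_bank`, and the toy 2-dimensional embeddings, are illustrative assumptions, not the authors' code.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve_accent_prompt(query_emb, prompt_bank):
    """Return the id of the stored speech prompt whose accent
    embedding is most similar to the query embedding."""
    return max(prompt_bank, key=lambda pid: cosine(query_emb, prompt_bank[pid]))

# Toy bank: prompt id -> (hypothetical) accent embedding.
prompt_bank = {
    "sg_prompt": [1.0, 0.0],   # e.g. a Singapore-English reference clip
    "us_prompt": [0.0, 1.0],   # e.g. a US-English reference clip
}

# A query embedding close to the Singapore-English region retrieves that prompt.
best = retrieve_accent_prompt([0.9, 0.1], prompt_bank)
print(best)  # → sg_prompt
```

In a real system the embeddings would come from an accent or speaker encoder, and the retrieved clip would condition the TTS backbone as its speech prompt; the retrieval step itself stays this simple.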
Related papers
- Quantifying Speaker Embedding Phonological Rule Interactions in Accented Speech Synthesis [44.55147169458465]
We analyze the interaction between speaker embeddings and linguistically motivated phonological rules in accented speech synthesis. Experiments show that combining rules with embeddings yields more authentic accents. Our findings highlight rules as a lever for accent control and a framework for evaluating disentanglement in speech generation.
arXiv Detail & Related papers (2026-01-20T19:25:33Z) - Listening or Reading? Evaluating Speech Awareness in Chain-of-Thought Speech-to-Text Translation [12.571782794778182]
Chain-of-Thought (CoT) prompting has been introduced, with the expectation that jointly accessing speech and transcription will overcome these issues. We find that it largely mirrors cascaded behavior, relying mainly on transcripts while barely leveraging speech. Simple training interventions, such as adding direct S2TT data or noisy transcript injection, enhance robustness and increase speech attribution.
arXiv Detail & Related papers (2025-10-03T15:42:38Z) - Towards Inclusive Communication: A Unified Framework for Generating Spoken Language from Sign, Lip, and Audio [52.859261069569165]
We propose the first unified framework capable of handling diverse combinations of sign language, lip movements, and audio for spoken-language text generation. We focus on three main objectives: (i) designing a unified, modality-agnostic architecture capable of effectively processing heterogeneous inputs; (ii) exploring the underexamined synergy among modalities, particularly the role of lip movements as non-manual cues in sign language comprehension; and (iii) achieving performance on par with or better than state-of-the-art models specialized for individual tasks.
arXiv Detail & Related papers (2025-08-28T06:51:42Z) - Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z) - Languages in Multilingual Speech Foundation Models Align Both Phonetically and Semantically [58.019484208091534]
Cross-lingual alignment in pretrained language models (LMs) has enabled efficient transfer in text-based LMs. It remains an open question whether findings and methods from text-based cross-lingual alignment apply to speech.
arXiv Detail & Related papers (2025-05-26T07:21:20Z) - Language translation, and change of accent for speech-to-speech task using diffusion model [16.436756456803774]
Speech-to-speech translation (S2ST) aims to convert spoken input in one language to spoken output in another. We propose a unified approach for simultaneous speech translation and change of accent.
arXiv Detail & Related papers (2025-05-04T23:23:46Z) - Transfer the linguistic representations from TTS to accent conversion with non-parallel data [7.376032484438044]
Accent conversion aims to convert the accent of a source speech to a target accent, preserving the speaker's identity.
This paper introduces a novel non-autoregressive framework for accent conversion that learns accent-agnostic linguistic representations and employs them to convert the accent in the source speech.
arXiv Detail & Related papers (2024-01-07T16:39:34Z) - Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT)
Model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z) - DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech [30.110058338155675]
Cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres.
We propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style.
By combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis.
arXiv Detail & Related papers (2023-06-25T06:46:36Z) - Explicit Intensity Control for Accented Text-to-speech [65.35831577398174]
How to control the intensity of an accent during TTS is an interesting research direction.
Recent work designs a speaker-adversarial loss to disentangle speaker and accent information, then adjusts the loss weight to control accent intensity.
This paper proposes a new intuitive and explicit accent intensity control scheme for accented TTS.
arXiv Detail & Related papers (2022-10-27T12:23:41Z) - Cross-lingual Low Resource Speaker Adaptation Using Phonological Features [2.8080708404213373]
We train a language-agnostic multispeaker model conditioned on a set of phonologically derived features common across different languages.
With as few as 32 and 8 utterances of target speaker data, we obtain high speaker similarity scores and naturalness comparable to the corresponding literature.
arXiv Detail & Related papers (2021-11-17T12:33:42Z) - Limited Data Emotional Voice Conversion Leveraging Text-to-Speech:
Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.