UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech
- URL: http://arxiv.org/abs/2508.09767v2
- Date: Tue, 16 Sep 2025 10:28:11 GMT
- Title: UtterTune: LoRA-Based Target-Language Pronunciation Edit and Control in Multilingual Text-to-Speech
- Authors: Shuhei Kato
- Abstract summary: UtterTune is a lightweight adaptation method that fine-tunes a multilingual text-to-speech system. It enables the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech.
- Score: 1.7462881580496206
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose UtterTune, a lightweight adaptation method that fine-tunes a multilingual text-to-speech (TTS) system based on a large language model (LLM) architecture, designed to enhance the controllability of pronunciation in a target language while preserving performance in others. While LLM architectures have enabled TTS models to achieve remarkable naturalness, accurately modeling grapheme-to-phoneme (G2P) mapping and prosody remains challenging, especially when the model omits an explicit G2P module and directly processes minimally encoded text (e.g., byte-pair encoding). UtterTune leverages low-rank adaptation to enable the control of segmental pronunciation and pitch accent at the phoneme level for Japanese speech, the target language in this paper, while maintaining naturalness and speaker similarity in a zero-shot setting. Objective and subjective evaluations confirm its effectiveness.
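The abstract's core mechanism is low-rank adaptation (LoRA): the pretrained weights stay frozen, and only a small rank-r update is trained. A minimal NumPy sketch of that formulation is below; the matrix sizes and the choice of which layer is adapted are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Minimal LoRA sketch, assuming a frozen weight matrix W and a trainable
# rank-r update B @ A with scaling alpha / r (the standard LoRA form).
# Which TTS layers UtterTune actually adapts is not shown here.

rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 8, 16, 2, 4  # illustrative sizes

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero-init

def lora_forward(x):
    """y = W x + (alpha / r) * B A x; only A and B receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# Because B starts at zero, the adapted model is exactly the base model
# before fine-tuning, which is why performance in other languages is
# preserved until the adapter is trained.
assert np.allclose(lora_forward(x), W @ x)
```

The zero-initialized `B` is the design choice that makes the adapter a no-op at the start of training, matching the paper's goal of leaving non-target languages untouched.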
Related papers
- BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs [84.59993864748195]
We propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a "conductor", understanding user instructions. A separate TTS model, the "orchestra", then generates the speech from these features.
arXiv Detail & Related papers (2025-09-30T16:52:14Z)
- Unseen Speaker and Language Adaptation for Lightweight Text-To-Speech with Adapters [3.7987175642397832]
We investigate cross-lingual Text-To-Speech synthesis through the lens of adapters. Results demonstrate the effectiveness of adapters in learning language-specific and speaker-specific information. The paper also provides insights into the impact of adapter placement, configuration, and the number of speakers used.
arXiv Detail & Related papers (2025-08-25T13:14:57Z)
- Generalized Multilingual Text-to-Speech Generation with Language-Aware Style Adaptation [18.89091877062589]
LanStyleTTS is a non-autoregressive, language-aware style adaptive TTS framework. It supports a unified multilingual TTS model capable of producing accurate and high-quality speech without the need to train language-specific models.
arXiv Detail & Related papers (2025-04-11T06:12:57Z)
- Description-based Controllable Text-to-Speech with Cross-Lingual Voice Control [14.145510487599932]
We propose a novel controllable text-to-speech (TTS) method with cross-lingual control capability.
We combine a TTS model trained on the target language with a description control model trained on another language, which maps input text descriptions to the conditional features of the TTS model.
Experiments on English and Japanese TTS demonstrate that our method achieves high naturalness and controllability for both languages.
arXiv Detail & Related papers (2024-09-26T01:08:09Z)
- StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech [13.713209707407712]
StyleSpeech is a novel Text-to-Speech (TTS) system that enhances the naturalness and accuracy of synthesized speech.
Building upon existing TTS technologies, StyleSpeech incorporates a unique Style Decorator structure that enables deep learning models to simultaneously learn style and phoneme features.
LoRA allows efficient adaptation of style features in pre-trained models.
arXiv Detail & Related papers (2024-08-27T00:37:07Z)
- An Initial Investigation of Language Adaptation for TTS Systems under Low-resource Scenarios [76.11409260727459]
This paper explores the language adaptation capability of ZMM-TTS, a recent SSL-based multilingual TTS system.
We demonstrate that the similarity in phonetics between the pre-training and target languages, as well as the language category, affects the target language's adaptation performance.
arXiv Detail & Related papers (2024-06-13T08:16:52Z)
- Textless Unit-to-Unit training for Many-to-Many Multilingual Speech-to-Speech Translation [65.13824257448564]
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation.
By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech.
We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST).
arXiv Detail & Related papers (2023-08-03T15:47:04Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.