Fugu-MT paper translation (abstract): BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

Paper overview: BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs

arxiv url: http://arxiv.org/abs/2509.26514v1
Date: Tue, 30 Sep 2025 16:52:14 GMT
Status: Translation complete
Last updated in system: 2025-10-01 14:45:00.216513
Title: BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
Title (reference translation): BatonVoice: An Operationalist Framework Supporting Controllable Speech Synthesis Using the Linguistic Intelligence of LLMs
Authors: Yue Wang, Ruotian Ma, Xingyu Chen, Zhengliang Shi, Wanshun Chen, Huang Liu, Jiadi Yao, Qu Yang, Qingxuan Jiang, Fanghua Ye, Juntao Li, Min Zhang, Zhaopeng Tu, Xiaolong Li, Linus
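Abstract summary: The paper proposes a new paradigm, inspired by "operationalism", that decouples instruction understanding from speech generation. It introduces BatonVoice, a framework in which an LLM acts as a "conductor" that understands user instructions and produces a textual plan of explicit vocal features. A separate TTS model, the "orchestra", then generates speech from these features.
Reference score (independently computed attention score): 84.59993864748195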
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders the model's ability to follow text instructions for controllable Text-to-Speech (TTS). To address this, we propose a new paradigm inspired by "operationalism" that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a "conductor", understanding user instructions and generating a textual "plan" of explicit vocal features (e.g., pitch, energy). A separate TTS model, the "orchestra", then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
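Abstract (reference translation): The rise of large language models (LLMs) is reshaping multimodal models, with speech synthesis as a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models and typically fail to exploit their strong instruction-following capabilities. This limitation hinders a model's ability to follow text instructions for controllable text-to-speech (TTS). The paper therefore proposes a new paradigm, inspired by "operationalism", that decouples instruction understanding from speech generation. BatonVoice is a framework in which an LLM acts as a "conductor" that understands user instructions and generates a textual "plan" of explicit vocal features (e.g., pitch, energy). A separate TTS model, the "orchestra", generates speech from these features. To realize this component, the authors develop BatonTTS, a TTS model trained specifically for this task. Experiments show that BatonVoice performs strongly in controllable and emotional speech synthesis, outperforming open-source and closed-source baselines. In particular, the approach enables remarkable zero-shot cross-lingual generalization, accurately applying its feature-control abilities to languages not seen during post-training. This shows that objectifying speech into textual vocal features can unlock the linguistic intelligence of LLMs more effectively.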
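
The conductor/orchestra split described in the abstract can be illustrated with a minimal sketch. Everything below is hypothetical: the `VocalPlan` fields, the JSON plan format, the label values, and the `conductor` and `orchestra` function names are assumptions made for illustration, not the paper's actual interface. The keyword rule stands in for an LLM, and the second stage returns a description rather than audio.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class VocalPlan:
    """Hypothetical textual plan; the abstract names pitch and energy only as examples."""
    text: str
    pitch: str   # assumed label vocabulary: "medium" or "high"
    energy: str  # assumed label vocabulary: "medium" or "high"


def conductor(instruction: str, text: str) -> str:
    """Stand-in for the LLM stage: turn an instruction into a textual plan.

    A real system would prompt an LLM here; this keyword rule is a placeholder.
    """
    excited = "excited" in instruction.lower()
    level = "high" if excited else "medium"
    plan = VocalPlan(text=text, pitch=level, energy=level)
    return json.dumps(asdict(plan))


def orchestra(plan_text: str) -> dict:
    """Stand-in for the TTS stage: consume only the textual plan.

    Returns a description of what would be rendered instead of a waveform.
    """
    plan = VocalPlan(**json.loads(plan_text))
    return {
        "text": plan.text,
        "rendered_with": {"pitch": plan.pitch, "energy": plan.energy},
    }


if __name__ == "__main__":
    plan_text = conductor("Sound excited", "We won!")
    print(plan_text)
    print(orchestra(plan_text))
```

The point of the sketch is the interface: the second stage sees only the plan text, never the original instruction, which mirrors the decoupling of instruction understanding from speech generation that the abstract describes.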

Related papers