Fugu-MT Paper Translation (Abstract): Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Paper overview: Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

arxiv url: http://arxiv.org/abs/2605.28063v1
Date: Wed, 27 May 2026 07:15:35 GMT
Status: Translation complete
System update date: 2026-05-28 17:38:55.83659
Title: Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts
Authors: Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen, Ruihua Song
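Abstract summary: We introduce a new task, Free-Form-Text-Prompt-to-Unified-Audio generation. PlanAudio is a unified, autoregressive LLM-based framework. We perform evaluations in the scenarios of speech, sound, and their composites.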
Reference score (independently calculated attention score): 20.986457042343684
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.
