Fugu-MT Paper Translation (Abstract): AudioStory: Generating Long-Form Narrative Audio with Large Language Models

Paper summary: AudioStory: Generating Long-Form Narrative Audio with Large Language Models

arxiv url: http://arxiv.org/abs/2508.20088v1
Date: Wed, 27 Aug 2025 17:55:38 GMT
Status: Translation complete
System update date: 2025-08-28 19:07:41.728779
Title: AudioStory: Generating Long-Form Narrative Audio with Large Language Models
Title (reference translation): AudioStory: Generation of Long-Form Narrative Audio with Large Language Models
Authors: Yuxin Guo, Teng Wang, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou, Ying Shan
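Abstract summary: AudioStory is a framework that integrates large language models with text-to-audio systems to generate structured, long-form audio narratives. It uses LLMs to decompose complex narrative queries into temporally ordered sub-tasks. Extensive experiments show that AudioStory surpasses prior TTA baselines on both single-audio generation and narrative audio generation, in both instruction-following ability and audio fidelity.
Reference score (independently computed attention score): 87.23256929520743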
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following reasoning generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory
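Abstract (reference translation): Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with longer narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems. AudioStory has strong instruction-following reasoning capabilities. It uses LLMs to decompose complex narrative queries into temporally ordered sub-tasks, enabling coherent scene transitions and a consistent emotional tone. (1) Decoupled bridging mechanism: AudioStory separates LLM-diffuser collaboration into two specialized components, namely a bridging query for intra-event semantic alignment and a residual query for preserving cross-event coherence. (2) End-to-end training: by unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory removes the need for modular training pipelines while enhancing synergy between components. In addition, we build the AudioStory-10K benchmark, which covers diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show that AudioStory surpasses prior TTA baselines on both single-audio generation and narrative audio generation, in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory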
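
Illustrative sketch (not the authors' code): the abstract describes decomposing a narrative query into temporally ordered sub-tasks with contextual cues and then generating audio for each one. The Python below is a minimal sketch of that control flow only, assuming the captions are already available instead of coming from an LLM and using a silent placeholder in place of a real TTA model. Every name in it (`AudioEvent`, `plan_events`, `generate_story`, `silent_synth`) is hypothetical and does not come from the AudioStory repository, and the bridging and residual queries are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class AudioEvent:
    """One temporally ordered sub-task (all field names are hypothetical)."""
    index: int          # position in the narrative timeline
    caption: str        # what should be heard in this segment
    duration_s: float   # target length of the segment in seconds
    context: str        # cue carried over from the previous event


def plan_events(captions: List[str], durations_s: List[float]) -> List[AudioEvent]:
    """Stand-in for the LLM planning step: order events and attach context.

    A real system would obtain the captions from an LLM; here they are given.
    """
    if len(captions) != len(durations_s):
        raise ValueError("captions and durations_s must have the same length")
    events = []
    context = ""
    for i, (caption, duration) in enumerate(zip(captions, durations_s)):
        events.append(AudioEvent(i, caption, duration, context))
        # In this sketch the cue passed forward is just the previous caption.
        context = caption
    return events


def generate_story(
    events: List[AudioEvent],
    synthesize: Callable[[str, str, float], List[float]],
) -> List[float]:
    """Generate each segment in timeline order and concatenate the samples."""
    samples: List[float] = []
    for event in sorted(events, key=lambda e: e.index):
        samples.extend(synthesize(event.caption, event.context, event.duration_s))
    return samples


def silent_synth(caption: str, context: str, duration_s: float,
                 sample_rate: int = 100) -> List[float]:
    """Placeholder TTA backend: returns silence of the requested length."""
    return [0.0] * int(round(duration_s * sample_rate))
```

Calling `plan_events(["rain starts", "thunder rolls"], [1.0, 2.5])` and passing the result to `generate_story` with `silent_synth` yields one continuous sample list. A real backend would replace `silent_synth` and would use `context` to keep consecutive segments coherent.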


Related papers