AudioStory: Generating Long-Form Narrative Audio with Large Language Models
- URL: http://arxiv.org/abs/2508.20088v2
- Date: Thu, 02 Oct 2025 18:00:27 GMT
- Title: AudioStory: Generating Long-Form Narrative Audio with Large Language Models
- Authors: Yuxin Guo, Teng Wang, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou, Ying Shan
- Abstract summary: AudioStory is a framework that integrates large language models with text-to-audio systems to generate structured, long-form audio narratives. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity.
- Score: 87.23256929520743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following and reasoning-driven generation capabilities. It employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory on both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory
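The abstract describes the decoupled bridging mechanism only at a high level. The PyTorch sketch below is a minimal illustration of how such a mechanism could be wired, assuming learnable query tokens appended to the LLM input and a diffusion TTA decoder conditioned on projected query states; all class names, dimensions, and the exact residual wiring are assumptions for illustration, not the paper's released implementation.

```python
import torch
import torch.nn as nn

class DecoupledBridge(nn.Module):
    """Hypothetical sketch of the decoupled bridging mechanism: a bridging
    query carries intra-event semantics to the TTA diffusion decoder, while
    a residual query threads cross-event context from one sub-event to the
    next. Dimensions and wiring are assumptions, not the released code."""

    def __init__(self, d_llm: int = 4096, d_cond: int = 768, n_query: int = 32):
        super().__init__()
        # Learnable query tokens appended to the LLM input for each sub-event.
        self.bridge_query = nn.Parameter(torch.randn(n_query, d_llm))
        self.residual_query = nn.Parameter(torch.randn(n_query, d_llm))
        # Projection from the LLM hidden size into the diffuser's conditioning space.
        self.to_cond = nn.Linear(d_llm, d_cond)

    def forward(self, llm_states: torch.Tensor, prev_residual=None):
        # llm_states: (2 * n_query, d_llm) hidden states read off at the
        # positions of the two query groups for the current sub-event.
        bridge, residual = llm_states.chunk(2, dim=0)
        if prev_residual is not None:
            # Cross-event coherence: fold the previous event's residual in.
            bridge = bridge + prev_residual
        cond = self.to_cond(bridge)  # conditioning for the diffusion TTA step
        return cond, residual
```

An outer loop would have the LLM decompose the narrative into ordered sub-events, run this bridge once per event while carrying `residual` forward, and feed `cond` to the diffusion decoder; training end to end would then backpropagate the diffusion loss through the queries into the LLM, consistent with the abstract's claim of unified training.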
Related papers
- Echo: Towards Advanced Audio Comprehension via Audio-Interleaved Reasoning [39.264735719707154]
Current efforts replicate text-based reasoning by contextualizing audio content through a one-time encoding. We propose audio-interleaved reasoning to break through this bottleneck. We present Echo, a LALM capable of dynamically re-listening to audio on demand during reasoning.
arXiv Detail & Related papers (2026-02-12T13:06:34Z) - Step-Audio 2 Technical Report [120.58375054866815]
Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding.
arXiv Detail & Related papers (2025-07-22T14:23:55Z) - ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent foley generation, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z) - Long-Video Audio Synthesis with Multi-Agent Collaboration [20.332328741375363]
LVAS-Agent is a novel framework that emulates professional dubbing workflows through collaborative role specialization. Our approach decomposes long-video audio synthesis into four steps: scene segmentation, script generation, sound design and audio synthesis. Central innovations include a discussion-correction mechanism for scene/script refinement and a generation-retrieval loop for temporal-semantic alignment.
arXiv Detail & Related papers (2025-03-13T07:58:23Z) - Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs. In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4. For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z) - WavJourney: Compositional Audio Creation with Large Language Models [38.39551216587242]
We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually described semantic, spatial and temporal conditions; a minimal sketch of this LLM-orchestration pattern appears after this list.
arXiv Detail & Related papers (2023-07-26T17:54:04Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)
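Several of the systems above (WavJourney, Audio-Agent, LVAS-Agent) share the same compositional pattern as AudioStory: a language model plans an ordered list of sound events, a TTA backend renders each one, and the clips are concatenated. The sketch below illustrates only that shared pattern; `plan_events` and `render_event` are hypothetical stand-ins, not APIs from any of the cited systems.

```python
from dataclasses import dataclass

@dataclass
class SoundEvent:
    caption: str       # text prompt for the TTA model
    duration_s: float  # target length of the clip

def plan_events(narrative: str) -> list[SoundEvent]:
    """Hypothetical planner: in the cited systems this is an LLM call that
    decomposes the narrative into temporally ordered sub-events."""
    raise NotImplementedError("replace with an LLM call returning a plan")

def render_event(event: SoundEvent) -> list[float]:
    """Hypothetical TTA backend returning mono samples for one event."""
    raise NotImplementedError("replace with a text-to-audio model")

def narrate(narrative: str) -> list[float]:
    # Plan first, then render each event in order and concatenate.
    # Real systems add cross-event conditioning here (e.g. AudioStory's
    # residual query) so that adjacent clips stay coherent.
    samples: list[float] = []
    for event in plan_events(narrative):
        samples.extend(render_event(event))
    return samples
```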