WavJourney: Compositional Audio Creation with Large Language Models
- URL: http://arxiv.org/abs/2307.14335v2
- Date: Sun, 26 Nov 2023 14:12:37 GMT
- Title: WavJourney: Compositional Audio Creation with Large Language Models
- Authors: Xubo Liu, Zhongkai Zhu, Haohe Liu, Yi Yuan, Meng Cui, Qiushi Huang,
Jinhua Liang, Yin Cao, Qiuqiang Kong, Mark D. Plumbley, Wenwu Wang
- Abstract summary: We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite breakthroughs in audio generation models, their capabilities are
often confined to domain-specific conditions such as speech transcriptions and
audio captions. However, real-world audio creation aims to generate harmonious
audio containing various elements such as speech, music, and sound effects with
controllable conditions, which is challenging to address using existing audio
generation systems. We present WavJourney, a novel framework that leverages
Large Language Models (LLMs) to connect various audio models for audio
creation. WavJourney allows users to create storytelling audio content with
diverse audio elements simply from textual descriptions. Specifically, given a
text instruction, WavJourney first prompts LLMs to generate an audio script
that serves as a structured semantic representation of audio elements. The
audio script is then converted into a computer program, where each line of the
program calls a task-specific audio generation model or computational operation
function. The computer program is then executed to obtain a compositional and
interpretable solution for audio creation. Experimental results suggest that
WavJourney is capable of synthesizing realistic audio aligned with
textually-described semantic, spatial and temporal conditions, achieving
state-of-the-art results on text-to-audio generation benchmarks. Additionally,
we introduce a new multi-genre story benchmark. Subjective evaluations
demonstrate the potential of WavJourney in crafting engaging storytelling audio
content from text. We further demonstrate that WavJourney can facilitate
human-machine co-creation in multi-round dialogues. To foster future research,
the code and synthesized audio are available at:
https://audio-agi.github.io/WavJourney_demopage/.
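The abstract's pipeline (LLM-written audio script → line-by-line program calling task-specific models → execution) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the script schema, the element fields (`type`, `begin`, `length`), and the placeholder functions `text_to_speech`, `text_to_audio`, and `text_to_music` are all hypothetical stand-ins for the real generation models.

```python
import json

# Hypothetical audio script in the structured form the paper describes:
# each element declares its type (speech / music / sound effect),
# its content, and its timing. The schema is illustrative only.
AUDIO_SCRIPT = json.loads("""
[
  {"type": "speech", "text": "Welcome to the forest.", "begin": 0.0},
  {"type": "sound_effect", "description": "birds chirping", "begin": 0.0, "length": 5.0},
  {"type": "music", "description": "calm ambient pad", "begin": 1.0, "length": 4.0}
]
""")

# Placeholder "task-specific models": each returns a (start, duration, label)
# tuple instead of real audio, just to make the compositional flow concrete.
def text_to_speech(text, begin):
    # Assume roughly 0.5 s per word for the sketch.
    return (begin, 0.5 * len(text.split()), f"speech:{text}")

def text_to_audio(description, begin, length):
    return (begin, length, f"sfx:{description}")

def text_to_music(description, begin, length):
    return (begin, length, f"music:{description}")

# Each script element type maps to one model call, mirroring the idea that
# "each line of the program calls a task-specific audio generation model".
DISPATCH = {
    "speech": lambda e: text_to_speech(e["text"], e["begin"]),
    "sound_effect": lambda e: text_to_audio(e["description"], e["begin"], e["length"]),
    "music": lambda e: text_to_music(e["description"], e["begin"], e["length"]),
}

def compile_and_run(script):
    """Compile each script element into one model call and collect the tracks."""
    return [DISPATCH[e["type"]](e) for e in script]

tracks = compile_and_run(AUDIO_SCRIPT)
for begin, length, label in tracks:
    print(f"{begin:>4.1f}s +{length:.1f}s  {label}")
```

A real system would replace the placeholders with calls to actual TTS, text-to-audio, and text-to-music models and then mix the returned waveforms on a timeline; the point here is only the script-to-program dispatch structure.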
Related papers
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with generated audio.
arXiv Detail & Related papers (2024-10-04T11:40:53Z)
- Audiobox: Unified Audio Generation with Natural Language Prompts [37.39834044113061]
This paper presents Audiobox, a unified model based on flow-matching that is capable of generating various audio modalities.
We design description-based and example-based prompting to enhance controllability and unify speech and sound generation paradigms.
Audiobox sets new benchmarks on speech and sound generation and unlocks new methods for generating audio with novel vocal and acoustic styles.
arXiv Detail & Related papers (2023-12-25T22:24:49Z)
- LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT [65.69648099999439]
Generative Pre-trained Transformer (GPT) models have achieved remarkable performance on various natural language processing tasks.
We propose LauraGPT, a novel unified audio-and-text GPT-based LLM for audio recognition, understanding, and generation.
arXiv Detail & Related papers (2023-10-07T03:17:59Z)
- Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- AudioLM: a Language Modeling Approach to Audio Generation [59.19364975706805]
We introduce AudioLM, a framework for high-quality audio generation with long-term consistency.
We show how existing audio tokenizers provide different trade-offs between reconstruction quality and long-term structure.
We demonstrate how our approach extends beyond speech by generating coherent piano music continuations.
arXiv Detail & Related papers (2022-09-07T13:40:08Z)