Long-Form Text-to-Music Generation with Adaptive Prompts: A Case of Study in Tabletop Role-Playing Games Soundtracks
- URL: http://arxiv.org/abs/2411.03948v1
- Date: Wed, 06 Nov 2024 14:29:49 GMT
- Title: Long-Form Text-to-Music Generation with Adaptive Prompts: A Case of Study in Tabletop Role-Playing Games Soundtracks
- Authors: Felipe Marra, Lucas N. Ferreira
- Abstract summary: This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time.
We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text-to-music model.
- Score: 0.5524804393257919
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text-to-music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM-based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality, while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.
 
      
Related papers
- ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
 ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos.
Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions.
Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
 arXiv  Detail & Related papers  (2025-06-26T16:32:06Z)
- Audio-Agent: Leveraging LLMs For Audio Generation, Editing and Composition [72.22243595269389]
 We introduce Audio-Agent, a framework for audio generation, editing and composition based on text or video inputs.
In our method, we utilize a pre-trained TTA diffusion network as the audio generation agent to work in tandem with GPT-4.
For video-to-audio (VTA) tasks, most existing methods require training a timestamp detector to synchronize video events with the generated audio.
 arXiv  Detail & Related papers  (2024-10-04T11:40:53Z)
- Enriching Music Descriptions with a Finetuned-LLM and Metadata for Text-to-Music Retrieval [7.7464988473650935]
 Text-to-Music Retrieval plays a pivotal role in content discovery within extensive music databases.
This paper proposes an improved Text-to-Music Retrieval model, denoted as TTMR++.
 arXiv  Detail & Related papers  (2024-10-04T09:33:34Z)
- CoLLAP: Contrastive Long-form Language-Audio Pretraining with Musical Temporal Structure Augmentation [17.41880273107978]
 We propose Contrastive Long-form Language-Audio Pretraining (CoLLAP) to significantly extend the perception window for both the input audio (up to 5 minutes) and the language descriptions (exceeding 250 words).
We collect 51.3K audio-text pairs derived from the large-scale AudioSet training dataset, where the average audio length reaches 288 seconds.
 arXiv  Detail & Related papers  (2024-10-03T07:46:51Z)
- C3LLM: Conditional Multimodal Content Generation Using Large Language Models [66.11184017840688]
 We introduce C3LLM, a novel framework combining three tasks of video-to-audio, audio-to-text, and text-to-audio together.
C3LLM adapts the Large Language Model (LLM) structure as a bridge for aligning different modalities.
Our method combines the previous tasks of audio understanding, video-to-audio generation, and text-to-audio generation together into one unified model.
 arXiv  Detail & Related papers  (2024-05-25T09:10:12Z)
- Tango 2: Aligning Diffusion-based Text-to-Audio Generations through Direct Preference Optimization [70.13218512896032]
 Generation of audio from text prompts is an important aspect of such processes in the music and film industry.
Our hypothesis is that focusing on these aspects of audio generation could improve audio generation performance in the presence of limited data.
We synthetically create a preference dataset where each prompt has a winner audio output and some loser audio outputs for the diffusion model to learn from.
 arXiv  Detail & Related papers  (2024-04-15T17:31:22Z)
- WavJourney: Compositional Audio Creation with Large Language Models [38.39551216587242]
 We present WavJourney, a novel framework that leverages Large Language Models to connect various audio models for audio creation.
WavJourney allows users to create storytelling audio content with diverse audio elements simply from textual descriptions.
We show that WavJourney is capable of synthesizing realistic audio aligned with textually-described semantic, spatial and temporal conditions.
 arXiv  Detail & Related papers  (2023-07-26T17:54:04Z)
- LyricWhiz: Robust Multilingual Zero-shot Lyrics Transcription by Whispering to ChatGPT [48.28624219567131]
 We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription method.
We use Whisper, a weakly supervised robust speech recognition model, and GPT-4, today's most performant chat-based large language model.
Our experiments show that LyricWhiz significantly reduces Word Error Rate compared to existing methods in English.
 arXiv  Detail & Related papers  (2023-06-29T17:01:51Z)
- Noise2Music: Text-conditioned Music Generation with Diffusion Models [73.74580231353684]
 We introduce Noise2Music, where a series of diffusion models is trained to generate high-quality 30-second music clips from text prompts.
We find that the generated audio faithfully reflects key elements of the text prompt such as genre, tempo, instruments, mood, and era.
Pretrained large language models play a key role in this story -- they are used to generate paired text for the audio of the training set and to extract embeddings of the text prompts ingested by the diffusion models.
 arXiv  Detail & Related papers  (2023-02-08T07:27:27Z)
- MusicLM: Generating Music From Text [24.465880798449735]
 We introduce MusicLM, a model generating high-fidelity music from text descriptions.
MusicLM casts the process of conditional music generation as a hierarchical sequence-to-sequence modeling task.
Our experiments show that MusicLM outperforms previous systems both in audio quality and adherence to the text description.
 arXiv  Detail & Related papers  (2023-01-26T18:58:53Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
 We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
 arXiv  Detail & Related papers  (2022-09-30T10:17:05Z)
- Genre-conditioned Acoustic Models for Automatic Lyrics Transcription of Polyphonic Music [73.73045854068384]
 We propose to transcribe the lyrics of polyphonic music using a novel genre-conditioned network.
The proposed network adopts pre-trained model parameters, and incorporates the genre adapters between layers to capture different genre peculiarities for lyrics-genre pairs.
Our experiments show that the proposed genre-conditioned network outperforms the existing lyrics transcription systems.
 arXiv  Detail & Related papers  (2022-04-07T09:15:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     