MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
- URL: http://arxiv.org/abs/2309.08730v3
- Date: Tue, 2 Apr 2024 13:35:59 GMT
- Title: MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
- Authors: Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos
- Abstract summary: MusiLingo is a novel system for music caption generation and music-related query responses.
We train it on an extensive music caption dataset and fine-tune it with instructional data.
Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs.
- Score: 42.73982391253872
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. Our introduced dataset enables notable advancements beyond previous ones.
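The abstract describes the core architecture as a single projection layer that maps frozen MERT audio embeddings into the input space of a frozen LLM. Below is a minimal PyTorch sketch of that adapter idea; the module name and dimensions are assumptions for illustration, not the actual MusiLingo implementation.

```python
import torch
import torch.nn as nn

class MusicTextAdapter(nn.Module):
    """Sketch of a single projection layer bridging a frozen audio encoder
    (e.g. MERT) and a frozen LLM; names and sizes are assumptions."""

    def __init__(self, audio_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # The only trainable component: maps audio features to the LLM token space.
        self.proj = nn.Linear(audio_dim, llm_dim)

    def forward(self, audio_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, time, audio_dim) from the frozen MERT encoder.
        return self.proj(audio_feats)  # (batch, time, llm_dim)

# Usage: projected frames would be prepended to the text-prompt embeddings and
# fed to the frozen LLM, so gradients only flow through the projection.
adapter = MusicTextAdapter()
dummy_audio = torch.randn(2, 300, 1024)   # stand-in for MERT features
prefix = adapter(dummy_audio)
print(prefix.shape)                       # torch.Size([2, 300, 4096])
```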
Related papers
- Learning Musical Representations for Music Performance Question Answering [10.912207282129753]
Existing multimodal learning methods are incapable of dealing with fundamental problems within music performances.
Our primary backbone is designed to incorporate multimodal interactions within the context of music data.
Our experiments show state-of-the-art results on the Music AVQA datasets.
arXiv Detail & Related papers (2025-02-10T17:41:57Z)
- Can Impressions of Music be Extracted from Thumbnail Images? [20.605634973566573]
There is a scarcity of large-scale publicly available datasets consisting of music data and their corresponding natural language descriptions known as music captions.
We propose a method for generating music caption data that incorporates non-musical aspects inferred from music thumbnail images.
We created a dataset with approximately 360,000 captions containing non-musical aspects and trained a music retrieval model.
arXiv Detail & Related papers (2025-01-05T11:51:38Z)
- Enriching Music Descriptions with a Finetuned-LLM and Metadata for Text-to-Music Retrieval [7.7464988473650935]
Text-to-Music Retrieval plays a pivotal role in content discovery within extensive music databases.
This paper proposes an improved Text-to-Music Retrieval model, denoted as TTMR++.
arXiv Detail & Related papers (2024-10-04T09:33:34Z)
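Text-to-music retrieval, the task TTMR++ addresses, generally ranks tracks by similarity between a text-query embedding and audio embeddings in a shared space. The sketch below shows that generic retrieval step only; it is not TTMR++'s actual model, and the encoders producing the embeddings are assumed.

```python
import torch
import torch.nn.functional as F

def retrieve_tracks(text_emb, track_embs, k=5):
    """Generic text-to-music retrieval sketch (not TTMR++ specifically):
    rank tracks by cosine similarity between a query text embedding and
    precomputed track embeddings in a shared joint space."""
    text_emb = F.normalize(text_emb, dim=-1)        # (dim,)
    track_embs = F.normalize(track_embs, dim=-1)    # (num_tracks, dim)
    sims = track_embs @ text_emb                    # cosine similarities
    scores, indices = torch.topk(sims, k)
    return indices.tolist(), scores.tolist()

# Toy usage with random vectors standing in for real text/audio encoders.
query = torch.randn(512)
catalog = torch.randn(1000, 512)
top_ids, top_scores = retrieve_tracks(query, catalog)
```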
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework, providing annotations across multiple modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained training of large multimodal language models.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation [18.12051302437043]
We propose a model equipped with fine-grained music understanding capabilities through learning from generative augmentation with temporal compositions.
We leverage existing music caption datasets and large language models (LLMs) to synthesize fine-grained music captions with structural descriptions and time boundaries for full-length songs.
arXiv Detail & Related papers (2024-07-29T22:53:32Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
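MuPT's SMT-ABC Notation targets misaligned measures across tracks. The snippet below is only a hypothetical illustration of the general idea of keeping measures aligned by interleaving ABC tracks bar by bar; the actual SMT-ABC serialization is defined in the paper and may differ.

```python
def interleave_abc_tracks(tracks):
    """Hypothetical bar-by-bar interleaving of multiple ABC tracks so that
    corresponding measures stay adjacent in the token stream (illustrative,
    not the exact SMT-ABC format)."""
    # Split each track into bars on the ABC barline symbol '|'.
    per_track_bars = [[b.strip() for b in t.split("|") if b.strip()] for t in tracks]
    n_bars = min(len(bars) for bars in per_track_bars)
    chunks = []
    for i in range(n_bars):
        # Emit bar i of every voice before moving to bar i+1.
        for v, bars in enumerate(per_track_bars):
            chunks.append(f"[V:{v + 1}] {bars[i]} |")
    return " ".join(chunks)

# Toy two-track example (melody and bass) in simplified ABC.
melody = "C D E F | G A B c |"
bass   = "C,2 G,2 | C,2 G,2 |"
print(interleave_abc_tracks([melody, bass]))
```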
- SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation [88.33522730306674]
SongComposer can understand and generate melodies and lyrics in symbolic song representations.
We resort to symbolic song representation, a mature and efficient format that humans designed for music.
With extensive experiments, SongComposer demonstrates superior performance in lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation.
arXiv Detail & Related papers (2024-02-27T16:15:28Z)
- Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning [37.76488341368786]
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions.
We propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files.
We present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA dataset.
arXiv Detail & Related papers (2023-08-22T08:43:33Z)
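MU-LLaMA builds its MusicQA dataset by generating question-answer pairs from existing caption datasets, a strategy similar in spirit to MusiLingo's MusicInstruct. Below is a minimal sketch of what such caption-to-Q&A conversion could look like; the prompt wording and helper names are hypothetical, not the papers' actual pipeline.

```python
def build_qa_prompt(caption, n_pairs=3):
    """Hypothetical prompt for turning a music caption into Q&A pairs."""
    return (
        "Below is a description of a piece of music.\n"
        f"Description: {caption}\n"
        f"Write {n_pairs} question-answer pairs about the music, "
        "one per line, formatted as 'Q: ... | A: ...'."
    )

def parse_qa(raw):
    """Parse 'Q: ... | A: ...' lines returned by the language model."""
    pairs = []
    for line in raw.splitlines():
        if line.startswith("Q:") and "| A:" in line:
            q, a = line.split("| A:", 1)
            pairs.append((q[2:].strip(), a.strip()))
    return pairs

caption = "A slow acoustic guitar ballad with soft female vocals and light percussion."
prompt = build_qa_prompt(caption)
# `prompt` would be sent to any instruction-tuned LLM; parse_qa() handles the reply.
```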
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
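A common way to assess frozen representations, as benchmarks like MARBLE do, is to train only a shallow probe on top of pre-trained features. The sketch below shows a standard linear-probe setup; whether this matches MARBLE's exact protocol is an assumption.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Sketch of a standard probing protocol for frozen music representations:
    a linear classifier on time-pooled features (illustrative only)."""

    def __init__(self, feat_dim: int, n_classes: int):
        super().__init__()
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) from a frozen pre-trained encoder.
        pooled = feats.mean(dim=1)      # average over time
        return self.head(pooled)        # task logits, e.g. genre labels

probe = LinearProbe(feat_dim=768, n_classes=10)
logits = probe(torch.randn(4, 200, 768))   # stand-in encoder outputs
```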
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen consists of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
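The "token interleaving patterns" refer to how the parallel codebook streams of a neural audio codec are flattened so a single-stage LM can model them. The sketch below illustrates a delay-style interleaving, one of the patterns discussed in the MusicGen paper; the exact padding and ordering details here are illustrative assumptions.

```python
def delay_interleave(codes, pad=-1):
    """Sketch of a 'delay'-style interleaving of parallel codebook streams:
    codebook k is shifted right by k steps, so at each position the model
    predicts one token per codebook (illustrative, not MusicGen's exact code)."""
    n_codebooks = len(codes)
    n_frames = len(codes[0])
    total = n_frames + n_codebooks - 1
    interleaved = []
    for t in range(total):
        step = []
        for k in range(n_codebooks):
            src = t - k  # codebook k is delayed by k frames
            step.append(codes[k][src] if 0 <= src < n_frames else pad)
        interleaved.append(step)
    return interleaved

# Toy example: 3 codebooks, 4 frames of discrete codec tokens.
codes = [[11, 12, 13, 14], [21, 22, 23, 24], [31, 32, 33, 34]]
for row in delay_interleave(codes):
    print(row)
```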