MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
- URL: http://arxiv.org/abs/2309.08730v3
- Date: Tue, 2 Apr 2024 13:35:59 GMT
- Title: MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response
- Authors: Zihao Deng, Yinghao Ma, Yudong Liu, Rongchen Guo, Ge Zhang, Wenhu Chen, Wenhao Huang, Emmanouil Benetos
- Abstract summary: MusiLingo is a novel system for music caption generation and music-related query responses.
We train it on an extensive music caption dataset and fine-tune it with instructional data.
Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs.
- Score: 42.73982391253872
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have shown immense potential in multimodal applications, yet the convergence of the textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Owing to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate competitive performance in generating music captions and composing music-related Q&A pairs, and the introduced dataset enables notable advances over previous ones.
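As a rough illustration of the architecture described in the abstract, the sketch below wires a single trainable projection between a frozen audio encoder and a frozen LLM. The dimensions (1024 for MERT, 4096 for the LLM) and all names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class MusicTextBridge(nn.Module):
    """Sketch of MusiLingo-style alignment: one trainable linear projection
    maps frozen MERT audio features into the frozen LLM's embedding space.
    Dimensions are assumptions for illustration."""

    def __init__(self, mert_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # The only trainable component in this sketch: a single linear layer.
        self.projection = nn.Linear(mert_dim, llm_dim)

    def forward(self, mert_features: torch.Tensor) -> torch.Tensor:
        # mert_features: (batch, time, mert_dim) from the frozen MERT encoder.
        # Returns (batch, time, llm_dim) pseudo-token embeddings that are
        # prepended to the text prompt before it enters the frozen LLM.
        return self.projection(mert_features)
```

Per the abstract, both the MERT encoder and the LLM stay frozen, so gradients would flow only into this projection during caption pre-training and instruction fine-tuning.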
Related papers
- MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions [69.9122231800796]
We present MMTrail, a large-scale multi-modality video-language dataset incorporating more than 20M trailer clips with visual captions.
We propose a systematic captioning framework that produces annotations across modalities for more than 27.1k hours of trailer videos.
Our dataset potentially paves the way for fine-grained training of large multimodal language models.
arXiv Detail & Related papers (2024-07-30T16:43:24Z)
- Futga: Towards Fine-grained Music Understanding through Temporally-enhanced Generative Augmentation [18.12051302437043]
We propose a model equipped with fine-grained music understanding capabilities, learned through generative augmentation with temporal compositions.
We leverage existing music caption datasets and large language models (LLMs) to synthesize fine-grained music captions with structural descriptions and time boundaries for full-length songs.
arXiv Detail & Related papers (2024-07-29T22:53:32Z)
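The Futga entry above hinges on LLM-driven caption synthesis; the hypothetical sketch below shows one way such a prompt could be assembled from segment-level captions with time boundaries. The template and segment schema are our assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of the described augmentation: prompt an LLM to merge
# section-level captions into one fine-grained caption with time boundaries.
# The prompt template and `segments` schema are assumptions.

def build_synthesis_prompt(segments):
    """segments: list of (start_sec, end_sec, caption) for one full-length song."""
    lines = [f"[{s:.0f}s-{e:.0f}s] {cap}" for s, e, cap in segments]
    return (
        "Combine the following segment descriptions into a single fine-grained "
        "music caption that keeps the structural labels and time boundaries:\n"
        + "\n".join(lines)
    )

prompt = build_synthesis_prompt([
    (0, 30, "Soft piano intro with sparse strings."),
    (30, 90, "Verse: breathy vocals over a syncopated drum groove."),
    (90, 120, "Chorus: full band, brighter harmony, layered backing vocals."),
])
# `prompt` would then be sent to an LLM to synthesize the full-song caption.
```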
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
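For intuition about the measure synchronization SMT-ABC targets, here is a toy sketch that interleaves bars across tracks so that bar i of every voice is emitted before bar i+1 of any voice; the delimiters and padding convention are assumptions, not the paper's token vocabulary.

```python
# Toy illustration of measure-level synchronization in the spirit of SMT-ABC:
# bars from all voices are interleaved bar by bar, so the LM never has to
# generate distant, misaligned measures. Markers like "<bar>" are assumptions.

def interleave_tracks(tracks):
    """tracks: list of ABC voices, each a list of bar strings."""
    n_bars = max(len(t) for t in tracks)
    out = []
    for i in range(n_bars):
        for v, track in enumerate(tracks):
            bar = track[i] if i < len(track) else "z4"  # pad short voices with a rest
            out.append(f"[V:{v + 1}] {bar} |")
        out.append("<bar>")  # assumed synchronization marker between bar groups
    return " ".join(out)

melody = ["C2 E2 G2 c2", "B2 G2 E2 C2"]
bass = ["C,4 G,4", "C,8"]
print(interleave_tracks([melody, bass]))
```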
- SongComposer: A Large Language Model for Lyric and Melody Composition in Song Generation [88.33522730306674]
SongComposer can understand and generate melodies and lyrics in symbolic song representations.
We adopt symbolic song representation, a mature and efficient format that humans designed for music.
With extensive experiments, SongComposer demonstrates superior performance in lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation.
arXiv Detail & Related papers (2024-02-27T16:15:28Z)
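As a toy illustration of a symbolic song representation in the spirit of SongComposer, the sketch below pairs lyric syllables with pitch and duration and flattens them into a token stream; the field names and serialization format are invented for illustration, not the paper's scheme.

```python
# Invented sketch of a lyric-melody-aligned symbolic representation: each note
# carries its lyric syllable, pitch, and duration, serialized so a language
# model can consume and emit it. All formats here are assumptions.

from dataclasses import dataclass

@dataclass
class SungNote:
    syllable: str    # lyric token aligned to this note
    pitch: str       # scientific pitch notation, e.g. "C4"
    duration: float  # length in beats

def serialize(notes):
    # Flatten aligned (lyric, pitch, duration) triples into a token stream.
    return " | ".join(f"{n.syllable} <{n.pitch}> <{n.duration}>" for n in notes)

line = [SungNote("twin", "C4", 1.0), SungNote("kle", "C4", 1.0),
        SungNote("twin", "G4", 1.0), SungNote("kle", "G4", 1.0)]
print(serialize(line))  # twin <C4> <1.0> | kle <C4> <1.0> | ...
```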
- LLark: A Multimodal Instruction-Following Language Model for Music [7.7033394687966865]
Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand.
We present LLark, an instruction-tuned multimodal model for music understanding.
arXiv Detail & Related papers (2023-10-11T03:12:47Z)
- Music Understanding LLaMA: Advancing Text-to-Music Generation with Question Answering and Captioning [37.76488341368786]
Text-to-music generation (T2M-Gen) faces a major obstacle due to the scarcity of large-scale publicly available music datasets with natural language captions.
We propose the Music Understanding LLaMA (MU-LLaMA), capable of answering music-related questions and generating captions for music files.
We present a methodology for generating question-answer pairs from existing audio captioning datasets and introduce the MusicQA dataset.
arXiv Detail & Related papers (2023-08-22T08:43:33Z)
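The MusicQA construction above turns captioning data into Q&A supervision; a hedged sketch of that idea follows. The question templates are our assumptions rather than the paper's exact prompts, and in the actual pipeline an LLM generates question-specific answers instead of reusing the caption verbatim.

```python
# Hedged sketch of converting an audio-captioning dataset into Q&A pairs,
# loosely following the MusicQA idea. Templates and schema are assumptions.

QUESTION_TEMPLATES = [
    "Describe the audio.",
    "What genre and mood does this music convey?",
    "What instruments can you hear in this clip?",
]

def caption_to_qa(audio_id: str, caption: str):
    # Pair each template question with the caption as a reference answer;
    # a real pipeline would have an LLM tailor the answer to each question.
    return [
        {"audio": audio_id, "question": q, "answer": caption}
        for q in QUESTION_TEMPLATES
    ]

pairs = caption_to_qa("clip_0001.wav", "An upbeat funk groove with slap bass and horns.")
print(pairs[0])
```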
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks over 8 publicly available datasets, providing a fair and standard assessment of the representations of all open-source pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
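To make the "unified protocol" concrete, below is a minimal sketch of how a frozen representation could be scored with a lightweight probe across tasks; the probe choice, metric, and task schema are illustrative assumptions, not MARBLE's exact configuration.

```python
# Minimal sketch of a MARBLE-style evaluation loop: freeze a pre-trained
# audio model, embed each task's data, and fit a lightweight probe per task.
# The logistic-regression probe and accuracy metric are assumptions; the
# benchmark also uses task-specific metrics.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

def evaluate_representation(embed_fn, tasks):
    """embed_fn: frozen model mapping a waveform to a fixed-size vector.
    tasks: dict name -> (train_wavs, train_labels, test_wavs, test_labels)."""
    results = {}
    for name, (Xtr, ytr, Xte, yte) in tasks.items():
        probe = LogisticRegression(max_iter=1000)
        probe.fit(np.stack([embed_fn(w) for w in Xtr]), ytr)
        preds = probe.predict(np.stack([embed_fn(w) for w in Xte]))
        results[name] = accuracy_score(yte, preds)
    return results
```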
- Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
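The "efficient token interleaving patterns" mentioned above can be illustrated with the delay pattern commonly associated with MusicGen-style single-stage LMs: each codebook stream is shifted by its index so one flattened sequence covers all streams. The padding value below is an assumption.

```python
# Sketch of the "delay" interleaving pattern: codebook k is shifted right by
# k steps, so the K tokens of a given frame are spread across K successive
# positions instead of being predicted simultaneously.

import numpy as np

def delay_interleave(codes: np.ndarray, pad: int = -1) -> np.ndarray:
    """codes: (K, T) discrete tokens from K codebooks over T frames.
    Returns (K, T + K - 1) with codebook k delayed by k steps."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

codes = np.arange(8).reshape(2, 4)   # toy example: K=2 codebooks, T=4 frames
print(delay_interleave(codes))
```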
- Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning.
Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences.
arXiv Detail & Related papers (2022-08-25T16:55:15Z)
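The dual-encoder alignment MusCALL describes is typically trained with a symmetric contrastive objective; the sketch below shows a CLIP-style InfoNCE loss over a batch of matched audio-text pairs, with the encoders stubbed out and the temperature value as an assumption.

```python
# Compact sketch of dual-encoder contrastive training: matched (audio, text)
# pairs sit on the diagonal of the similarity matrix and are pulled together,
# while off-diagonal mismatches are pushed apart.

import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature: float = 0.07):
    # audio_emb, text_emb: (batch, dim) L2-normalized embeddings.
    logits = audio_emb @ text_emb.t() / temperature  # (batch, batch) similarities
    targets = torch.arange(audio_emb.size(0))        # diagonal = true pairs
    # Symmetric cross-entropy over audio->text and text->audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

a = F.normalize(torch.randn(8, 512), dim=-1)  # stand-in audio embeddings
t = F.normalize(torch.randn(8, 512), dim=-1)  # stand-in text embeddings
print(contrastive_loss(a, t).item())
```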