Related papers: LLark: A Multimodal Instruction-Following Language Model for Music

URL: http://arxiv.org/abs/2310.07160v3
Date: Mon, 3 Jun 2024 03:35:01 GMT
Title: LLark: A Multimodal Instruction-Following Language Model for Music
Authors: Josh Gardner, Simon Durand, Daniel Stoller, Rachel M. Bittner,
Abstract summary: Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand. We present LLark, an instruction-tuned multimodal model for music understanding.
Score: 7.7033394687966865
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Music has a unique and complex structure which is challenging for both expert humans and existing AI systems to understand, and presents unique challenges relative to other forms of audio. We present LLark, an instruction-tuned multimodal model for music understanding. We detail our process for dataset creation, which involves augmenting the annotations of diverse open-source music datasets and converting them to a unified instruction-tuning format. We propose a multimodal architecture for LLark, integrating a pretrained generative model for music with a pretrained language model. In evaluations on three types of tasks (music understanding, captioning, reasoning), we show that LLark matches or outperforms existing baselines in music understanding, and that humans show a high degree of agreement with its responses in captioning and reasoning tasks. LLark is trained entirely from open-source music data and models, and we make our training code available along with the release of this paper. Additional results and audio examples are at https://bit.ly/llark, and our source code is available at https://github.com/spotify-research/llark.

Related papers

Muse: Towards Reproducible Long-Form Song Generation with Fine-Grained Style Control [66.46754271097555]
We release a fully open-source system for long-form song generation with fine-grained style conditioning. The dataset consists of 116k fully licensed synthetic songs with automatically generated lyrics and style descriptions. We train Muse via single-stage supervised finetuning of a Qwen-based language model extended with discrete audio tokens.
arXiv Detail & Related papers (2026-01-07T14:40:48Z)
Music Flamingo: Scaling Music Understanding in Audio Language Models [98.94537017112704]
Music Flamingo is a novel large audio-language model designed to advance music understanding in foundational audio models. MF-Skills is a dataset labeled through a multi-stage pipeline that yields rich captions and question-answer pairs covering harmony, structure, timbre, lyrics, and cultural context. We introduce a post-training recipe: we first cold-start with MF-Think, a novel chain-of-thought dataset grounded in music theory, followed by GRPO-based reinforcement learning with custom rewards.
arXiv Detail & Related papers (2025-11-13T13:21:09Z)
Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning [69.78158549955384]
We introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions. This approach generates verifiable sheet music questions in both textual and visual modalities. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music.
arXiv Detail & Related papers (2025-09-04T09:42:17Z)
Advancing the Foundation Model for Music Understanding [9.210248657997687]
We introduce a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content. We also propose a new benchmark for multi-faceted music understanding called MuCUE.
arXiv Detail & Related papers (2025-08-02T03:33:47Z)
ThinkSound: Chain-of-Thought Reasoning in Multimodal Large Language Models for Audio Generation and Editing [52.33281620699459]
ThinkSound is a novel framework that leverages Chain-of-Thought (CoT) reasoning to enable stepwise, interactive audio generation and editing for videos. Our approach decomposes the process into three complementary stages: semantically coherent, interactive object-centric refinement through precise user interactions, and targeted editing guided by natural language instructions. Experiments demonstrate that ThinkSound achieves state-of-the-art performance in video-to-audio generation across both audio metrics and CoT metrics.
arXiv Detail & Related papers (2025-06-26T16:32:06Z)
Kimi-Audio Technical Report [67.69331679172303]
Kimi-Audio is an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation.
arXiv Detail & Related papers (2025-04-25T15:31:46Z)
UniMuMo: Unified Text, Music and Motion Generation [57.72514622935806]
We introduce UniMuMo, a unified multimodal model capable of taking arbitrary text, music, and motion data as input conditions to generate outputs across all three modalities. By converting music, motion, and text into token-based representation, our model bridges these modalities through a unified encoder-decoder transformer architecture.
arXiv Detail & Related papers (2024-10-06T16:04:05Z)
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models [11.834712543531756]
MuChoMusic is a benchmark for evaluating music understanding in multimodal language models focused on audio. It comprises 1,187 multiple-choice questions, all validated by human annotators, on 644 music tracks sourced from two publicly available music datasets. We evaluate five open-source models and identify several pitfalls, including an over-reliance on the language modality.
arXiv Detail & Related papers (2024-08-02T15:34:05Z)
MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music. To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation). Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [7.4327407361824935]
We present WikiMuTe, a new and open dataset containing rich semantic descriptions of music. The data is sourced from Wikipedia's rich catalogue of articles covering musical works. We train a model that jointly learns text and audio representations and performs cross-modal retrieval.
arXiv Detail & Related papers (2023-12-14T18:38:02Z)
MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response [42.73982391253872]
MusiLingo is a novel system for music caption generation and music-related query responses. We train it on an extensive music caption dataset and fine-tune it with instructional data. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs.
arXiv Detail & Related papers (2023-09-15T19:31:40Z)
MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE. It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description. We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
Simple and Controllable Music Generation [94.61958781346176]
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens. Unlike prior work, MusicGen is comprised of a single-stage transformer LM together with efficient token interleaving patterns.
arXiv Detail & Related papers (2023-06-08T15:31:05Z)
AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head [82.69233563811487]
Large language models (LLMs) have exhibited remarkable capabilities across a variety of domains and tasks, challenging our understanding of learning and cognition. We propose a multi-modal AI system named AudioGPT, which complements LLMs with foundation models to process complex audio information.
arXiv Detail & Related papers (2023-04-25T17:05:38Z)
Contrastive Audio-Language Learning for Music [13.699088044513562]
MusCALL is a framework for Music Contrastive Audio-Language Learning. Our approach consists of a dual-encoder architecture that learns the alignment between pairs of music audio and descriptive sentences.
arXiv Detail & Related papers (2022-08-25T16:55:15Z)
Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks. Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track. We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap. We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.