Advancing the Foundation Model for Music Understanding
- URL: http://arxiv.org/abs/2508.01178v1
- Date: Sat, 02 Aug 2025 03:33:47 GMT
- Title: Advancing the Foundation Model for Music Understanding
- Authors: Yi Jiang, Wei Wang, Xianwen Guo, Huiyun Liu, Hanrui Wang, Youri Xu, Haoqi Gu, Zhongqian Xie, Chuanjiang Luo,
- Abstract summary: We introduce a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content. We also propose a new benchmark for multi-faceted music understanding called MuCUE.
- Score: 9.210248657997687
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of Music Information Retrieval (MIR) is fragmented, with specialized models excelling at isolated tasks. In this work, we challenge this paradigm by introducing a unified foundation model named MuFun for holistic music understanding. Our model features a novel architecture that jointly processes instrumental and lyrical content, and is trained on a large-scale dataset covering diverse tasks such as genre classification, music tagging, and question answering. To facilitate robust evaluation, we also propose a new benchmark for multi-faceted music understanding called MuCUE (Music Comprehensive Understanding Evaluation). Experiments show our model significantly outperforms existing audio large language models across the MuCUE tasks, demonstrating its state-of-the-art effectiveness and generalization ability.
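The abstract does not detail the fusion mechanism, so the following is only a minimal, hypothetical PyTorch sketch of one way a model could jointly process instrumental audio and lyric tokens in a single sequence. The class name, layer choices, and dimensions (JointMusicModel, the Conv1d audio projection, the Transformer stand-in for the language-model backbone) are assumptions for illustration, not the published MuFun architecture.

```python
# Hypothetical sketch of joint audio + lyrics processing; NOT the published MuFun model.
import torch
import torch.nn as nn

class JointMusicModel(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512):
        super().__init__()
        # Audio branch: project frame-level features (assumed 128-dim, e.g. log-mel) to d_model.
        self.audio_proj = nn.Conv1d(in_channels=128, out_channels=d_model, kernel_size=3, padding=1)
        # Lyrics branch: token embeddings.
        self.lyric_emb = nn.Embedding(vocab_size, d_model)
        # Shared sequence model standing in for the LLM backbone.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab_size)  # e.g. answer-token prediction

    def forward(self, mel, lyric_ids):
        # mel: (batch, 128, frames); lyric_ids: (batch, tokens)
        audio_tokens = self.audio_proj(mel).transpose(1, 2)     # (batch, frames, d_model)
        lyric_tokens = self.lyric_emb(lyric_ids)                # (batch, tokens, d_model)
        fused = torch.cat([audio_tokens, lyric_tokens], dim=1)  # one joint sequence
        return self.head(self.backbone(fused))

model = JointMusicModel()
logits = model(torch.randn(2, 128, 200), torch.randint(0, 32000, (2, 16)))
print(logits.shape)  # (2, 216, 32000)
```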
Related papers
- TOMI: Transforming and Organizing Music Ideas for Multi-Track Compositions with Full-Song Structure [8.721294663967305]
We introduce TOMI (Transforming and Organizing Music Ideas) as a novel approach in deep music generation. We represent a multi-track composition process via a sparse, four-dimensional space characterized by clips (short audio or MIDI segments), sections (temporal positions), tracks (instrument layers) and transformations. Our model is capable of generating multi-track electronic music with full-song structure, and we further integrate the TOMI-based model with the REAPER digital audio workstation.
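A rough sketch of how the clip/section/track/transformation space described above could be represented as plain data; the class and field names here are illustrative assumptions, not TOMI's actual implementation.

```python
# Illustrative-only data structure for a (clip, section, track, transformation) space.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Clip:
    source: str   # path to a short audio or MIDI segment
    kind: str     # "audio" or "midi"

@dataclass
class Placement:
    clip_id: int
    # transformation applied to the clip at this position, e.g. {"pitch_shift": 2, "gain_db": -3}
    transform: Dict[str, float] = field(default_factory=dict)

@dataclass
class Composition:
    clips: List[Clip] = field(default_factory=list)
    # (section, track) -> placements: which clips play where, and how they are transformed
    grid: Dict[Tuple[str, str], List[Placement]] = field(default_factory=dict)

song = Composition(clips=[Clip("kick_loop.mid", "midi")])
song.grid[("chorus", "drums")] = [Placement(clip_id=0, transform={"gain_db": -1.5})]
print(song.grid)
```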
arXiv Detail & Related papers (2025-06-29T05:15:41Z)
- Universal Music Representations? Evaluating Foundation Models on World Music Corpora [65.72891334156706]
Foundation models have revolutionized music information retrieval, but questions remain about their ability to generalize. This paper presents a comprehensive evaluation of five state-of-the-art audio foundation models across six musical corpora.
arXiv Detail & Related papers (2025-06-20T15:06:44Z)
- Music's Multimodal Complexity in AVQA: Why We Need More than General Multimodal LLMs [24.215093830868813]
Music Audio-Visual Question Answering presents unique challenges with its continuous, densely layered audio-visual content. This paper identifies that specialized input processing, architectures incorporating dedicated spatial-temporal designs, and music-specific modeling strategies are critical for success in this domain.
arXiv Detail & Related papers (2025-05-27T02:31:24Z)
- A Survey of Foundation Models for Music Understanding [60.83532699497597]
This work is one of the early reviews of the intersection of AI techniques and music understanding.
We investigated, analyzed, and tested recent large-scale music foundation models with respect to their music comprehension abilities.
arXiv Detail & Related papers (2024-09-15T03:34:14Z)
- Foundation Models for Music: A Survey [77.77088584651268]
Foundation models (FMs) have profoundly impacted diverse sectors, including music.
This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music.
arXiv Detail & Related papers (2024-08-26T15:13:14Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
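The abstract only names SMT-ABC; as a rough illustration of the underlying idea (keeping measures from different tracks aligned bar by bar), here is a small, assumption-laden sketch that interleaves per-bar ABC fragments across voices. It is not the paper's actual serialization format.

```python
# Minimal sketch of bar-aligned interleaving of multi-track ABC fragments.
# The real SMT-ABC format is more involved; this only illustrates the alignment idea.
def interleave_tracks(tracks):
    """tracks: dict of voice name -> list of per-bar ABC strings (equal lengths assumed)."""
    n_bars = min(len(bars) for bars in tracks.values())
    out = []
    for i in range(n_bars):
        # Emit the same bar index from every voice before moving on,
        # so measures from different tracks stay synchronized.
        for voice, bars in tracks.items():
            out.append(f"[V:{voice}] {bars[i]} |")
    return "\n".join(out)

print(interleave_tracks({
    "1": ["C D E F", "G A B c"],
    "2": ["C,2 E,2", "G,2 C2"],
}))
```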
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training [74.32603591331718]
We propose an acoustic Music undERstanding model with large-scale self-supervised Training (MERT), which incorporates teacher models to provide pseudo labels in the masked language modelling (MLM) style acoustic pre-training. Experimental results indicate that our model can generalise and perform well on 14 music understanding tasks and attain state-of-the-art (SOTA) overall scores.
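A schematic masked-prediction step in the spirit of MLM-style acoustic pre-training, where a frozen "teacher" supplies pseudo-labels for masked frames. All module choices here (feature sizes, the linear teacher, the loss) are placeholders for illustration, not MERT's actual components.

```python
# Toy masked acoustic pre-training step with teacher-provided pseudo-labels (not MERT itself).
import torch
import torch.nn as nn

frames = torch.randn(4, 250, 128)          # (batch, time, feature) acoustic features
teacher = nn.Linear(128, 500)              # stand-in teacher that yields pseudo-label logits
student = nn.Sequential(nn.Linear(128, 256), nn.GELU(), nn.Linear(256, 500))

with torch.no_grad():
    pseudo = teacher(frames).argmax(-1)    # discrete pseudo-labels per frame

mask = torch.rand(frames.shape[:2]) < 0.3  # mask roughly 30% of frames
masked = frames.masked_fill(mask.unsqueeze(-1), 0.0)

logits = student(masked)
# the student is trained to recover the teacher's labels at the masked positions only
loss = nn.functional.cross_entropy(logits[mask], pseudo[mask])
loss.backward()
print(float(loss))
```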
arXiv Detail & Related papers (2023-05-31T18:27:43Z)
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
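For readers unfamiliar with the two reported metrics, the sketch below shows one plausible way to compute them on tokenized symbolic-music strings; the tokenization, smoothing, and normalization are assumptions and may differ from the paper's exact protocol.

```python
# Illustrative computation of BLEU and a normalized edit-distance similarity (assumed setup).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

def levenshtein(a, b):
    # classic dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

ref = "X:1 T:Example K:C | C D E F | G A B c |".split()
hyp = "X:1 T:Example K:C | C D E G | G A B c |".split()

bleu = sentence_bleu([ref], hyp, smoothing_function=SmoothingFunction().method1)
edit_sim = 1 - levenshtein(hyp, ref) / max(len(ref), len(hyp))
print(f"BLEU={bleu:.3f}  edit-distance similarity={edit_sim:.3f}")
```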
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
- Modeling Musical Structure with Artificial Neural Networks [0.0]
I explore the application of artificial neural networks to different aspects of musical structure modeling.
I show how a connectionist model, the Gated Autoencoder (GAE), can be employed to learn transformations between musical fragments.
I propose a special predictive training of the GAE, which yields a representation of polyphonic music as a sequence of intervals.
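As a rough illustration of the gated-autoencoder idea (relating two fragments through multiplicative interactions so that the learned code captures the transformation between them, e.g. intervals, rather than absolute pitches), here is a small NumPy sketch; the dimensions, random weights, and reconstruction rule are assumptions, not the thesis's trained model.

```python
# Toy gated-autoencoder-style mapping unit relating two musical fragments (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d, f, m = 88, 64, 32                            # piano-roll size, factors, mapping units
Wx, Wy = rng.normal(size=(f, d)), rng.normal(size=(f, d))
Wm = rng.normal(size=(m, f))

x = rng.integers(0, 2, size=d).astype(float)    # fragment at time t (binary piano-roll slice)
y = np.roll(x, 2)                               # fragment at time t+1 (here: a transposition)

# Mapping code: multiplicative interaction of the two factored projections.
mapping = 1 / (1 + np.exp(-Wm @ (Wx @ x * Wy @ y)))

# Reconstruct y from x and the mapping code, i.e. a relative (interval-like) encoding.
y_hat = Wy.T @ (Wm.T @ mapping * Wx @ x)
print(mapping.shape, y_hat.shape)
```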
arXiv Detail & Related papers (2020-01-06T18:35:57Z)