Learning Musical Representations for Music Performance Question Answering
- URL: http://arxiv.org/abs/2502.06710v1
- Date: Mon, 10 Feb 2025 17:41:57 GMT
- Title: Learning Musical Representations for Music Performance Question Answering
- Authors: Xingjian Diao, Chunhui Zhang, Tingxuan Wu, Ming Cheng, Zhongyu Ouyang, Weiyi Wu, Jiang Gui
- Abstract summary: Existing multimodal learning methods are incapable of dealing with fundamental problems within music performances.
Our primary backbone is designed to incorporate multimodal interactions within the context of music data.
Our experiments show state-of-the-art results on the Music AVQA datasets.
- Abstract: Music performances are representative scenarios for audio-visual modeling. Unlike common scenarios with sparse audio, music performances continuously involve dense audio signals throughout. While existing multimodal learning methods for audio-visual question answering (AVQA) demonstrate impressive capabilities in general scenarios, they are incapable of dealing with fundamental problems within music performances: they underexplore the interaction between the multimodal signals in performance, and they fail to consider the distinctive characteristics of instruments and music. Therefore, existing methods tend to answer questions regarding musical performances inaccurately. To bridge the above research gaps, (i) given the intricate multimodal interconnectivity inherent to music data, our primary backbone is designed to incorporate multimodal interactions within the context of music; (ii) to enable the model to learn music characteristics, we annotate and release rhythmic and music sources in the current music datasets; (iii) for time-aware audio-visual modeling, we align the model's music predictions with the temporal dimension. Our experiments achieve state-of-the-art performance on the Music AVQA datasets. Our code is available at https://github.com/xid32/Amuse.
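As a rough illustration of points (i) and (iii), cross-modal interaction plus temporally aligned prediction, the PyTorch sketch below fuses dense audio features with video frame features via bidirectional cross-attention and emits per-timestep logits. All module names, dimensions, and the answer head are illustrative assumptions, not the actual Amuse implementation (see the linked repository for that).

```python
# Illustrative only: bidirectional cross-attention fusion with a
# per-timestep head, so predictions stay aligned with the audio timeline.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4, num_answers: int = 42):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.head = nn.Linear(2 * dim, num_answers)  # answer vocabulary size is assumed

    def forward(self, audio: torch.Tensor, video: torch.Tensor) -> torch.Tensor:
        # audio, video: (B, T, dim) time-aligned feature sequences
        a_ctx, _ = self.a2v(audio, video, video)   # audio queries attend to video
        v_ctx, _ = self.v2a(video, audio, audio)   # video queries attend to audio
        fused = torch.cat([a_ctx, v_ctx], dim=-1)  # (B, T, 2 * dim)
        return self.head(fused)                    # per-timestep answer logits

logits = AudioVisualFusion()(torch.randn(2, 100, 256), torch.randn(2, 100, 256))
```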
Related papers
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation)
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
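As a toy illustration of the measure-synchronization idea behind SMT-ABC, the sketch below interleaves multi-track ABC notation bar by bar so that the same bar of every voice stays adjacent in the token stream. The voice tags, bar delimiter, and rest-filling rule are assumptions for illustration, not MuPT's exact tokenization.

```python
# Illustrative only: interleave ABC-notation tracks measure by measure.
def synchronize_tracks(tracks: list[str], bar_sep: str = "|") -> str:
    split = [t.strip(bar_sep).split(bar_sep) for t in tracks]
    n_bars = max(len(bars) for bars in split)
    out = []
    for i in range(n_bars):
        for voice, bars in enumerate(split):
            bar = bars[i] if i < len(bars) else "z8"  # rest-fill short tracks (assumed)
            out.append(f"[V:{voice + 1}]{bar}")
        out.append(bar_sep)
    return "".join(out)

# Two toy voices, two bars each:
print(synchronize_tracks(["C2E2G2c2|c2G2E2C2|", "C,8|G,8|"]))
```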
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- Music Auto-Tagging with Robust Music Representation Learned via Domain Adversarial Training [18.71152526968065]
Existing models in Music Information Retrieval (MIR) struggle with real-world noise such as environmental and speech sounds in multimedia content.
This study proposes a method inspired by speech-related tasks to enhance music auto-tagging performance in noisy settings.
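Domain adversarial training of this kind is commonly implemented with a gradient-reversal layer: the encoder is rewarded for features that fool a noise-domain classifier while still supporting the tagging objective. A minimal PyTorch sketch follows; the stand-in encoder, heads, and shapes are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn
from torch.autograd import Function

class GradReverse(Function):
    """Identity in the forward pass; flips the gradient sign in backward."""
    @staticmethod
    def forward(ctx, x, lam: float):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU())  # stand-in audio encoder
tag_head = nn.Linear(64, 50)      # 50 music tags (illustrative)
domain_head = nn.Linear(64, 2)    # clean vs. noisy domain

x = torch.randn(8, 128)                        # batch of audio features
tags = torch.randint(0, 2, (8, 50)).float()
domains = torch.randint(0, 2, (8,))

feats = encoder(x)
tag_loss = nn.functional.binary_cross_entropy_with_logits(tag_head(feats), tags)
dom_loss = nn.functional.cross_entropy(domain_head(GradReverse.apply(feats, 1.0)), domains)
(tag_loss + dom_loss).backward()               # encoder sees reversed domain gradients
```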
arXiv Detail & Related papers (2024-01-27T06:56:51Z)
- Self-Supervised Contrastive Learning for Robust Audio-Sheet Music Retrieval Systems [3.997809845676912]
We show that self-supervised contrastive learning can mitigate the scarcity of annotated data from real music content.
We employ the snippet embeddings in the higher-level task of cross-modal piece identification.
In this work, we observe that the retrieval quality improves from 30% up to 100% when real music data is present.
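Cross-modal contrastive training of this kind typically optimizes a symmetric InfoNCE objective over matched audio and sheet-image snippet embeddings. A minimal sketch, with random tensors standing in for the two encoders' outputs:

```python
# Illustrative only: symmetric InfoNCE over matched audio/sheet snippets.
import torch
import torch.nn.functional as F

def info_nce(audio_emb: torch.Tensor, sheet_emb: torch.Tensor, tau: float = 0.07):
    a = F.normalize(audio_emb, dim=-1)          # audio snippet embeddings (B, D)
    s = F.normalize(sheet_emb, dim=-1)          # sheet-image snippet embeddings (B, D)
    logits = a @ s.t() / tau                    # (B, B) cosine similarities
    targets = torch.arange(a.size(0))           # matching pairs sit on the diagonal
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))
```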
arXiv Detail & Related papers (2023-09-21T14:54:48Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks across 8 publicly available datasets, providing a fair and standard assessment of the representations of all open-source pre-trained models developed on music recordings as baselines.
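A benchmark of this kind usually evaluates frozen representations with lightweight probes. A minimal sketch of that protocol, with a stand-in encoder and synthetic data (MARBLE's actual tasks and models differ):

```python
# Illustrative only: fit a linear probe on frozen pre-trained features.
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(1024, 256), nn.ReLU()).eval()  # frozen stand-in
for p in encoder.parameters():
    p.requires_grad_(False)

probe = nn.Linear(256, 10)  # one lightweight probe per downstream task
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

audio, labels = torch.randn(32, 1024), torch.randint(0, 10, (32,))
with torch.no_grad():
    feats = encoder(audio)              # representations stay fixed
loss = nn.functional.cross_entropy(probe(feats), labels)
loss.backward()
opt.step()
```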
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- MATT: A Multiple-instance Attention Mechanism for Long-tail Music Genre Classification [1.8275108630751844]
Imbalanced music genre classification is a crucial task in the Music Information Retrieval (MIR) field.
Most of the existing models are designed for class-balanced music datasets.
We propose a novel mechanism named Multi-instance Attention (MATT) to boost the performance for identifying tail classes.
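Multiple-instance attention of this general shape treats a track as a bag of snippet embeddings and learns which snippets should drive the genre prediction, which particularly helps sparsely represented tail classes. A minimal PyTorch sketch, with illustrative dimensions rather than MATT's exact design:

```python
# Illustrative only: attention pooling over a bag of snippet embeddings.
import torch
import torch.nn as nn

class MILAttentionPool(nn.Module):
    def __init__(self, dim: int = 128, hidden: int = 64):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        # instances: (B, N, dim) -- N snippet embeddings per track
        weights = torch.softmax(self.score(instances), dim=1)  # (B, N, 1)
        return (weights * instances).sum(dim=1)                # (B, dim) bag embedding

bag = MILAttentionPool()(torch.randn(4, 10, 128))  # pooled track-level features
```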
arXiv Detail & Related papers (2022-09-09T03:52:44Z)
- Quantized GAN for Complex Music Generation from Dance Videos [48.196705493763986]
We present Dance2Music-GAN (D2M-GAN), a novel adversarial multi-modal framework that generates musical samples conditioned on dance videos.
Our proposed framework takes dance video frames and human body motion as input, and learns to generate music samples that plausibly accompany the corresponding input.
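A compact, hedged sketch of the adversarial setup: a generator maps motion features from dance video to logits over a discrete music-token codebook, and a discriminator scores token sequences. The recurrent generator, soft tokens, and non-saturating losses below are simplified stand-ins for D2M-GAN's actual vector-quantized architecture.

```python
# Illustrative only: GAN over discrete music tokens conditioned on motion.
import torch
import torch.nn as nn

T, D, V = 32, 64, 512                        # timesteps, feature dim, codebook size
gen = nn.GRU(D, V, batch_first=True)         # motion features -> token logits
disc = nn.Sequential(nn.Linear(V, 64), nn.ReLU(), nn.Linear(64, 1))

motion = torch.randn(4, T, D)                # pose/frame features from dance video
logits, _ = gen(motion)                      # (4, T, V)
fake_tokens = torch.softmax(logits, dim=-1)  # soft tokens keep the generator differentiable
real_tokens = nn.functional.one_hot(torch.randint(0, V, (4, T)), V).float()

d_loss = (nn.functional.softplus(-disc(real_tokens)) +
          nn.functional.softplus(disc(fake_tokens.detach()))).mean()
g_loss = nn.functional.softplus(-disc(fake_tokens)).mean()  # non-saturating GAN loss
```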
arXiv Detail & Related papers (2022-04-01T17:53:39Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural-language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
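One simple way to realize such weak supervision is a proxy head trained against caption-derived targets, reduced below to noisy bag-of-words labels. The vocabulary, encoder, and proxy task are illustrative assumptions, not MuLaP's actual set of proxy tasks.

```python
# Illustrative only: pre-train an audio backbone against caption-derived
# bag-of-words targets, then discard the proxy head.
import torch
import torch.nn as nn

vocab = 2000                                                # caption vocabulary (assumed)
backbone = nn.Sequential(nn.Linear(128, 256), nn.ReLU())    # stand-in audio encoder
proxy_head = nn.Linear(256, vocab)                          # which words describe the track?

audio = torch.randn(16, 128)
bow_targets = (torch.rand(16, vocab) < 0.01).float()        # noisy bag-of-words labels

loss = nn.functional.binary_cross_entropy_with_logits(proxy_head(backbone(audio)), bow_targets)
loss.backward()
# After pre-training, the backbone's features are compared against the
# same backbone trained with other strategies on downstream tasks.
```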
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- MusicBERT: Symbolic Music Understanding with Large-Scale Pre-Training [97.91071692716406]
Symbolic music understanding refers to understanding music from symbolic data such as MIDI.
MusicBERT is a large-scale pre-trained model for music understanding.
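BERT-style pre-training on symbolic music boils down to masking event tokens and reconstructing them. The generic sketch below uses random masking over an assumed vocabulary; MusicBERT itself uses an OctupleMIDI encoding with a bar-level masking strategy, which this sketch does not reproduce.

```python
# Illustrative only: masked-token pretraining on symbolic music events.
import torch
import torch.nn as nn

VOCAB, MASK_ID, MASK_P = 512, 0, 0.15
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True), num_layers=2)
embed = nn.Embedding(VOCAB, 128)
lm_head = nn.Linear(128, VOCAB)

tokens = torch.randint(1, VOCAB, (8, 64))           # symbolic event tokens
mask = torch.rand(tokens.shape) < MASK_P
inputs = tokens.masked_fill(mask, MASK_ID)          # replace with [MASK]

logits = lm_head(encoder(embed(inputs)))            # (8, 64, VOCAB)
loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
loss.backward()
```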
arXiv Detail & Related papers (2021-06-10T10:13:05Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
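A hedged sketch of the overall idea, conditioning a separation mask on body-keypoint motion: a real system would use a graph network over the skeleton and a stronger fusion model, so the flat MLP and shapes below are purely illustrative.

```python
# Illustrative only: predict a separation mask from keypoints plus the mix.
import torch
import torch.nn as nn

J, T, F_BINS = 21, 50, 256                    # joints, frames, frequency bins
pose_enc = nn.Sequential(nn.Flatten(2), nn.Linear(J * 2, 128), nn.ReLU())
audio_enc = nn.Linear(F_BINS, 128)
mask_head = nn.Linear(256, F_BINS)

keypoints = torch.randn(2, T, J, 2)           # (x, y) per joint per frame
mixture = torch.randn(2, T, F_BINS)           # magnitude spectrogram of the mix

fused = torch.cat([pose_enc(keypoints), audio_enc(mixture)], dim=-1)  # (2, T, 256)
mask = torch.sigmoid(mask_head(fused))        # per-frame separation mask
separated = mask * mixture                    # estimate for the gestured player
```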
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)
This list is automatically generated from the titles and abstracts of the papers on this site.