Do Foundational Audio Encoders Understand Music Structure?
- URL: http://arxiv.org/abs/2512.17209v1
- Date: Fri, 19 Dec 2025 03:42:47 GMT
- Title: Do Foundational Audio Encoders Understand Music Structure?
- Authors: Keisuke Toyama, Zhi Zhong, Akira Takahashi, Shusuke Takahashi, Yuki Mitsufuji
- Abstract summary: We conduct experiments on 11 types of foundational audio encoders (FAEs) to investigate how these factors affect music structure analysis (MSA) performance. Our results demonstrate that FAEs using self-supervised learning with masked language modeling on music data are particularly effective for MSA.
- Score: 32.88009059868699
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In music information retrieval (MIR) research, the use of pretrained foundational audio encoders (FAEs) has recently become a trend. FAEs pretrained on large amounts of music and audio data have been shown to improve performance on MIR tasks such as music tagging and automatic music transcription. However, their use for music structure analysis (MSA) remains underexplored. Although many open-source FAE models are available, only a small subset has been examined for MSA, and the impact of factors such as learning methods, training data, and model context length on MSA performance remains unclear. In this study, we conduct comprehensive experiments on 11 types of FAEs to investigate how these factors affect MSA performance. Our results demonstrate that FAEs using self-supervised learning with masked language modeling on music data are particularly effective for MSA. These findings pave the way for future research in MSA.
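A common way to probe an FAE for MSA is to extract frame-level embeddings and feed them to a segmentation front end. Below is a minimal sketch of that kind of pipeline, assuming a MERT-style self-supervised encoder from the Hugging Face hub and classic Foote novelty detection; the model name, layer choice, and kernel size are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch (assumed: MERT-style encoder, Foote novelty detection);
# not the paper's exact pipeline.
import numpy as np
import torch
import torchaudio
from transformers import AutoModel, Wav2Vec2FeatureExtractor

MODEL = "m-a-p/MERT-v1-95M"  # one example FAE; the paper compares 11

def frame_embeddings(wav_path: str) -> np.ndarray:
    """Return (frames, dim) embeddings from the encoder's last hidden layer."""
    processor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL, trust_remote_code=True)
    model = AutoModel.from_pretrained(MODEL, trust_remote_code=True).eval()
    wav, sr = torchaudio.load(wav_path)
    mono = torchaudio.functional.resample(wav.mean(0), sr, processor.sampling_rate)
    inputs = processor(mono.numpy(), sampling_rate=processor.sampling_rate,
                       return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # Intermediate layers may segment better; the last layer is used for brevity.
    return out.hidden_states[-1].squeeze(0).numpy()

def novelty_curve(emb: np.ndarray, half_kernel: int = 32) -> np.ndarray:
    """Checkerboard-kernel novelty (Foote, 2000) on a cosine self-similarity matrix."""
    emb = emb / (np.linalg.norm(emb, axis=1, keepdims=True) + 1e-8)
    ssm = emb @ emb.T
    k = half_kernel
    sign = np.kron(np.array([[1.0, -1.0], [-1.0, 1.0]]), np.ones((k, k)))
    padded = np.pad(ssm, k, mode="edge")
    return np.array([np.sum(padded[i:i + 2 * k, i:i + 2 * k] * sign)
                     for i in range(ssm.shape[0])])

# Peaks in novelty_curve(frame_embeddings("song.wav")) are candidate boundaries.
```

Boundary hit rate and pairwise clustering scores for such a system can then be computed with standard tools such as mir_eval's segment module.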
Related papers
- Sound and Music Biases in Deep Music Transcription Models: A Systematic Analysis [6.87202900256721]
This work investigates the musical dimension -- specifically, variations in genre, dynamics, and polyphony levels. We introduce the MDS corpus, comprising three distinct subsets: (1) Genre, (2) Random, and (3) MAEtest. We evaluate the performance of several state-of-the-art AMT systems on the MDS corpus using both traditional information-retrieval and musically informed performance metrics.
arXiv Detail & Related papers (2025-12-16T17:12:26Z)
- Segment Transformer: AI-Generated Music Detection via Music Structural Analysis [1.7034813545878587]
We aim to improve the accuracy of AIGM detection by analyzing the structural patterns of music segments. Specifically, to extract musical features from short audio clips, we integrated various pre-trained models. For long audio, we developed a segment transformer that divides music into segments and learns inter-segment relationships.
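The snippet below sketches one plausible reading of that design, under stated assumptions: per-segment features from some pretrained extractor are fed to a Transformer encoder that models inter-segment relationships before a binary real-vs-AI decision. The dimensions, depth, and [CLS]-token readout are illustrative, not the paper's architecture.

```python
# Hedged sketch of a segment-transformer-style classifier; sizes and the
# upstream feature extractor are assumptions, not the paper's model.
import torch
import torch.nn as nn

class SegmentTransformer(nn.Module):
    def __init__(self, feat_dim=768, n_heads=8, n_layers=4):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, feat_dim))  # learned [CLS]
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(feat_dim, 1)   # logit: AI-generated vs. real

    def forward(self, segment_feats):        # (batch, n_segments, feat_dim)
        x = torch.cat([self.cls.expand(segment_feats.size(0), -1, -1),
                       segment_feats], dim=1)
        x = self.encoder(x)                  # attends across segments
        return self.head(x[:, 0])            # classify from the [CLS] token

# Usage: pooled features for, e.g., 12 segments from a pretrained model.
logits = SegmentTransformer()(torch.randn(2, 12, 768))
```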
arXiv Detail & Related papers (2025-09-10T04:56:40Z)
- Towards an AI Musician: Synthesizing Sheet Music Problems for Musical Reasoning [69.78158549955384]
We introduce a novel approach that treats core music theory rules, such as those governing beats and intervals, as programmatic functions. This approach generates verifiable sheet music questions in both textual and visual modalities. Evaluation results on SSMR-Bench highlight the key role reasoning plays in interpreting sheet music.
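As a concrete illustration of "music theory rules as programmatic functions", the toy generator below produces interval-identification questions whose answers are computed and therefore verifiable. It is a hypothetical example, not code from the paper, and it names intervals by semitone count only, ignoring enharmonic spelling.

```python
# Toy illustration (not the paper's code): a music theory rule as a
# programmatic function that yields verifiable question-answer pairs.
import random

NOTE_TO_SEMITONE = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}
INTERVAL_NAMES = {1: "minor 2nd", 2: "major 2nd", 3: "minor 3rd",
                  4: "major 3rd", 5: "perfect 4th", 6: "tritone",
                  7: "perfect 5th", 8: "minor 6th", 9: "major 6th",
                  10: "minor 7th", 11: "major 7th"}

def make_interval_question(rng: random.Random) -> dict:
    start, end = rng.sample(sorted(NOTE_TO_SEMITONE), 2)
    # Ascending interval size in semitones, wrapping within one octave.
    semitones = (NOTE_TO_SEMITONE[end] - NOTE_TO_SEMITONE[start]) % 12
    return {"question": f"What is the ascending interval from {start} to {end}?",
            "answer": INTERVAL_NAMES[semitones]}

print(make_interval_question(random.Random(0)))
```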
arXiv Detail & Related papers (2025-09-04T09:42:17Z)
- CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following [12.638115555721257]
CMI-Bench is a comprehensive music instruction-following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-Audio, SALMONN, MusiLingo, etc.
arXiv Detail & Related papers (2025-06-14T00:18:44Z)
- A Benchmark and Robustness Study of In-Context-Learning with Large Language Models in Music Entity Detection [0.046040036610482664]
We provide a novel dataset of user-generated metadata and conduct a benchmark and robustness study using recent language models with in-context learning (ICL). Our results indicate that LLMs in the ICL setting yield higher performance than SLMs.
arXiv Detail & Related papers (2024-12-16T15:11:03Z)
- Foundation Models for Music: A Survey [77.77088584651268]
Foundation models (FMs) have profoundly impacted diverse sectors, including music.
This comprehensive review examines state-of-the-art (SOTA) pre-trained models and foundation models in music.
arXiv Detail & Related papers (2024-08-26T15:13:14Z)
- Perceptual Musical Features for Interpretable Audio Tagging [2.1730712607705485]
This study explores the relevance of interpretability in the context of automatic music tagging.
We constructed a workflow that incorporates three different information extraction techniques.
We conducted experiments on two datasets, namely the MTG-Jamendo dataset and the GTZAN dataset.
arXiv Detail & Related papers (2023-12-18T14:31:58Z)
- Deep Feature Learning for Medical Acoustics [78.56998585396421]
The purpose of this paper is to compare different learnables in medical acoustics tasks.
A framework has been implemented to classify human respiratory sounds and heartbeats into two categories, i.e., healthy or affected by pathologies.
arXiv Detail & Related papers (2022-08-05T10:39:37Z)
- Multitask learning for instrument activation aware music source separation [83.30944624666839]
We propose a novel multitask structure to investigate using instrument activation information to improve source separation performance.
We investigate our system on six independent instruments, a more realistic scenario than the three instruments included in the widely-used MUSDB dataset.
The results show that our proposed multitask model outperforms the baseline Open-Unmix model on the mixture of the Mixing Secrets and MedleyDB datasets.
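A hedged sketch of how activation information could enter training is shown below: a shared encoder feeds both a separation-mask head and a per-frame instrument-activation head, and the two losses are combined. The architecture, loss weighting, and single-target mask are assumptions for illustration, not the paper's exact model.

```python
# Hedged sketch (assumed architecture): separation plus an auxiliary
# instrument-activation task sharing one encoder.
import torch
import torch.nn as nn

class MultitaskSeparator(nn.Module):
    def __init__(self, n_bins=513, n_instruments=6, hidden=256):
        super().__init__()
        self.encoder = nn.GRU(n_bins, hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.mask_head = nn.Linear(2 * hidden, n_bins)        # soft TF mask
        self.act_head = nn.Linear(2 * hidden, n_instruments)  # activations

    def forward(self, mix_mag):                 # (batch, frames, n_bins)
        h, _ = self.encoder(mix_mag)
        estimate = mix_mag * torch.sigmoid(self.mask_head(h))
        return estimate, self.act_head(h)       # per-frame activation logits

def multitask_loss(estimate, target, act_logits, act_labels, lam=0.1):
    # Separation loss plus a weighted instrument-activation loss.
    sep = nn.functional.l1_loss(estimate, target)
    act = nn.functional.binary_cross_entropy_with_logits(act_logits, act_labels)
    return sep + lam * act
```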
arXiv Detail & Related papers (2020-08-03T02:35:00Z)
- Multi-Modal Music Information Retrieval: Augmenting Audio-Analysis with Visual Computing for Improved Music Video Analysis [91.3755431537592]
This thesis combines audio-analysis with computer vision to approach Music Information Retrieval (MIR) tasks from a multi-modal perspective.
The main hypothesis of this work is based on the observation that certain expressive categories such as genre or theme can be recognized on the basis of the visual content alone.
The experiments are conducted for three MIR tasks: Artist Identification, Music Genre Classification, and Cross-Genre Classification.
arXiv Detail & Related papers (2020-02-01T17:57:14Z)