Memo2496: Expert-Annotated Dataset and Dual-View Adaptive Framework for Music Emotion Recognition
- URL: http://arxiv.org/abs/2512.13998v2
- Date: Wed, 17 Dec 2025 05:09:17 GMT
- Title: Memo2496: Expert-Annotated Dataset and Dual-View Adaptive Framework for Music Emotion Recognition
- Authors: Qilin Li, C. L. Philip Chen, Tong Zhang
- Abstract summary: Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues.
- Score: 57.869107847456725
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Music Emotion Recogniser (MER) research faces challenges due to limited high-quality annotated datasets and difficulties in addressing cross-track feature drift. This work presents two primary contributions to address these issues. Memo2496, a large-scale dataset, offers 2496 instrumental music tracks with continuous valence-arousal labels, annotated by 30 certified music specialists. Annotation quality is ensured through calibration with extreme emotion exemplars and a consistency threshold of 0.25, measured by Euclidean distance in the valence-arousal space. Furthermore, the Dual-view Adaptive Music Emotion Recogniser (DAMER) is introduced. DAMER integrates three synergistic modules: Dual Stream Attention Fusion (DSAF) facilitates token-level bidirectional interaction between Mel spectrograms and cochleagrams via cross-attention mechanisms; Progressive Confidence Labelling (PCL) generates reliable pseudo-labels employing curriculum-based temperature scheduling and consistency quantification using Jensen-Shannon divergence; and Style Anchored Memory Learning (SAML) maintains a contrastive memory queue to mitigate cross-track feature drift. Extensive experiments on the Memo2496, 1000songs, and PMEmo datasets demonstrate DAMER's state-of-the-art performance, improving arousal-dimension accuracy by 3.43%, 2.25%, and 0.17%, respectively. Ablation studies and visualisation analyses validate each module's contribution. Both the dataset and source code are publicly available.
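The abstract fixes two quantitative criteria precisely enough to sketch. Below is a minimal Python sketch, under assumptions: `annotation_consistent` applies the stated 0.25 Euclidean-distance threshold in valence-arousal space, and `pseudo_label_confidence` illustrates one way PCL-style consistency quantification with Jensen-Shannon divergence and a curriculum temperature could score pseudo-labels. All function names, the two-view setup, and the divergence-to-confidence mapping are illustrative assumptions, not the authors' released code.

```python
# Hedged sketch of the two quantitative criteria named in the abstract.
import numpy as np

VA_CONSISTENCY_THRESHOLD = 0.25  # Euclidean distance in valence-arousal space


def annotation_consistent(label_a, label_b, threshold=VA_CONSISTENCY_THRESHOLD):
    """Check whether two (valence, arousal) annotations agree within the
    0.25 Euclidean-distance threshold reported for Memo2496."""
    return np.linalg.norm(np.asarray(label_a) - np.asarray(label_b)) <= threshold


def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)

    def kl(a, b):
        return np.sum(a * np.log(a / b))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def pseudo_label_confidence(logits_view1, logits_view2, temperature):
    """Consistency-based confidence for a pseudo-label: soften both views'
    logits at the current curriculum temperature, then compare them with
    JS divergence (low divergence -> high confidence)."""
    def softmax(z, t):
        z = np.asarray(z, dtype=float) / t
        z = z - z.max()
        e = np.exp(z)
        return e / e.sum()

    p = softmax(logits_view1, temperature)
    q = softmax(logits_view2, temperature)
    return 1.0 - js_divergence(p, q) / np.log(2)  # JS divergence is bounded by ln 2
```

DSAF's token-level bidirectional interaction can likewise be sketched with two standard cross-attention modules, one per direction. The embedding size, mean pooling, and linear fusion below are assumptions; only the Mel-spectrogram/cochleagram pairing and the bidirectional cross attention come from the abstract.

```python
# Minimal PyTorch sketch of token-level bidirectional cross attention
# between Mel-spectrogram and cochleagram token sequences, in the spirit of DSAF.
import torch
import torch.nn as nn


class DualStreamCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        # Mel tokens attend to cochleagram tokens, and vice versa.
        self.mel_to_coch = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.coch_to_mel = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # fusion by concatenation is an assumption

    def forward(self, mel_tokens, coch_tokens):
        # mel_tokens: (batch, T_mel, dim); coch_tokens: (batch, T_coch, dim)
        mel_enriched, _ = self.mel_to_coch(mel_tokens, coch_tokens, coch_tokens)
        coch_enriched, _ = self.coch_to_mel(coch_tokens, mel_tokens, mel_tokens)
        # Pool each enriched stream over time and fuse into one representation.
        fused = torch.cat([mel_enriched.mean(dim=1), coch_enriched.mean(dim=1)], dim=-1)
        return self.fuse(fused)  # (batch, dim) joint emotion representation
```

Running attention in both directions lets each time-frequency view query the other for complementary cues before fusion.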
Related papers
- AGSP-DSA: An Adaptive Graph Signal Processing Framework for Robust Multimodal Fusion with Dynamic Semantic Alignment [18.39945426205332]
We introduce an Adaptive Graph Signal Processing with Dynamic Semantic Alignment (AGSP-DSA) framework to perform robust multimodal data fusion over heterogeneous sources. Experimental results on three benchmark datasets (CMU-MOSEI, AVE, and MM-IMDB) show that AGSP-DSA achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-01-26T15:35:03Z)
- SG-XDEAT: Sparsity-Guided Cross-Dimensional and Cross-Encoding Attention with Target-Aware Conditioning in Tabular Learning [0.0]
We propose SG-XDEAT, a novel framework for supervised learning on tabular data. At its core, SG-XDEAT employs a dual-stream encoder that decomposes each input feature into two parallel representations. These dual representations are then propagated through a hierarchical stack of attention-based modules.
arXiv Detail & Related papers (2025-10-14T15:56:40Z)
- A Study on the Data Distribution Gap in Music Emotion Recognition [7.281487567929003]
Music Emotion Recognition (MER) is a task deeply connected to human perception. Prior studies tend to focus on specific musical styles rather than incorporating a diverse range of genres. We address the task of recognizing emotion from audio content by investigating five datasets with dimensional emotion annotations.
arXiv Detail & Related papers (2025-10-06T10:57:05Z)
- Towards Unified Music Emotion Recognition across Dimensional and Categorical Models [9.62904012066486]
One of the most significant challenges in Music Emotion Recognition (MER) comes from the fact that emotion labels can be heterogeneous across datasets. We present a unified multitask learning framework that combines categorical and dimensional labels (a minimal multitask-loss sketch appears after this list). Our work makes a significant contribution to MER by allowing categorical and dimensional emotion labels to be combined in one unified framework.
arXiv Detail & Related papers (2025-02-06T11:20:22Z)
- Personalized Dynamic Music Emotion Recognition with Dual-Scale Attention-Based Meta-Learning [15.506299212817034]
We propose a Dual-Scale Attention-Based Meta-Learning (DSAML) method for Dynamic Music Emotion Recognition (DMER). Our method fuses features from a dual-scale feature extractor and captures both short- and long-term dependencies. Our objective and subjective experiments demonstrate that our method achieves state-of-the-art performance in both traditional DMER and PDMER.
arXiv Detail & Related papers (2024-12-26T12:47:35Z)
- Dual-Perspective Knowledge Enrichment for Semi-Supervised 3D Object Detection [55.210991151015534]
We present a novel Dual-Perspective Knowledge Enrichment approach named DPKE for semi-supervised 3D object detection.
Our DPKE enriches the knowledge of limited training data, particularly unlabeled data, from two perspectives: the data perspective and the feature perspective.
arXiv Detail & Related papers (2024-01-10T08:56:07Z)
- Hierarchical Audio-Visual Information Fusion with Multi-label Joint Decoding for MER 2023 [51.95161901441527]
In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions.
Deep features extracted from foundation models are used as robust acoustic and visual representations of raw video.
Our final system achieves state-of-the-art performance and ranks third on the leaderboard of the MER-MULTI sub-challenge.
arXiv Detail & Related papers (2023-09-11T03:19:10Z)
- Black-box Unsupervised Domain Adaptation with Bi-directional Atkinson-Shiffrin Memory [59.51934126717572]
Black-box unsupervised domain adaptation (UDA) learns with source predictions of target data without accessing either source data or source models during training.
We propose BiMem, a bi-directional memorization mechanism that learns to remember useful and representative information to correct noisy pseudo labels on the fly.
BiMem achieves superior domain adaptation performance consistently across various visual recognition tasks such as image classification, semantic segmentation and object detection.
arXiv Detail & Related papers (2023-08-25T08:06:48Z)
- Object Segmentation by Mining Cross-Modal Semantics [68.88086621181628]
We propose a novel approach that mines cross-modal semantics to guide the fusion and decoding of multimodal features.
Specifically, we propose a novel network, termed XMSNet, consisting of (1) all-round attentive fusion (AF), (2) coarse-to-fine decoder (CFD), and (3) cross-layer self-supervision.
arXiv Detail & Related papers (2023-05-17T14:30:11Z)
- The MuSe 2023 Multimodal Sentiment Analysis Challenge: Mimicked Emotions, Cross-Cultural Humour, and Personalisation [69.13075715686622]
MuSe 2023 is a set of shared tasks addressing three different contemporary multimodal affect and sentiment analysis problems.
MuSe 2023 seeks to bring together a broad audience from different research communities.
arXiv Detail & Related papers (2023-05-05T08:53:57Z)
- Supervision by Registration and Triangulation for Landmark Detection [70.13440728689231]
We present Supervision by Registration and Triangulation (SRT), an unsupervised approach that utilizes unlabeled multi-view video to improve the accuracy and precision of landmark detectors.
This ability to exploit unlabeled data lets our detectors learn from the massive amounts of such data freely available.
arXiv Detail & Related papers (2021-01-25T02:48:21Z)
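As a companion to the unified-framework summary above (Towards Unified Music Emotion Recognition across Dimensional and Categorical Models), here is a minimal, hedged sketch of how one backbone can be trained on datasets that carry either categorical or dimensional (valence-arousal) labels: each head's loss is applied only when its label type is present. The head sizes, the cross-entropy/MSE pairing, and the weighting `alpha` are assumptions, not the paper's exact design.

```python
# Hedged sketch of a multitask head over a shared backbone that accepts
# heterogeneous emotion labels: categorical classes and/or valence-arousal values.
import torch
import torch.nn as nn


class MultiTaskEmotionHead(nn.Module):
    def __init__(self, feat_dim=256, num_classes=4):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_classes)  # categorical head
        self.regressor = nn.Linear(feat_dim, 2)             # (valence, arousal) head

    def forward(self, features):
        return self.classifier(features), self.regressor(features)


def multitask_loss(class_logits, va_pred, class_target=None, va_target=None, alpha=1.0):
    """Sum the losses for whichever label types a given dataset provides,
    so datasets with heterogeneous annotations can share one backbone."""
    loss = torch.zeros((), device=class_logits.device)
    if class_target is not None:
        loss = loss + nn.functional.cross_entropy(class_logits, class_target)
    if va_target is not None:
        loss = loss + alpha * nn.functional.mse_loss(va_pred, va_target)
    return loss
```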