MERGE -- A Bimodal Dataset for Static Music Emotion Recognition
- URL: http://arxiv.org/abs/2407.06060v1
- Date: Mon, 8 Jul 2024 16:01:04 GMT
- Title: MERGE -- A Bimodal Dataset for Static Music Emotion Recognition
- Authors: Pedro Lima Louro, Hugo Redinho, Ricardo Santos, Ricardo Malheiro, Renato Panda, Rui Pedro Paiva
- Abstract summary: This article proposes three new audio, lyrics, and bimodal Music Emotion Recognition research datasets, collectively called MERGE, created using a semi-automatic approach.
The obtained results confirm the viability of the proposed datasets, achieving the best overall result of 79.21% F1-score for bimodal classification using a deep neural network.
- Score: 0.5339846068056558
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The Music Emotion Recognition (MER) field has seen steady developments in recent years, with contributions from feature engineering, machine learning, and deep learning. The landscape has also shifted from audio-centric systems to bimodal ensembles that combine audio and lyrics. However, a severe lack of public and sizeable bimodal databases has hampered the development and improvement of bimodal audio-lyrics systems. This article proposes three new audio, lyrics, and bimodal MER research datasets, collectively called MERGE, created using a semi-automatic approach. To comprehensively assess the proposed datasets and establish a baseline for benchmarking, we conducted several experiments for each modality, using feature engineering, machine learning, and deep learning methodologies. In addition, we propose and validate fixed train-validate-test splits. The obtained results confirm the viability of the proposed datasets, achieving the best overall result of 79.21% F1-score for bimodal classification using a deep neural network.
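For orientation, the sketch below shows what a bimodal evaluation on the fixed splits could look like: early fusion of precomputed audio and lyric feature vectors, a standard classifier, and macro F1-scores on the validate and test partitions. The file names, column layout, quadrant labels, and SVM classifier are illustrative assumptions rather than the paper's actual pipeline; the best reported result (79.21% F1-score) was obtained with a deep neural network, not this baseline.
```python
# Minimal sketch (not the authors' pipeline): early fusion of precomputed audio
# and lyric feature vectors, a standard classifier, and macro F1-scores on the
# fixed validate and test splits. File names, column names, and quadrant labels
# are hypothetical placeholders for however the MERGE features are exported.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import f1_score

def load_split(split):
    # Assumed per-split CSVs with a song identifier and a 'quadrant' label column.
    audio = pd.read_csv(f"merge_audio_{split}.csv")
    lyrics = pd.read_csv(f"merge_lyrics_{split}.csv")
    # Feature-level (early) fusion: concatenate audio and lyric descriptors.
    X = np.hstack([audio.drop(columns=["song_id", "quadrant"]).values,
                   lyrics.drop(columns=["song_id", "quadrant"]).values])
    y = audio["quadrant"].values  # e.g., Russell-quadrant classes Q1-Q4
    return X, y

X_train, y_train = load_split("train")
X_val, y_val = load_split("validate")
X_test, y_test = load_split("test")

scaler = StandardScaler().fit(X_train)
clf = SVC(kernel="rbf", C=1.0).fit(scaler.transform(X_train), y_train)

for name, X, y in [("validate", X_val, y_val), ("test", X_test, y_test)]:
    pred = clf.predict(scaler.transform(X))
    print(f"{name} macro F1: {f1_score(y, pred, average='macro'):.4f}")
```
The sketch only shows the shape of the data flow and the evaluation on fixed splits; the reported 79.21% F1-score came from a deep neural network rather than this shallow baseline.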
Related papers
- Audio-Guided Fusion Techniques for Multimodal Emotion Analysis [2.7013910991626213]
We propose a solution for the semi-supervised learning track (MER-SEMI) in MER2024.
We fine-tuned video and text feature extractors, specifically CLIP-vit-large and Baichuan-13B, using labeled data.
We also propose an Audio-Guided Transformer (AGT) fusion mechanism, showing superior effectiveness in fusing both inter-channel and intra-channel information.
arXiv Detail & Related papers (2024-09-08T07:28:27Z) - Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that our proposed method achieves average relative performance improvements of 60% and 20%.
arXiv Detail & Related papers (2023-02-22T10:06:37Z) - Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network [58.82343017711883]
This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to learn the mapping between unpaired speech and phone sequences.
In the second stage, another HMM is introduced and trained on the generator's output, which boosts performance.
arXiv Detail & Related papers (2022-07-29T09:29:28Z) - BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping [19.071463356974387]
This work extends existing methods based on self-supervised learning by bootstrapping, proposes various encoder architectures, and explores the effects of using different pre-training datasets.
We present a novel training framework to come up with a hybrid audio representation, which combines handcrafted and data-driven learned audio features.
All the proposed representations were evaluated within the HEAR NeurIPS 2021 challenge for auditory scene classification and timestamp detection tasks.
arXiv Detail & Related papers (2022-06-24T02:26:40Z) - Speech Emotion Recognition with Co-Attention based Multi-level Acoustic Information [21.527784717450885]
Speech Emotion Recognition aims to help machines understand humans' subjective emotions from audio information alone.
We propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module.
arXiv Detail & Related papers (2022-03-29T08:17:28Z) - Leveraging Uni-Modal Self-Supervised Learning for Multimodal Audio-Visual Speech Recognition [23.239078852797817]
We leverage uni-modal self-supervised learning to promote multimodal audio-visual speech recognition (AVSR).
In particular, we first train audio and visual encoders on a large-scale uni-modal dataset, then we integrate components of both encoders into a larger multimodal framework.
Our model is experimentally validated on both word-level and sentence-level AVSR tasks.
arXiv Detail & Related papers (2022-02-24T15:12:17Z) - A Study of Multimodal Person Verification Using Audio-Visual-Thermal Data [4.149096351426994]
We study an approach to multimodal person verification using audio, visual, and thermal modalities.
We implement unimodal, bimodal, and trimodal verification systems using state-of-the-art deep learning architectures.
arXiv Detail & Related papers (2021-10-23T04:41:03Z) - Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model [51.42415340921237]
We propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to fuse the two modalities (audio and text).
We further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio.
arXiv Detail & Related papers (2021-07-04T08:35:20Z) - Few-Shot Named Entity Recognition: A Comprehensive Study [92.40991050806544]
We investigate three schemes to improve the model generalization ability for few-shot settings.
We perform empirical comparisons on 10 public NER datasets with various proportions of labeled data.
We establish new state-of-the-art results in both few-shot and training-free settings.
arXiv Detail & Related papers (2020-12-29T23:43:16Z) - Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for estimating the separation performance of state-of-the-art deep learning approaches (a minimal sketch of this oracle appears after this list).
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)
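As a concrete illustration of the oracle ideal-ratio-mask proxy mentioned above in "Fast accuracy estimation of deep learning based multi-class musical source separation", the sketch below masks a mixture's STFT with the oracle ratio mask computed from the ground-truth stems and scores the reconstruction with a plain SDR. The STFT parameters and the SDR metric are illustrative choices, not necessarily the cited paper's exact protocol.
```python
# Minimal sketch of an ideal-ratio-mask (IRM) oracle as a separability proxy:
# given the true target and accompaniment, mask the mixture's STFT with the
# oracle ratio mask and score the time-domain reconstruction with SDR.
# STFT settings and the plain SDR metric are illustrative assumptions.
import numpy as np
from scipy.signal import stft, istft

def irm_oracle_sdr(target, accompaniment, fs=44100, nperseg=2048):
    mixture = target + accompaniment
    _, _, S = stft(target, fs=fs, nperseg=nperseg)          # target spectrogram
    _, _, N = stft(accompaniment, fs=fs, nperseg=nperseg)   # residual spectrogram
    _, _, X = stft(mixture, fs=fs, nperseg=nperseg)         # mixture spectrogram

    mask = np.abs(S) / (np.abs(S) + np.abs(N) + 1e-12)      # ideal ratio mask
    _, estimate = istft(mask * X, fs=fs, nperseg=nperseg)   # back to time domain

    n = min(len(target), len(estimate))
    err = target[:n] - estimate[:n]
    # SDR in dB: 10 * log10(||s||^2 / ||s - s_hat||^2)
    return 10 * np.log10(np.sum(target[:n] ** 2) / (np.sum(err ** 2) + 1e-12))

# Toy usage: two random "sources" standing in for real instrument stems.
rng = np.random.default_rng(0)
print(irm_oracle_sdr(rng.standard_normal(44100), rng.standard_normal(44100)))
```
The higher this oracle SDR, the more separable the stems are in principle, which is the sense in which the cited work uses it as a proxy for trained separators.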
This list is automatically generated from the titles and abstracts of the papers in this site.