Hierarchical Audio-Visual Information Fusion with Multi-label Joint
Decoding for MER 2023
- URL: http://arxiv.org/abs/2309.07925v1
- Date: Mon, 11 Sep 2023 03:19:10 GMT
- Title: Hierarchical Audio-Visual Information Fusion with Multi-label Joint
Decoding for MER 2023
- Authors: Haotian Wang, Yuxuan Xi, Hang Chen, Jun Du, Yan Song, Qing Wang,
Hengshun Zhou, Chenxi Wang, Jiefeng Ma, Pengfei Hu, Ya Jiang, Shi Cheng, Jie
Zhang and Yuzhe Weng
- Abstract summary: In this paper, we propose a novel framework for recognizing both discrete and dimensional emotions.
Deep features extracted from foundation models are used as robust acoustic and visual representations of raw video.
Our final system achieves state-of-the-art performance and ranks third on the leaderboard of the MER-MULTI sub-challenge.
- Score: 51.95161901441527
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we propose a novel framework for recognizing both discrete and
dimensional emotions. In our framework, deep features extracted from foundation
models are used as robust acoustic and visual representations of raw video.
Three different structures based on attention-guided feature gathering (AFG)
are designed for deep feature fusion. Then, we introduce a joint decoding
structure for emotion classification and valence regression in the decoding
stage. A multi-task loss based on uncertainty is also designed to optimize the
whole process. Finally, by combining the three structures at the posterior
probability level, we obtain the final predictions of discrete and dimensional
emotions. When tested on the dataset of the Multimodal Emotion Recognition
Challenge (MER 2023), the proposed framework yields consistent improvements in
both emotion classification and valence regression. Our final system achieves
state-of-the-art performance and ranks third on the leaderboard of the
MER-MULTI sub-challenge.
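The abstract does not spell out the form of the uncertainty-based multi-task loss, but a common way to balance a classification head and a regression head is homoscedastic-uncertainty weighting in the spirit of Kendall et al. (2018). The PyTorch sketch below illustrates that idea under this assumption; the class name, loss terms, and exact scaling are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Balances discrete-emotion classification and valence regression with
    learnable homoscedastic-uncertainty weights (Kendall-style). A sketch of
    one plausible formulation, not the MER 2023 system's exact loss."""

    def __init__(self):
        super().__init__()
        # log(sigma^2) for each task, learned jointly with the network.
        self.log_var_cls = nn.Parameter(torch.zeros(1))
        self.log_var_reg = nn.Parameter(torch.zeros(1))
        self.ce = nn.CrossEntropyLoss()
        self.mse = nn.MSELoss()

    def forward(self, emo_logits, emo_labels, valence_pred, valence_target):
        loss_cls = self.ce(emo_logits, emo_labels)
        loss_reg = self.mse(valence_pred, valence_target)
        # Each term is scaled by exp(-log_var) and regularized by log_var so
        # the network cannot trivially down-weight both tasks.
        total = (torch.exp(-self.log_var_cls) * loss_cls + self.log_var_cls
                 + 0.5 * (torch.exp(-self.log_var_reg) * loss_reg + self.log_var_reg))
        return total.squeeze()
```

The posterior-level combination of the three AFG structures could then be as simple as a (weighted) average of their class posteriors and valence predictions, although the exact fusion rule is not stated in this abstract.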
Related papers
- Machine Learning Framework for Audio-Based Content Evaluation using MFCC, Chroma, Spectral Contrast, and Temporal Feature Engineering [0.0]
We construct a dataset containing audio samples from music covers on YouTube along with the audio of the original song, and sentiment scores derived from user comments.
Our approach involves extensive pre-processing, segmenting audio signals into 30-second windows, and extracting high-dimensional feature representations.
We train regression models to predict sentiment scores on a 0-100 scale, achieving root mean square error (RMSE) values of 3.420, 5.482, 2.783, and 4.212.
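The summary names the hand-crafted descriptors (MFCC, chroma, spectral contrast) and the 30-second windowing but not the implementation; a minimal sketch of such a pipeline, assuming librosa features mean-pooled per window and a generic scikit-learn regressor (the regressor choice and file paths are placeholders), might look like this:

```python
import librosa
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor  # placeholder regressor choice

def window_features(path, sr=22050, win_seconds=30):
    """Mean-pool MFCC, chroma, and spectral-contrast descriptors over each
    30-second window of an audio file (hypothetical pipeline sketch)."""
    y, sr = librosa.load(path, sr=sr)
    win = win_seconds * sr
    feats = []
    for start in range(0, max(len(y) - win + 1, 1), win):
        seg = y[start:start + win]
        mfcc = librosa.feature.mfcc(y=seg, sr=sr, n_mfcc=20)
        chroma = librosa.feature.chroma_stft(y=seg, sr=sr)
        contrast = librosa.feature.spectral_contrast(y=seg, sr=sr)
        feats.append(np.concatenate([m.mean(axis=1) for m in (mfcc, chroma, contrast)]))
    return np.stack(feats)

# X: stacked window features, y: sentiment scores on a 0-100 scale.
# model = GradientBoostingRegressor().fit(X, y)
```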
arXiv Detail & Related papers (2024-10-31T20:26:26Z) - Leveraging Contrastive Learning and Self-Training for Multimodal Emotion Recognition with Limited Labeled Samples [18.29910296652917]
We present our submission solutions for the Semi-Supervised Learning Sub-Challenge (MER2024-SEMI).
This challenge tackles the issue of limited annotated data in emotion recognition.
Our proposed method is validated to be effective on the MER2024-SEMI Challenge, achieving a weighted average F-score of 88.25% and ranking 6th on the leaderboard.
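The summary does not describe the self-training recipe; a generic pseudo-labeling step of the kind often combined with contrastive pre-training in semi-supervised emotion recognition could look like the sketch below (the model, data loader, and confidence threshold are placeholders, not the submission's actual settings).

```python
import torch

@torch.no_grad()
def select_pseudo_labels(model, unlabeled_loader, threshold=0.9):
    """Keep unlabeled clips whose predicted emotion probability exceeds a
    confidence threshold; the retained (features, pseudo-label) pairs are
    added to the labeled pool for the next training round. Illustrative only."""
    model.eval()
    selected = []
    for feats in unlabeled_loader:
        probs = torch.softmax(model(feats), dim=-1)
        conf, labels = probs.max(dim=-1)
        keep = conf >= threshold
        if keep.any():
            selected.append((feats[keep], labels[keep]))
    return selected
```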
arXiv Detail & Related papers (2024-08-23T11:33:54Z) - DAC: 2D-3D Retrieval with Noisy Labels via Divide-and-Conquer Alignment and Correction [26.164120380820307]
We propose a Divide-and-conquer 2D-3D cross-modal Alignment and Correction framework, which comprises Multimodal Dynamic Division (MDD) and Adaptive Alignment and Correction (AAC).
In AAC, samples in distinct subsets are exploited with different alignment strategies to fully enhance semantic compactness while mitigating over-fitting to noisy labels.
To evaluate the effectiveness in real-world scenarios, we introduce a challenging noisy benchmark, namely N200, which comprises 200k-level samples annotated with 1156 realistic noisy labels.
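The mechanics of MDD are not given in this summary; one generic way to realize a "divide" step is to partition samples by how well the current model fits them, e.g. by per-sample loss quantiles, and then treat cleaner subsets more aggressively in the alignment stage. The sketch below shows only that generic idea, not the paper's criterion.

```python
import numpy as np

def dynamic_division(per_sample_losses, n_subsets=3):
    """Assign each training sample to a subset based on loss quantiles, so
    presumably-clean (low-loss) samples can later be aligned with a stricter
    strategy than presumably-noisy ones. Generic sketch only."""
    edges = np.quantile(per_sample_losses, np.linspace(0, 1, n_subsets + 1)[1:-1])
    return np.digitize(per_sample_losses, edges)  # subset index per sample
```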
arXiv Detail & Related papers (2024-07-25T05:18:18Z) - MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
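As a rough illustration of fusing modalities at a given encoder depth, the block below lets one modality's frames attend to the other's through cross-attention with a residual connection; stacking such blocks at several audio/visual encoder layers is one plausible reading of the multi-layer fusion described here. Dimensions and names are assumptions, not the MLCA-AVSR code.

```python
import torch
import torch.nn as nn

class CrossModalFusionLayer(nn.Module):
    """One cross-attention fusion block: the query modality attends to the
    other modality's frames; the residual keeps modality-specific information."""

    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_feats, context_feats):
        # query_feats: (B, T_q, D), e.g. audio; context_feats: (B, T_c, D), e.g. video.
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)
```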
arXiv Detail & Related papers (2024-01-07T08:59:32Z) - HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion
Recognition [41.837538440839815]
We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition.
The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data, represented using a Bidirectional Encoder Representations from Transformers (BERT) model.
In order to incorporate contextual knowledge and the information across the two modalities, the audio and text embeddings are combined using a co-attention layer.
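A minimal sketch of such a co-attention layer, assuming pre-extracted wav2vec-style audio embeddings and BERT-style text embeddings (projection sizes, pooling, and the classification head are placeholders rather than the HCAM specifics):

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Each modality attends to the other; the two attended streams are
    mean-pooled, concatenated, and classified. Illustrative sketch only."""

    def __init__(self, d_audio=768, d_text=768, d_model=256, n_heads=4, n_classes=4):
        super().__init__()
        self.proj_a = nn.Linear(d_audio, d_model)
        self.proj_t = nn.Linear(d_text, d_model)
        self.a2t = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.t2a = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(2 * d_model, n_classes)

    def forward(self, audio_emb, text_emb):
        a = self.proj_a(audio_emb)      # (B, T_a, d_model)
        t = self.proj_t(text_emb)       # (B, T_t, d_model)
        a_ctx, _ = self.a2t(a, t, t)    # audio queries attend to text
        t_ctx, _ = self.t2a(t, a, a)    # text queries attend to audio
        pooled = torch.cat([a_ctx.mean(dim=1), t_ctx.mean(dim=1)], dim=-1)
        return self.head(pooled)
```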
arXiv Detail & Related papers (2023-04-14T03:25:00Z) - Understanding and Constructing Latent Modality Structures in Multi-modal
Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z) - Low-complexity deep learning frameworks for acoustic scene
classification [64.22762153453175]
We present low-complexity deep learning frameworks for acoustic scene classification (ASC).
The proposed frameworks can be separated into four main steps: Front-end spectrogram extraction, online data augmentation, back-end classification, and late fusion of predicted probabilities.
Our experiments on the DCASE 2022 Task 1 Development dataset fulfilled the low-complexity requirement and achieved a best classification accuracy of 60.1%.
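The late-fusion step of the four-step pipeline can be illustrated with a simple (optionally weighted) average of the per-model class probabilities; the weighting rule below is a generic sketch, not the paper's exact scheme.

```python
import numpy as np

def late_fuse(prob_list, weights=None):
    """Fuse predicted probabilities from several back-end classifiers.
    prob_list: list of arrays with shape (n_clips, n_classes)."""
    probs = np.stack(prob_list)                      # (n_models, n_clips, n_classes)
    if weights is None:
        weights = np.full(len(prob_list), 1.0 / len(prob_list))
    fused = np.tensordot(weights, probs, axes=1)     # weighted mean over models
    return fused.argmax(axis=-1)                     # predicted scene class per clip
```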
arXiv Detail & Related papers (2022-06-13T11:41:39Z) - Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
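A toy version of bottleneck-style fusion is sketched below: a small set of shared fusion tokens is appended to each modality's token sequence, each modality's transformer layer updates its own copy, and the copies are averaged, so cross-modal information must pass through the narrow bottleneck. Layer sizes and the averaging rule are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class BottleneckFusion(nn.Module):
    """One fusion stage with a few shared bottleneck tokens per modality."""

    def __init__(self, d_model=256, n_heads=4, n_bottleneck=4):
        super().__init__()
        self.bottleneck = nn.Parameter(torch.randn(1, n_bottleneck, d_model))
        self.audio_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.video_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)

    def forward(self, audio_tokens, video_tokens):
        B, n = audio_tokens.size(0), self.bottleneck.size(1)
        z = self.bottleneck.expand(B, -1, -1)
        a_out = self.audio_layer(torch.cat([audio_tokens, z], dim=1))
        v_out = self.video_layer(torch.cat([video_tokens, z], dim=1))
        # Cross-modal exchange happens only via the updated bottleneck tokens.
        z_new = 0.5 * (a_out[:, -n:] + v_out[:, -n:])
        return a_out[:, :-n], v_out[:, :-n], z_new
```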
arXiv Detail & Related papers (2021-06-30T22:44:12Z) - Three Steps to Multimodal Trajectory Prediction: Modality Clustering,
Classification and Synthesis [54.249502356251085]
We present a novel insight along with a brand-new prediction framework.
Our proposed method surpasses state-of-the-art works even without introducing social and map information.
arXiv Detail & Related papers (2021-03-14T06:21:03Z)