HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
- URL: http://arxiv.org/abs/2401.05698v2
- Date: Mon, 1 Apr 2024 07:19:40 GMT
- Title: HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
- Authors: Licai Sun, Zheng Lian, Bin Liu, Jianhua Tao,
- Abstract summary: Motivated by recent advances in self-supervised learning, we propose Hierarchical Contrastive Masked Autoencoder (HiCMAE)
HiCMAE is a novel self-supervised framework that leverages large-scale self-supervised pre-training on vast unlabeled audio-visual data.
Our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods.
- Score: 38.924509275928884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio-Visual Emotion Recognition (AVER) has garnered increasing attention in recent years for its critical role in creating emotion-ware intelligent machines. Previous efforts in this area are dominated by the supervised learning paradigm. Despite significant progress, supervised learning is meeting its bottleneck due to the longstanding data scarcity issue in AVER. Motivated by recent advances in self-supervised learning, we propose Hierarchical Contrastive Masked Autoencoder (HiCMAE), a novel self-supervised framework that leverages large-scale self-supervised pre-training on vast unlabeled audio-visual data to promote the advancement of AVER. Following prior arts in self-supervised audio-visual representation learning, HiCMAE adopts two primary forms of self-supervision for pre-training, namely masked data modeling and contrastive learning. Unlike them which focus exclusively on top-layer representations while neglecting explicit guidance of intermediate layers, HiCMAE develops a three-pronged strategy to foster hierarchical audio-visual feature learning and improve the overall quality of learned representations. To verify the effectiveness of HiCMAE, we conduct extensive experiments on 9 datasets covering both categorical and dimensional AVER tasks. Experimental results show that our method significantly outperforms state-of-the-art supervised and self-supervised audio-visual methods, which indicates that HiCMAE is a powerful audio-visual emotion representation learner. Codes and models will be publicly available at https://github.com/sunlicai/HiCMAE.
Related papers
- SVFAP: Self-supervised Video Facial Affect Perceiver [42.16505961654868]
Motivated by the recent success of self-supervised learning in computer vision, this paper introduces a self-supervised approach, termed Self-supervised Video Facial Affect Perceiver (SVFAP)
To address the dilemma faced by supervised methods, SVFAP leverages masked video autoencoding to perform self-supervised pre-training on massive unlabeled facial videos.
To verify the effectiveness of our method, we conduct experiments on nine datasets spanning three downstream tasks, including dynamic facial expression recognition, dimensional emotion recognition, and personality recognition.
arXiv Detail & Related papers (2023-12-31T07:44:05Z) - Unsupervised Representations Improve Supervised Learning in Speech
Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output featuremaps of the preprocessing step are fed to a custom designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
arXiv Detail & Related papers (2023-09-22T08:54:06Z) - Self-Supervised Learning for Audio-Based Emotion Recognition [1.7598252755538808]
Self-supervised learning is a family of methods which can learn despite a scarcity of supervised labels.
We have applied self-supervised learning pre-training to the classification of emotions from the CMU- MOSEI's acoustic modality.
We find that self-supervised learning consistently improves the performance of the model across all metrics.
arXiv Detail & Related papers (2023-07-23T14:40:50Z) - A Hierarchical Regression Chain Framework for Affective Vocal Burst
Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules.
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO'' and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z) - Weakly-supervised HOI Detection via Prior-guided Bi-level Representation
Learning [66.00600682711995]
Human object interaction (HOI) detection plays a crucial role in human-centric scene understanding and serves as a fundamental building-block for many vision tasks.
One generalizable and scalable strategy for HOI detection is to use weak supervision, learning from image-level annotations only.
This is inherently challenging due to ambiguous human-object associations, large search space of detecting HOIs and highly noisy training signal.
We develop a CLIP-guided HOI representation capable of incorporating the prior knowledge at both image level and HOI instance level, and adopt a self-taught mechanism to prune incorrect human-object associations.
arXiv Detail & Related papers (2023-03-02T14:41:31Z) - Exploring The Role of Mean Teachers in Self-supervised Masked
Auto-Encoders [64.03000385267339]
Masked image modeling (MIM) has become a popular strategy for self-supervised learning(SSL) of visual representations with Vision Transformers.
We present a simple SSL method, the Reconstruction-Consistent Masked Auto-Encoder (RC-MAE) by adding an EMA teacher to MAE.
RC-MAE converges faster and requires less memory usage than state-of-the-art self-distillation methods during pre-training.
arXiv Detail & Related papers (2022-10-05T08:08:55Z) - Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE)
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z) - The Devil is in the Frequency: Geminated Gestalt Autoencoder for
Self-Supervised Visual Pre-Training [13.087987450384036]
We present a new Masked Image Modeling (MIM), termed Geminated Autoencoder (Ge$2$-AE) for visual pre-training.
Specifically, we equip our model with geminated decoders in charge of reconstructing image contents from both pixel and frequency space.
arXiv Detail & Related papers (2022-04-18T09:22:55Z) - MS$^2$L: Multi-Task Self-Supervised Learning for Skeleton Based Action
Recognition [36.74293548921099]
We integrate motion prediction, jigsaw puzzle recognition, and contrastive learning to learn skeleton features from different aspects.
Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show remarkable performance for action recognition.
arXiv Detail & Related papers (2020-10-12T11:09:44Z) - Does Visual Self-Supervision Improve Learning of Speech Representations
for Emotion Recognition? [63.564385139097624]
This work investigates visual self-supervision via face reconstruction to guide the learning of audio representations.
We show that a multi-task combination of the proposed visual and audio self-supervision is beneficial for learning richer features.
We evaluate our learned audio representations for discrete emotion recognition, continuous affect recognition and automatic speech recognition.
arXiv Detail & Related papers (2020-05-04T11:33:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.