Audio-visual Attentive Fusion for Continuous Emotion Recognition
- URL: http://arxiv.org/abs/2107.01175v1
- Date: Fri, 2 Jul 2021 16:28:55 GMT
- Title: Audio-visual Attentive Fusion for Continuous Emotion Recognition
- Authors: Su Zhang, Yi Ding, Ziquan Wei, Cuntai Guan
- Abstract summary: We propose an audio-visual spatial-temporal deep neural network with: (1) a visual block containing a pretrained 2D-CNN followed by a temporal convolutional network (TCN); (2) an aural block containing several parallel TCNs; and (3) a leader-follower attentive fusion block combining the audio-visual information.
- Score: 12.211342881526276
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose an audio-visual spatial-temporal deep neural network with: (1) a
visual block containing a pretrained 2D-CNN followed by a temporal
convolutional network (TCN); (2) an aural block containing several parallel
TCNs; and (3) a leader-follower attentive fusion block combining the
audio-visual information. The TCN with large history coverage enables our model
to exploit spatial-temporal information within a much larger window length
(i.e., 300) than that from the baseline and state-of-the-art methods (i.e., 36
or 48). The fusion block emphasizes the visual modality while exploiting the
noisy aural modality using an inter-modality attention mechanism. To make full
use of the data and alleviate over-fitting, cross-validation is carried out on
the training and validation set. The concordance correlation coefficient (CCC)
centering is used to merge the results from each fold. On the development set,
the achieved CCC is 0.410 for valence and 0.661 for arousal, which
significantly outperforms the baseline method with the corresponding CCC of
0.210 and 0.230 for valence and arousal, respectively. The code is available at
https://github.com/sucv/ABAW2.
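The leader-follower idea in the fusion block is that the more reliable visual stream leads the prediction, while the noisy aural stream contributes only through an attention gate. The authors' implementation is in the linked repository; below is a minimal PyTorch sketch of one way such a gated fusion can be wired, where the layer sizes, the scalar-gate design, and the 128-dimensional features are illustrative assumptions, not the paper's exact configuration.

```python
# Hypothetical sketch of a leader-follower attentive fusion block.
# The gating design and dimensions are assumptions; the authors'
# implementation is at https://github.com/sucv/ABAW2.
import torch
import torch.nn as nn

class LeaderFollowerFusion(nn.Module):
    """Fuse a leading (visual) and a following (aural) feature stream.

    The follower contribution is scaled by an attention gate computed
    from both streams, so a noisy follower can be down-weighted.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.attn = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.Tanh(),
            nn.Linear(dim, 1),
            nn.Sigmoid(),  # scalar gate in [0, 1] per time step
        )

    def forward(self, leader: torch.Tensor, follower: torch.Tensor) -> torch.Tensor:
        # leader, follower: (batch, time, dim)
        gate = self.attn(torch.cat([leader, follower], dim=-1))  # (batch, time, 1)
        return leader + gate * follower  # leader dominates; follower is gated in

# Example: fuse 300-step visual and aural TCN outputs with 128-dim features.
fusion = LeaderFollowerFusion(dim=128)
visual = torch.randn(4, 300, 128)
aural = torch.randn(4, 300, 128)
fused = fusion(visual, aural)  # (4, 300, 128)
```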
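The CCC figures quoted above use the standard concordance correlation coefficient, which penalizes decorrelation as well as shifts in mean or scale between predictions and labels. A small reference implementation of the standard definition (the per-fold "CCC centering" merge is not reproduced here):

```python
# Concordance correlation coefficient (CCC), the evaluation metric above.
import numpy as np

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    """CCC = 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

# Example: perfect agreement gives CCC = 1; scale/location shifts lower it.
t = np.linspace(-1, 1, 100)
print(ccc(t, t))                        # 1.0
print(round(ccc(t, 0.5 * t + 0.1), 3))  # < 1 despite perfect correlation
```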
Related papers
- SCVCNet: Sliding cross-vector convolution network for cross-task and inter-individual-set EEG-based cognitive workload recognition [15.537230343119875]
This paper presents a generic approach to cognitive workload recognition that exploits common electroencephalogram (EEG) patterns across different human-machine tasks and individual sets.
We propose a neural network called SCVCNet, which eliminates task- and individual-set-related interference in EEGs by analyzing finer-grained frequency structures in the power spectral densities.
arXiv Detail & Related papers (2023-09-21T13:06:30Z)
- Connecting Multi-modal Contrastive Representations [50.26161419616139]
Multi-modal Contrastive Representation (MCR) learning aims to encode different modalities into a semantically shared space.
This paper proposes Connecting Multi-modal Contrastive Representations (C-MCR), a novel training-efficient method for learning MCR without paired data.
C-MCR achieves audio-visual state-of-the-art performance on audio-image retrieval, audio-visual source localization, and counterfactual audio-image recognition tasks.
arXiv Detail & Related papers (2023-05-22T09:44:39Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower word error rate (WER) with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture (a sketch of this hybrid objective appears after this list).
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- Decoupled Mixup for Generalized Visual Recognition [71.13734761715472]
We propose a novel "Decoupled-Mixup" method to train CNN models for visual recognition.
Our method decouples each image into discriminative and noise-prone regions, and then heterogeneously combines these regions to train CNN models.
Experiment results show the high generalization performance of our method on test data composed of unseen contexts.
arXiv Detail & Related papers (2022-10-26T15:21:39Z)
- Continuous Emotion Recognition using Visual-audio-linguistic information: A Technical Report for ABAW3 [15.077019278082673]
A cross-modal co-attention model is proposed for continuous emotion recognition.
Visual, audio, and linguistic blocks are used to learn the features of the multimodal input.
Cross-validation is carried out on the training and validation set.
arXiv Detail & Related papers (2022-03-24T12:18:06Z)
- TC-Net: Triple Context Network for Automated Stroke Lesion Segmentation [0.5482532589225552]
We propose a new network, Triple Context Network (TC-Net), with the capture of spatial contextual information at its core.
Our network is evaluated on the open dataset ATLAS, achieving the highest score of 0.594, a Hausdorff distance of 27.005 mm, and an average symmetric surface distance of 7.137 mm.
arXiv Detail & Related papers (2022-02-28T11:12:16Z)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech from the WSJ corpus show that the attention mechanism improves performance by almost 3% absolute over conventional temporal average pooling.
arXiv Detail & Related papers (2021-10-30T19:24:57Z)
- Attention-based Neural Beamforming Layers for Multi-channel Speech Recognition [17.009051842682677]
We propose a 2D Conv-Attention module which combines convolutional neural networks with attention for beamforming.
We apply self- and cross-attention to explicitly model the correlations within and between the input channels.
The results show a 3.8% relative improvement in WER for the proposed model over the baseline neural beamformer.
arXiv Detail & Related papers (2021-05-12T19:32:24Z)
- Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel spatial-temporal correlation and topology learning framework (CTL) to pursue discriminative and robust representations by modeling cross-scale spatial-temporal correlation.
CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
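For the hybrid CTC/attention architecture referenced in the streaming AV-ASR entry above, the usual training objective interpolates a CTC loss over encoder frames with an attention-decoder cross-entropy. A minimal sketch of that standard interpolation, where the weight and all tensor shapes are illustrative assumptions rather than that paper's settings:

```python
# Minimal sketch of a hybrid CTC/attention training objective.
# ctc_weight and the shapes below are illustrative assumptions.
import torch
import torch.nn.functional as F

def hybrid_loss(log_probs_ctc, enc_lens, dec_logits, targets, target_lens,
                ctc_weight: float = 0.3, blank: int = 0) -> torch.Tensor:
    """Interpolate a CTC loss on encoder outputs with a cross-entropy
    (attention-decoder) loss on decoder outputs."""
    # log_probs_ctc: (time, batch, vocab), log-softmaxed encoder frames
    ctc = F.ctc_loss(log_probs_ctc, targets, enc_lens, target_lens, blank=blank)
    # dec_logits: (batch, target_time, vocab) from the attention decoder
    att = F.cross_entropy(dec_logits.transpose(1, 2), targets)
    return ctc_weight * ctc + (1.0 - ctc_weight) * att

# Example with random tensors: batch 4, 50 encoder frames, 12 target tokens.
B, T, S, V = 4, 50, 12, 30
log_probs = torch.randn(T, B, V).log_softmax(-1)
dec_logits = torch.randn(B, S, V)
targets = torch.randint(1, V, (B, S))          # 0 is reserved for blank
enc_lens = torch.full((B,), T, dtype=torch.long)
tgt_lens = torch.full((B,), S, dtype=torch.long)
loss = hybrid_loss(log_probs, enc_lens, dec_logits, targets, tgt_lens)
```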