Attention Driven Fusion for Multi-Modal Emotion Recognition
- URL: http://arxiv.org/abs/2009.10991v2
- Date: Sat, 10 Oct 2020 22:25:20 GMT
- Title: Attention Driven Fusion for Multi-Modal Emotion Recognition
- Authors: Darshana Priyasad, Tharindu Fernando, Simon Denman, Clinton Fookes,
Sridha Sridharan
- Abstract summary: We present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification.
We use a SincNet layer, based on parameterized sinc functions with band-pass filters, to extract acoustic features from raw audio followed by a DCNN.
For text processing, we use two branches (a DCNN and a Bi-directional RNN followed by a DCNN) in parallel where cross attention is introduced to infer N-gram level correlations.
- Score: 39.295892047505816
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning has emerged as a powerful alternative to hand-crafted methods
for emotion recognition on combined acoustic and text modalities. Baseline
systems model emotion information in the text and acoustic modalities independently
using Deep Convolutional Neural Networks (DCNN) and Recurrent Neural Networks
(RNN), followed by applying attention, fusion, and classification. In this
paper, we present a deep learning-based approach to exploit and fuse text and
acoustic data for emotion classification. We utilize a SincNet layer, based on
parameterized sinc functions with band-pass filters, to extract acoustic
features from raw audio followed by a DCNN. This approach learns filter banks
tuned for emotion recognition and provides more effective features compared to
directly applying convolutions over the raw speech signal. For text processing,
we use two branches (a DCNN and a Bi-directional RNN followed by a DCNN) in
parallel where cross attention is introduced to infer the N-gram level
correlations on hidden representations received from the Bi-RNN. Following
existing state-of-the-art, we evaluate the performance of the proposed system
on the IEMOCAP dataset. Experimental results indicate that the proposed system
outperforms existing methods, achieving a 3.5% improvement in weighted accuracy.
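As a rough illustration of the SincNet front-end described above, here is a minimal PyTorch sketch of a band-pass sinc convolution; the filter count, kernel length, and cutoff initialization are illustrative assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn

class SincConv1d(nn.Module):
    """Minimal SincNet-style layer: each output channel is a learnable
    band-pass filter parameterized by its low cutoff and its bandwidth."""
    def __init__(self, out_channels=64, kernel_size=251, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        # Learnable cutoffs, initialized roughly evenly up to Nyquist.
        self.low_hz = nn.Parameter(torch.linspace(30, sample_rate / 2 - 200,
                                                  out_channels))
        self.band_hz = nn.Parameter(torch.full((out_channels,), 100.0))
        # Fixed symmetric time axis (in seconds) and Hamming window.
        n = torch.arange(-(kernel_size // 2), kernel_size // 2 + 1).float()
        self.register_buffer("t", (n / sample_rate).unsqueeze(0))  # (1, K)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):                       # x: (batch, 1, time)
        f1 = torch.abs(self.low_hz).unsqueeze(1)         # (C, 1) low cutoff
        f2 = f1 + torch.abs(self.band_hz).unsqueeze(1)   # (C, 1) high cutoff
        # Band-pass impulse response = difference of two sinc low-passes.
        band = (2 * f2 * torch.sinc(2 * f2 * self.t)
                - 2 * f1 * torch.sinc(2 * f1 * self.t))  # (C, K)
        filters = (band * self.window).unsqueeze(1)      # (C, 1, K)
        return nn.functional.conv1d(x, filters, padding=self.kernel_size // 2)
```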
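The cross attention between the two text branches could look roughly like the scaled dot-product sketch below; the abstract does not give the exact attention form, so the linear projections and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    """Sketch: features from the DCNN branch attend over the Bi-RNN
    hidden states (scaled dot-product attention, an assumption)."""
    def __init__(self, dim=128):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, cnn_feats, rnn_hidden):
        # cnn_feats: (batch, T_c, dim); rnn_hidden: (batch, T_r, dim)
        q, k, v = self.q(cnn_feats), self.k(rnn_hidden), self.v(rnn_hidden)
        attn = F.softmax(q @ k.transpose(1, 2) / q.size(-1) ** 0.5, dim=-1)
        return attn @ v                          # (batch, T_c, dim)
```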
Related papers
- Unsupervised Representations Improve Supervised Learning in Speech
Emotion Recognition [1.3812010983144798]
This study proposes an innovative approach that integrates self-supervised feature extraction with supervised classification for emotion recognition from small audio segments.
In the preprocessing step, we employed a self-supervised feature extractor, based on the Wav2Vec model, to capture acoustic features from audio data.
Then, the output feature maps of the preprocessing step are fed to a custom-designed Convolutional Neural Network (CNN)-based model to perform emotion classification.
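A minimal sketch of this pipeline, using the Hugging Face `transformers` Wav2Vec2 model; the checkpoint name, CNN head, and class count are illustrative, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

# Frozen self-supervised feature extractor (checkpoint name is illustrative).
extractor = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
extractor.eval()

# Small CNN classification head over the extracted feature maps (assumed).
classifier = nn.Sequential(
    nn.Conv1d(768, 128, kernel_size=5, padding=2), nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(128, 4),                  # e.g. 4 emotion classes
)

waveform = torch.randn(1, 16000)        # dummy 1 s segment at 16 kHz
with torch.no_grad():
    feats = extractor(waveform).last_hidden_state   # (1, frames, 768)
logits = classifier(feats.transpose(1, 2))          # convolve over time
```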
arXiv Detail & Related papers (2023-09-22T08:54:06Z)
- HCAM -- Hierarchical Cross Attention Model for Multi-modal Emotion Recognition [41.837538440839815]
We propose a hierarchical cross-attention model (HCAM) approach to multi-modal emotion recognition.
The input to the model consists of two modalities: i) audio data, processed through a learnable wav2vec approach, and ii) text data represented using a Bidirectional Encoder Representations from Transformers (BERT) model.
In order to incorporate contextual knowledge and information across the two modalities, the audio and text embeddings are combined using a co-attention layer.
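A rough sketch of such a co-attention fusion layer; the symmetric two-way form, pooling, and dimensions are assumptions, and the paper's exact layer may differ.

```python
import torch
import torch.nn as nn

class CoAttentionFusion(nn.Module):
    """Each modality attends over the other, then pooled outputs are fused."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.audio_to_text = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.text_to_audio = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio, text):     # (B, T_a, D), (B, T_t, D)
        a, _ = self.audio_to_text(audio, text, text)    # audio queries text
        t, _ = self.text_to_audio(text, audio, audio)   # text queries audio
        return torch.cat([a.mean(dim=1), t.mean(dim=1)], dim=-1)
```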
arXiv Detail & Related papers (2023-04-14T03:25:00Z)
- Spiking Neural Network Decision Feedback Equalization [70.3497683558609]
We propose an SNN-based equalizer with a feedback structure akin to the decision feedback equalizer (DFE).
We show that our approach clearly outperforms conventional linear equalizers for three different exemplary channels.
The proposed SNN with a decision feedback structure paves the way toward competitive, energy-efficient transceivers.
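For context, the classical feedback structure being mimicked can be sketched as an LMS-trained DFE; BPSK symbols and the tap counts are assumed for illustration, and the paper replaces these linear filters with an SNN.

```python
import numpy as np

def lms_dfe(rx, n_ff=8, n_fb=4, mu=0.01, train=None):
    """Classical LMS decision feedback equalizer: a feed-forward filter over
    received samples minus a feedback filter over past (decided) symbols."""
    ff, fb = np.zeros(n_ff), np.zeros(n_fb)
    past = np.zeros(n_fb)               # feedback register of past decisions
    out = np.zeros(len(rx))
    for k in range(n_ff, len(rx)):
        x = rx[k - n_ff:k][::-1]        # most recent samples first
        y = ff @ x - fb @ past          # equalizer output
        d = np.sign(y)                  # hard BPSK decision
        ref = train[k] if train is not None else d
        e = ref - y                     # LMS error
        ff += mu * e * x
        fb -= mu * e * past
        past = np.roll(past, 1)
        past[0] = ref
        out[k] = d
    return out
```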
arXiv Detail & Related papers (2022-11-09T09:19:15Z)
- Streaming Audio-Visual Speech Recognition with Alignment Regularization [69.30185151873707]
We propose a streaming AV-ASR system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture.
The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 dataset in offline and online setups, respectively.
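The hybrid objective typically interpolates the CTC loss with the attention decoder's cross-entropy; a minimal sketch, where the weight and tensor shapes are assumptions.

```python
import torch
import torch.nn as nn

ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)
ce_loss = nn.CrossEntropyLoss(ignore_index=-100)
lam = 0.3                               # CTC weight, an assumption

def hybrid_loss(enc_logp, enc_lens, ctc_tgts, tgt_lens, dec_logits, att_tgts):
    # enc_logp: (T, B, vocab) log-probs; dec_logits: (B, L, vocab)
    l_ctc = ctc_loss(enc_logp, ctc_tgts, enc_lens, tgt_lens)
    l_att = ce_loss(dec_logits.transpose(1, 2), att_tgts)
    return lam * l_ctc + (1 - lam) * l_att
```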
arXiv Detail & Related papers (2022-11-03T20:20:47Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
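A minimal sketch of the late-fusion head; embedding sizes, the MLP, and the class count are assumptions, and the transfer-learned encoders are omitted.

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    """Concatenate utterance-level speech and text embeddings, then classify."""
    def __init__(self, speech_dim=512, text_dim=768, n_classes=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(speech_dim + text_dim, 256), nn.ReLU(),
            nn.Dropout(0.3), nn.Linear(256, n_classes),
        )

    def forward(self, speech_emb, text_emb):
        return self.head(torch.cat([speech_emb, text_emb], dim=-1))
```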
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention [2.8017924048352576]
We propose a simple yet efficient neural network architecture to exploit both acoustic and lexical information from speech.
The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain both audio and text hidden representations.
Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset.
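The multi-scale convolution idea can be sketched as parallel 1-D convolutions with different kernel sizes; the scales and pooling here are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class MultiScaleCNN(nn.Module):
    """Parallel convolutions at several kernel sizes, max-pooled over time."""
    def __init__(self, in_dim=128, out_dim=64, scales=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv1d(in_dim, out_dim, k, padding=k // 2) for k in scales
        )

    def forward(self, x):               # x: (batch, in_dim, time)
        feats = [torch.relu(b(x)).max(dim=-1).values for b in self.branches]
        return torch.cat(feats, dim=-1)     # (batch, out_dim * len(scales))
```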
arXiv Detail & Related papers (2021-06-08T06:45:42Z)
- ScalingNet: extracting features from raw EEG data for emotion recognition [4.047737925426405]
We propose a novel convolutional layer that adaptively extracts effective, data-driven spectrogram-like features from raw EEG signals.
The proposed neural network architecture based on the scaling layer, referred to as ScalingNet, achieves state-of-the-art results on the established DEAP benchmark dataset.
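One way to read "adaptive spectrogram-like features" is a single learned kernel applied at several temporal scales, yielding a scale-by-time map; the sketch below is only that interpretation, not the published ScalingNet design.

```python
import torch
import torch.nn as nn

class ScalingLayer(nn.Module):
    """One learned kernel applied at dyadic dilations over raw EEG, stacked
    into a spectrogram-like (scale x time) map. An interpretation only."""
    def __init__(self, kernel_size=17, n_scales=8):
        super().__init__()
        self.kernel = nn.Parameter(torch.randn(1, 1, kernel_size) * 0.1)
        self.dilations = [2 ** i for i in range(n_scales)]

    def forward(self, x):               # x: (batch, 1, time), one EEG channel
        k = self.kernel.size(-1)
        maps = [nn.functional.conv1d(x, self.kernel, dilation=d,
                                     padding=d * (k // 2))
                for d in self.dilations]
        return torch.stack(maps, dim=2).squeeze(1)   # (batch, scales, time)
```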
arXiv Detail & Related papers (2021-02-07T08:54:27Z)
- Emotional EEG Classification using Connectivity Features and Convolutional Neural Networks [81.74442855155843]
We introduce a new classification system that utilizes brain connectivity with a CNN and validate its effectiveness via emotional video classification.
The level of concentration of the brain connectivity related to the emotional property of the target video is correlated with classification performance.
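A common connectivity feature of this kind is a channel-by-channel correlation matrix, which can be treated as an image-like input to the CNN; Pearson correlation is an assumption here, since the abstract does not name the measure.

```python
import numpy as np

def connectivity_matrix(eeg):
    """Channel-wise Pearson correlation as a simple connectivity feature."""
    return np.corrcoef(eeg)             # eeg: (channels, samples)

eeg = np.random.randn(32, 1000)         # dummy 32-channel recording
conn = connectivity_matrix(eeg)         # (32, 32) matrix, CNN-ready input
```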
arXiv Detail & Related papers (2021-01-18T13:28:08Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
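The cell-stacking step can be sketched as below; the cell body is a plain residual convolution stand-in, since the searched operations come from NAS and are not given in the abstract.

```python
import torch.nn as nn

class Cell(nn.Module):
    """Stand-in for a searched cell; real operations come from the search."""
    def __init__(self, channels):
        super().__init__()
        self.op = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.op(x)           # residual connection, an assumption

def derive_cnn(channels=64, n_cells=8, n_classes=1000):
    """Derive the final CNN by stacking the searched cell multiple times."""
    return nn.Sequential(
        nn.Conv2d(1, channels, 3, padding=1),       # stem (input is assumed
        *[Cell(channels) for _ in range(n_cells)],  # to be a 2-D feature map)
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(channels, n_classes),             # speaker classes (dummy)
    )
```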
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- A Novel Deep Learning Architecture for Decoding Imagined Speech from EEG [2.4063592468412267]
We present a novel architecture that employs a deep neural network (DNN) for classifying the words "in" and "cooperate".
Nine EEG channels, which best capture the underlying cortical activity, are chosen using the common spatial pattern (CSP) method.
We achieved accuracies comparable to state-of-the-art results.
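A related sketch using `mne`'s CSP decoder; note the paper selects nine physical channels via CSP, whereas this stand-in extracts nine CSP spatial components, and all data shapes are dummy.

```python
import numpy as np
from mne.decoding import CSP

# Dummy epochs: (trials, channels, samples) for the two imagined words.
epochs = np.random.randn(40, 64, 512)
labels = np.repeat([0, 1], 20)          # "in" vs. "cooperate"

csp = CSP(n_components=9)               # nine spatial components (stand-in)
feats = csp.fit_transform(epochs, labels)   # (trials, 9) power features
# `feats` would then be fed to the DNN classifier described above.
```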
arXiv Detail & Related papers (2020-03-19T00:57:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.