Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention
- URL: http://arxiv.org/abs/2106.04133v1
- Date: Tue, 8 Jun 2021 06:45:42 GMT
- Title: Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention
- Authors: Zixuan Peng, Yu Lu, Shengfeng Pan, Yunfeng Liu
- Abstract summary: We propose a simple yet efficient neural network architecture to exploit both acoustic and lexical information from speech.
The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain both audio and text hidden representations.
Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset.
- Score: 2.8017924048352576
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotion recognition from speech is a challenging task. Recent
advances in deep learning have established the bi-directional recurrent
neural network (Bi-RNN) with an attention mechanism as a standard method for
speech emotion recognition: multi-modal features (audio and text) are
extracted and attended over, then fused for downstream emotion classification.
In this paper, we propose a simple yet efficient neural network architecture
to exploit both acoustic and lexical information from speech. The proposed
framework uses multi-scale convolutional layers (MSCNN) to obtain both audio
and text hidden representations. Then, a statistical pooling unit (SPU) is
used to further extract the features in each modality. Besides, an attention
module can be built on top of the MSCNN-SPU (audio) and MSCNN (text) to
further improve the performance. Extensive experiments show that the proposed
model outperforms previous state-of-the-art methods on the IEMOCAP dataset
with four emotion categories (i.e., angry, happy, sad, and neutral) in both
weighted accuracy (WA) and unweighted accuracy (UA), with improvements of
5.0% and 5.2% respectively under the ASR setting.
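As a concrete illustration of the pipeline described in the abstract, below is a minimal PyTorch sketch of a multi-scale CNN front-end with a statistical pooling unit and a simple gate over the fused audio-text features. The kernel sizes, channel counts, feature dimensions, and the sigmoid gate (standing in for the paper's attention module) are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of the MSCNN + SPU idea; hyperparameters are assumptions.
import torch
import torch.nn as nn


class MSCNN(nn.Module):
    """Parallel 1-D convolutions with different kernel sizes (multi-scale)."""

    def __init__(self, in_dim, out_channels=64, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(in_dim, out_channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):                  # x: (batch, time, in_dim)
        x = x.transpose(1, 2)              # -> (batch, in_dim, time)
        feats = [torch.relu(b(x)) for b in self.branches]
        return torch.cat(feats, dim=1)     # -> (batch, channels, time)


class StatisticalPoolingUnit(nn.Module):
    """Pool over time with simple statistics (here: mean, std, max)."""

    def forward(self, x):                  # x: (batch, channels, time)
        mean = x.mean(dim=-1)
        std = x.std(dim=-1)
        mx = x.max(dim=-1).values
        return torch.cat([mean, std, mx], dim=-1)


class AudioTextClassifier(nn.Module):
    """MSCNN-SPU on audio, MSCNN on text, gated fusion, classification head."""

    def __init__(self, audio_dim=40, text_dim=300, n_classes=4):
        super().__init__()
        self.audio_mscnn = MSCNN(audio_dim)
        self.text_mscnn = MSCNN(text_dim)
        self.spu = StatisticalPoolingUnit()
        audio_feat = 64 * 3 * 3            # channels * branches * statistics = 576
        text_feat = 64 * 3                 # channels * branches = 192
        self.gate = nn.Linear(audio_feat + text_feat, audio_feat + text_feat)
        self.classifier = nn.Linear(audio_feat + text_feat, n_classes)

    def forward(self, audio, text):
        a = self.spu(self.audio_mscnn(audio))           # fixed-size audio vector
        t = self.text_mscnn(text).mean(dim=-1)          # pooled text vector
        fused = torch.cat([a, t], dim=-1)
        fused = fused * torch.sigmoid(self.gate(fused))  # simple gated reweighting
        return self.classifier(fused)                    # logits: angry/happy/sad/neutral


# Example: batch of 2 utterances, 120 frames of 40-dim acoustic features,
# 30 word embeddings of 300 dims each.
model = AudioTextClassifier()
logits = model(torch.randn(2, 120, 40), torch.randn(2, 30, 300))
print(logits.shape)  # torch.Size([2, 4])
```

The idea is that parallel convolutions with different kernel widths capture patterns at several temporal scales, while the statistical pooling turns variable-length sequences into fixed-size vectors before fusion.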
Related papers
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech
Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
arXiv Detail & Related papers (2023-10-19T16:02:53Z) - M2FNet: Multi-modal Fusion Network for Emotion Recognition in
Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
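For the triplet objective mentioned above, the snippet below is a hedged sketch of a margin-based triplet loss whose margin adapts to how confusable the negative is. The specific adaptation rule (scaling the margin by anchor-negative cosine similarity) is an assumption for illustration, not M2FNet's exact formulation.

```python
# Sketch of a triplet loss with an adaptive margin; the adaptation rule is assumed.
import torch
import torch.nn.functional as F


def adaptive_margin_triplet_loss(anchor, positive, negative,
                                 base_margin=0.2, scale=0.3):
    """anchor/positive/negative: (batch, dim) embeddings of emotion features."""
    d_ap = F.pairwise_distance(anchor, positive)   # same-emotion distance
    d_an = F.pairwise_distance(anchor, negative)   # different-emotion distance
    # Harder negatives (more similar to the anchor) get a larger margin.
    hardness = F.cosine_similarity(anchor, negative).clamp(min=0.0)
    margin = base_margin + scale * hardness
    return F.relu(d_ap - d_an + margin).mean()


# Example with random 128-dim embeddings for a batch of 8 utterances.
a, p, n = (torch.randn(8, 128) for _ in range(3))
print(adaptive_margin_triplet_loss(a, p, n))
```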
arXiv Detail & Related papers (2022-06-05T14:18:58Z) - Speech Emotion Recognition with Global-Aware Fusion on Multi-scale
Feature Representation [5.20970006627454]
Speech Emotion Recognition (SER) is a fundamental task to predict the emotion label from speech data.
Recent works mostly focus on using convolutional neural networks (CNNs) to learn local attention maps on fixed-scale feature representations.
We propose a novel GLobal-Aware Multi-scale (GLAM) neural network to learn multi-scale feature representations with a global-aware fusion module.
arXiv Detail & Related papers (2022-04-12T07:03:04Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Speech Emotion Recognition with Co-Attention based Multi-level Acoustic
Information [21.527784717450885]
Speech Emotion Recognition aims to help machines understand a speaker's subjective emotion from audio information alone.
We propose an end-to-end speech emotion recognition system using multi-level acoustic information with a newly designed co-attention module.
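A rough sketch of how multi-level acoustic features might be fused with a co-attention-style weighting is given below. The choice of feature levels (e.g., MFCC, spectrogram, pretrained speech embeddings) and this particular scoring scheme are assumptions for illustration, not the paper's actual co-attention module.

```python
# Co-attention-style fusion over several acoustic feature levels (illustrative).
import torch
import torch.nn as nn


class CoAttentionFusion(nn.Module):
    def __init__(self, dims=(40, 128, 768), hidden=256, n_classes=4):
        super().__init__()
        # Project each feature level (e.g. MFCC / spectrogram / embeddings) to a shared space.
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.score = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, levels):             # list of (batch, dim_i) utterance vectors
        h = torch.stack([p(x) for p, x in zip(self.proj, levels)], dim=1)  # (B, L, H)
        # Each level is scored in the context of the others (mean of the rest).
        context = (h.sum(dim=1, keepdim=True) - h) / (h.size(1) - 1)
        weights = torch.softmax(self.score(torch.tanh(h + context)), dim=1)
        fused = (weights * h).sum(dim=1)   # attention-weighted sum over levels
        return self.classifier(fused)


model = CoAttentionFusion()
logits = model([torch.randn(2, 40), torch.randn(2, 128), torch.randn(2, 768)])
print(logits.shape)  # torch.Size([2, 4])
```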
arXiv Detail & Related papers (2022-03-29T08:17:28Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
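A generic late-fusion setup matching this description might look like the sketch below, where two pretrained encoders act as utterance-level feature extractors and a small head classifies the concatenated embeddings. The encoder classes and dimensions are placeholders, not the paper's actual models.

```python
# Late fusion of a speech encoder and a text encoder (placeholder encoders).
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    def __init__(self, audio_encoder, text_encoder,
                 audio_dim=512, text_dim=768, n_classes=4):
        super().__init__()
        self.audio_encoder = audio_encoder   # e.g. a transfer-learned speaker-recognition model
        self.text_encoder = text_encoder     # e.g. a fine-tuned BERT-style text model
        self.head = nn.Sequential(
            nn.Linear(audio_dim + text_dim, 256),
            nn.ReLU(),
            nn.Linear(256, n_classes),
        )

    def forward(self, audio, text):
        a = self.audio_encoder(audio)        # (batch, audio_dim) utterance embedding
        t = self.text_encoder(text)          # (batch, text_dim) utterance embedding
        return self.head(torch.cat([a, t], dim=-1))   # late fusion of the two embeddings


# Dummy stand-in encoders so the sketch runs end to end.
model = LateFusionClassifier(nn.Linear(40, 512), nn.Linear(300, 768))
print(model(torch.randn(2, 40), torch.randn(2, 300)).shape)  # torch.Size([2, 4])
```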
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Speech Command Recognition in Computationally Constrained Environments
with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a better representation of features in both input and hidden layers.
This richer representation yields improved recognition accuracy, as shown by extensive experiments on the Google Speech Commands (GSC) and Synthetic Speech Commands (SSC) datasets.
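The snippet below sketches a generic quadratic (second-order) layer in the spirit of the Taylor-expansion / quadratic-form idea: each output combines a linear term with a term in the squared inputs. It is only an illustration of the general principle, not the paper's self-organized operational layer.

```python
# Generic quadratic (second-order) layer; not the paper's exact formulation.
import torch
import torch.nn as nn


class QuadraticLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)             # first-order term
        self.quad = nn.Linear(in_dim, out_dim, bias=False)   # weights for squared inputs

    def forward(self, x):
        # y = W1 x + b + W2 (x * x): a cheap second-order expansion of the input.
        return self.linear(x) + self.quad(x * x)


layer = QuadraticLayer(64, 32)
print(layer(torch.randn(4, 64)).shape)  # torch.Size([4, 32])
```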
arXiv Detail & Related papers (2020-11-23T14:40:18Z) - Attention Driven Fusion for Multi-Modal Emotion Recognition [39.295892047505816]
We present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification.
We use a SincNet layer, based on parameterized sinc functions implementing band-pass filters, to extract acoustic features from raw audio, followed by a DCNN.
For text processing, we use two branches (a DCNN and a Bi-directional RNN followed by a DCNN) in parallel, where cross attention is introduced to infer the N-gram level correlations.
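For reference, a simplified SincNet-style first layer can be sketched as below: each filter is a learnable band-pass defined only by its cut-off frequencies, realized as a difference of two windowed sinc low-pass kernels. The filter count, kernel length, and initialization are assumptions, and details of the original SincNet (mel-scale initialization, minimum-band constraints) are omitted.

```python
# Simplified SincNet-style band-pass layer over raw waveforms (illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SincConv(nn.Module):
    def __init__(self, n_filters=32, kernel_size=101, sample_rate=16000):
        super().__init__()
        self.kernel_size = kernel_size
        # Learnable cut-off frequencies in Hz (low cut and bandwidth per filter).
        self.low_hz = nn.Parameter(torch.linspace(30, 4000, n_filters).unsqueeze(1))
        self.band_hz = nn.Parameter(torch.full((n_filters, 1), 400.0))
        # Time axis of the kernel, centered at zero, in seconds.
        t = (torch.arange(kernel_size) - kernel_size // 2) / sample_rate
        self.register_buffer("t", t.unsqueeze(0))

    def forward(self, x):                    # x: (batch, 1, samples) raw waveform
        low = torch.abs(self.low_hz)
        high = low + torch.abs(self.band_hz)

        def lowpass(f):                      # ideal low-pass impulse response
            return 2 * f * torch.sinc(2 * f * self.t)

        # Band-pass = difference of two sinc low-pass filters, windowed and normalized.
        kernels = lowpass(high) - lowpass(low)
        kernels = kernels * torch.hamming_window(self.kernel_size, device=x.device)
        kernels = kernels / kernels.norm(dim=1, keepdim=True)
        return F.conv1d(x, kernels.unsqueeze(1), padding=self.kernel_size // 2)


layer = SincConv()
print(layer(torch.randn(2, 1, 16000)).shape)  # torch.Size([2, 32, 16000])
```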
arXiv Detail & Related papers (2020-09-23T08:07:58Z) - AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z) - Deep Speaker Embeddings for Far-Field Speaker Recognition on Short
Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z) - An End-to-End Visual-Audio Attention Network for Emotion Recognition in
User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)