Speech Emotion Recognition with Global-Aware Fusion on Multi-scale
Feature Representation
- URL: http://arxiv.org/abs/2204.05571v1
- Date: Tue, 12 Apr 2022 07:03:04 GMT
- Title: Speech Emotion Recognition with Global-Aware Fusion on Multi-scale
Feature Representation
- Authors: Wenjing Zhu, Xiang Li
- Abstract summary: Speech Emotion Recognition (SER) is a fundamental task to predict the emotion label from speech data.
Recent works mostly focus on using convolutional neural networks (CNNs) to learn local attention maps on a fixed-scale feature representation.
We propose a novel GLobal-Aware Multi-scale (GLAM) neural network to learn a multi-scale feature representation with a global-aware fusion module.
- Score: 5.20970006627454
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech Emotion Recognition (SER) is a fundamental task to predict the emotion
label from speech data. Recent works mostly focus on using convolutional neural
networks (CNNs) to learn local attention maps on a fixed-scale feature
representation by viewing time-varying spectral features as images. However,
rich emotional features at different scales and important global information
cannot be well captured due to the limits of existing CNNs for SER. In this
paper, we propose a novel GLobal-Aware Multi-scale (GLAM) neural network
(the code is available at https://github.com/lixiangucas01/GLAM) to learn a
multi-scale feature representation with a global-aware fusion module that
attends to emotional information. Specifically, GLAM iteratively utilizes
multiple convolutional kernels with different scales to learn multiple feature
representations. Then, instead of using attention-based methods, a simple but
effective global-aware fusion module is applied to capture the most important
emotional information globally. Experiments on the benchmark corpus IEMOCAP
over four emotions demonstrate the superiority of our proposed model, with 2.5%
to 4.5% improvements on four common metrics compared to previous
state-of-the-art approaches.
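The abstract only sketches the architecture, so the following is a minimal PyTorch sketch of the two ideas it names: parallel convolutional kernels at several scales, and a global-aware fusion step used in place of a local attention map. The kernel sizes, channel widths, gating-style fusion rule, and log-Mel input shape are illustrative assumptions, not the authors' design; the official implementation is at the GitHub link above.

```python
# Minimal sketch only; not the authors' GLAM code. Kernel sizes, channel
# widths, and the exact fusion rule are assumptions for illustration.
import torch
import torch.nn as nn


class MultiScaleBlock(nn.Module):
    """Applies convolutional kernels of several scales in parallel."""

    def __init__(self, in_ch: int, out_ch: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, k, padding=k // 2),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            )
            for k in kernel_sizes
        ])

    def forward(self, x):
        # Concatenate the per-scale feature maps along the channel axis.
        return torch.cat([branch(x) for branch in self.branches], dim=1)


class GlobalAwareFusion(nn.Module):
    """One plausible reading of 'global-aware fusion': a pooled global
    descriptor gates every channel, so globally salient emotional cues are
    emphasised without computing a per-position attention map."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),           # global summary of the utterance
            nn.Conv2d(channels, channels, 1),  # mix the global descriptor
            nn.Sigmoid(),                      # per-channel gate in [0, 1]
        )

    def forward(self, x):
        return x * self.gate(x)


class GLAMSketch(nn.Module):
    """Multi-scale features + global-aware fusion + 4-way emotion classifier."""

    def __init__(self, n_classes: int = 4):
        super().__init__()
        self.features = MultiScaleBlock(in_ch=1, out_ch=16)  # 3 scales -> 48 ch
        self.fusion = GlobalAwareFusion(channels=48)
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(48, n_classes)
        )

    def forward(self, spectrogram):
        # spectrogram: (batch, 1, n_mels, frames), e.g. a log-Mel spectrogram.
        return self.classifier(self.fusion(self.features(spectrogram)))


# Example: 4-class logits for a batch of two spectrograms (64 mels, 300 frames).
logits = GLAMSketch()(torch.randn(2, 1, 64, 300))
print(logits.shape)  # torch.Size([2, 4])
```

Here the fusion module summarises the whole utterance with global pooling and re-weights every channel, which is one simple way to let globally important emotional information dominate without a local attention map; the paper's actual fusion rule may differ.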
Related papers
- A Contextualized Real-Time Multimodal Emotion Recognition for
Conversational Agents using Graph Convolutional Networks in Reinforcement
Learning [0.800062359410795]
We present a novel paradigm for contextualized Emotion Recognition using Graph Convolutional Network with Reinforcement Learning (conER-GRL).
Conversations are partitioned into smaller groups of utterances for effective extraction of contextual information.
The system uses Gated Recurrent Units (GRU) to extract multimodal features from these groups of utterances.
arXiv Detail & Related papers (2023-10-24T14:31:17Z)
- TOPIQ: A Top-down Approach from Semantics to Distortions for Image Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower level features.
arXiv Detail & Related papers (2023-08-06T09:08:37Z)
- EMERSK -- Explainable Multimodal Emotion Recognition with Situational Knowledge [0.0]
We present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK).
EMERSK is a general system for human emotion recognition and explanation using visual information.
Our system can handle multiple modalities, including facial expressions, posture, and gait in a flexible and modular manner.
arXiv Detail & Related papers (2023-06-14T17:52:37Z)
- GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion Causality for Speech Emotion Recognition [14.700043991797537]
We propose a Gated Multi-scale Temporal Convolutional Network (GM-TCNet) to construct a novel emotional causality representation learning component.
GM-TCNet deploys a novel emotional causality representation learning component to capture the dynamics of emotion across the time domain.
Our model maintains the highest performance in most cases compared to state-of-the-art techniques.
arXiv Detail & Related papers (2022-10-28T02:00:40Z)
- Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance with significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z)
- Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z)
- Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention [2.8017924048352576]
We propose a simple yet efficient neural network architecture to exploit both acoustic and lexical information from speech.
The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain both audio and text hidden representations.
Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset.
arXiv Detail & Related papers (2021-06-08T06:45:42Z)
- Encoder Fusion Network with Co-Attention Embedding for Referring Image Segmentation [87.01669173673288]
We propose an encoder fusion network (EFN), which transforms the visual encoder into a multi-modal feature learning network.
A co-attention mechanism is embedded in the EFN to realize the parallel update of multi-modal features.
The experiment results on four benchmark datasets demonstrate that the proposed approach achieves the state-of-the-art performance without any post-processing.
arXiv Detail & Related papers (2021-05-05T02:27:25Z)
- The Mind's Eye: Visualizing Class-Agnostic Features of CNNs [92.39082696657874]
We propose an approach to visually interpret CNN features given a set of images by creating corresponding images that depict the most informative features of a specific layer.
Our method uses a dual-objective activation and distance loss, without requiring a generator network nor modifications to the original model.
arXiv Detail & Related papers (2021-01-29T07:46:39Z)
- EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's Principle [71.47160118286226]
We present EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images.
Motivated by Frege's Context Principle from psychology, our approach combines three interpretations of context for emotion recognition.
We report an Average Precision (AP) score of 35.48 across 26 classes, which is an improvement of 7-8 over prior methods.
arXiv Detail & Related papers (2020-03-14T19:55:21Z)
- An End-to-End Visual-Audio Attention Network for Emotion Recognition in User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs).
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)