Speech Emotion Recognition with Global-Aware Fusion on Multi-scale
Feature Representation
- URL: http://arxiv.org/abs/2204.05571v1
- Date: Tue, 12 Apr 2022 07:03:04 GMT
- Title: Speech Emotion Recognition with Global-Aware Fusion on Multi-scale
Feature Representation
- Authors: Wenjing Zhu, Xiang Li
- Abstract summary: Speech Emotion Recognition (SER) is a fundamental task of predicting the emotion label from speech data.
Recent works mostly focus on using convolutional neural networks (CNNs) to learn local attention maps on fixed-scale feature representations.
We propose a novel GLobal-Aware Multi-scale (GLAM) neural network to learn multi-scale feature representations with a global-aware fusion module.
- Score: 5.20970006627454
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech Emotion Recognition (SER) is a fundamental task of predicting the
emotion label from speech data. Recent works mostly focus on using convolutional
neural networks (CNNs) to learn local attention maps on fixed-scale feature
representations by viewing time-varied spectral features as images. However,
rich emotional features at different scales and important global information
cannot be well captured due to the limits of existing CNNs for SER. In this
paper, we propose a novel GLobal-Aware Multi-scale (GLAM) neural network
(the code is available at https://github.com/lixiangucas01/GLAM) to learn
multi-scale feature representations with a global-aware fusion module that
attends to emotional information. Specifically, GLAM iteratively applies
multiple convolutional kernels with different scales to learn multiple feature
representations. Then, instead of using attention-based methods, a simple but
effective global-aware fusion module is applied to capture the most important
emotional information globally. Experiments on the benchmark corpus IEMOCAP
over four emotions demonstrate the superiority of our proposed model, with 2.5%
to 4.5% improvements on four common metrics compared to previous
state-of-the-art approaches.
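The two ideas in the abstract — kernels of several scales producing parallel feature maps, then a fusion step driven by a global statistic rather than local attention — can be sketched as follows. This is a minimal illustrative sketch, not the GLAM implementation: the random kernels stand in for learned ones, and the softmax-over-global-means fusion is an assumption about what a "global-aware" weighting could look like.

```python
# Hypothetical sketch: multi-scale feature maps over a log-Mel spectrogram,
# fused by weights derived from each map's global statistic.
import numpy as np

def conv2d_valid(x, k):
    """Naive 2-D valid cross-correlation of spectrogram x with kernel k."""
    th, tw = k.shape
    h, w = x.shape
    out = np.empty((h - th + 1, w - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + th, j:j + tw] * k)
    return out

def multi_scale_features(spec, scales=(3, 5, 7), seed=0):
    """One ReLU feature map per kernel scale (random kernels as stand-ins)."""
    rng = np.random.default_rng(seed)
    maps = []
    for s in scales:
        k = rng.standard_normal((s, s)) / s
        maps.append(np.maximum(conv2d_valid(spec, k), 0.0))
    return maps

def global_aware_fusion(maps):
    """Weight each scale by a softmax over its global mean, then combine
    the time-pooled vectors (an assumed form of global-aware fusion)."""
    stats = np.array([m.mean() for m in maps])
    weights = np.exp(stats) / np.exp(stats).sum()
    vecs = [m.mean(axis=0) for m in maps]          # pool over the time axis
    size = min(v.shape[0] for v in vecs)           # align frequency lengths
    return sum(w * v[:size] for w, v in zip(weights, vecs))

spec = np.random.default_rng(1).standard_normal((40, 32))  # (time, mel bins)
fused = global_aware_fusion(multi_scale_features(spec))
print(fused.shape)  # (26,)
```

The fused vector would then feed a small classifier head over the four IEMOCAP emotion classes; in the real model the kernels and the fusion weights are learned end to end.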
Related papers
- Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.
We propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.
We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75%.
arXiv Detail & Related papers (2025-01-06T14:31:25Z) - Enhanced Speech Emotion Recognition with Efficient Channel Attention Guided Deep CNN-BiLSTM Framework [0.7864304771129751]
Speech emotion recognition (SER) is crucial for enhancing affective computing and enriching the domain of human-computer interaction.
We propose a lightweight SER architecture that integrates attention-based local feature blocks (ALFBs) to capture high-level relevant feature vectors from speech signals.
We also incorporate a global feature block (GFB) technique to capture sequential, global information and long-term dependencies in speech signals.
arXiv Detail & Related papers (2024-12-13T09:55:03Z) - TOPIQ: A Top-down Approach from Semantics to Distortions for Image
Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower level features.
arXiv Detail & Related papers (2023-08-06T09:08:37Z) - EMERSK -- Explainable Multimodal Emotion Recognition with Situational
Knowledge [0.0]
We present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK)
EMERSK is a general system for human emotion recognition and explanation using visual information.
Our system can handle multiple modalities, including facial expressions, posture, and gait in a flexible and modular manner.
arXiv Detail & Related papers (2023-06-14T17:52:37Z) - GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion
Causality for Speech Emotion Recognition [14.700043991797537]
We propose a Gated Multi-scale Temporal Convolutional Network (GM-TCNet) to construct a novel emotional causality representation learning component.
GM-TCNet deploys a novel emotional causality representation learning component to capture the dynamics of emotion across the time domain.
Our model maintains the highest performance in most cases compared to state-of-the-art techniques.
arXiv Detail & Related papers (2022-10-28T02:00:40Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance by significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z) - Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention [2.8017924048352576]
We propose a simple yet efficient neural network architecture to exploit both acoustic and lexical information from speech.
The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain both audio and text hidden representations.
Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset.
arXiv Detail & Related papers (2021-06-08T06:45:42Z) - The Mind's Eye: Visualizing Class-Agnostic Features of CNNs [92.39082696657874]
We propose an approach to visually interpret CNN features given a set of images by creating corresponding images that depict the most informative features of a specific layer.
Our method uses a dual-objective activation and distance loss, without requiring a generator network nor modifications to the original model.
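A dual-objective activation and distance loss of this general shape can be illustrated with a toy gradient-ascent loop. This is a deliberately simplified sketch: a single linear direction `w` stands in for a real CNN unit, and the weight `lam` and step size are illustrative assumptions, not the paper's configuration.

```python
# Toy activation maximization with a distance penalty:
# maximize  w·x - lam * ||x - ref||^2  by gradient ascent on x.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(64)    # stand-in "feature direction" of one unit
ref = rng.standard_normal(64)  # reference image (flattened), anchors x

x = np.zeros(64)               # synthesized image, updated in place
lam, lr = 0.1, 0.5
for _ in range(200):
    grad = w - 2 * lam * (x - ref)   # gradient of the dual objective
    x += lr * grad

# For this quadratic objective the optimum is available in closed form,
# so we can check that the loop converged: x* = ref + w / (2 * lam).
closed_form = ref + w / (2 * lam)
print(np.allclose(x, closed_form, atol=1e-3))  # True
```

With a real network the gradient of the activation term comes from backpropagation rather than a fixed vector, but the trade-off between driving the unit's activation up and staying near the reference is the same.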
arXiv Detail & Related papers (2021-01-29T07:46:39Z) - EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's
Principle [71.47160118286226]
We present EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images.
Motivated by Frege's Context Principle from psychology, our approach combines three interpretations of context for emotion recognition.
We report an Average Precision (AP) score of 35.48 across 26 classes, an improvement of 7-8 points over prior methods.
arXiv Detail & Related papers (2020-03-14T19:55:21Z) - An End-to-End Visual-Audio Attention Network for Emotion Recognition in
User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs)
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.