Speech Emotion Recognition with Global-Aware Fusion on Multi-scale
Feature Representation
- URL: http://arxiv.org/abs/2204.05571v1
- Date: Tue, 12 Apr 2022 07:03:04 GMT
- Title: Speech Emotion Recognition with Global-Aware Fusion on Multi-scale
Feature Representation
- Authors: Wenjing Zhu, Xiang Li
- Abstract summary: Speech Emotion Recognition (SER) is a fundamental task of predicting the emotion label from speech data.
Recent works mostly focus on using convolutional neural networks (CNNs) to learn local attention maps on fixed-scale feature representations.
We propose a novel GLobal-Aware Multi-scale (GLAM) neural network to learn multi-scale feature representations with a global-aware fusion module.
- Score: 5.20970006627454
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Speech Emotion Recognition (SER) is a fundamental task of predicting the
emotion label from speech data. Recent works mostly focus on using convolutional
neural networks (CNNs) to learn local attention maps on fixed-scale feature
representations by viewing time-varied spectral features as images. However,
rich emotional features at different scales and important global information
cannot be well captured due to the limits of existing CNNs for SER. In this
paper, we propose a novel GLobal-Aware Multi-scale (GLAM) neural network
(the code is available at https://github.com/lixiangucas01/GLAM) to learn
multi-scale feature representations with a global-aware fusion module that
attends to emotional information. Specifically, GLAM iteratively applies
multiple convolutional kernels with different scales to learn multiple feature
representations. Then, instead of using attention-based methods, a simple but
effective global-aware fusion module is applied to capture the most important
emotional information globally. Experiments on the benchmark corpus IEMOCAP
over four emotions demonstrate the superiority of our proposed model, with 2.5%
to 4.5% improvements on four common metrics compared to previous
state-of-the-art approaches.
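The two ideas in the abstract — kernels of several scales producing parallel feature maps, then a fusion step driven by a global statistic rather than local attention — can be sketched as follows. This is a minimal illustrative sketch, not the GLAM implementation: the random kernels stand in for learned ones, and the softmax-over-global-means fusion is an assumption about what a "global-aware" weighting could look like.

```python
# Hypothetical sketch: multi-scale feature maps over a log-Mel spectrogram,
# fused by weights derived from each map's global statistic.
import numpy as np

def conv2d_valid(x, k):
    """Naive 2-D valid cross-correlation of spectrogram x with kernel k."""
    th, tw = k.shape
    h, w = x.shape
    out = np.empty((h - th + 1, w - tw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + th, j:j + tw] * k)
    return out

def multi_scale_features(spec, scales=(3, 5, 7), seed=0):
    """One ReLU feature map per kernel scale (random kernels as stand-ins)."""
    rng = np.random.default_rng(seed)
    maps = []
    for s in scales:
        k = rng.standard_normal((s, s)) / s
        maps.append(np.maximum(conv2d_valid(spec, k), 0.0))
    return maps

def global_aware_fusion(maps):
    """Weight each scale by a softmax over its global mean, then combine
    the time-pooled vectors (an assumed form of global-aware fusion)."""
    stats = np.array([m.mean() for m in maps])
    weights = np.exp(stats) / np.exp(stats).sum()
    vecs = [m.mean(axis=0) for m in maps]          # pool over the time axis
    size = min(v.shape[0] for v in vecs)           # align frequency lengths
    return sum(w * v[:size] for w, v in zip(weights, vecs))

spec = np.random.default_rng(1).standard_normal((40, 32))  # (time, mel bins)
fused = global_aware_fusion(multi_scale_features(spec))
print(fused.shape)  # (26,)
```

The fused vector would then feed a small classifier head over the four IEMOCAP emotion classes; in the real model the kernels and the fusion weights are learned end to end.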
Related papers
- Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.
We propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.
We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75%.
arXiv Detail & Related papers (2025-01-06T14:31:25Z) - Enhanced Speech Emotion Recognition with Efficient Channel Attention Guided Deep CNN-BiLSTM Framework [0.7864304771129751]
Speech emotion recognition (SER) is crucial for enhancing affective computing and enriching the domain of human-computer interaction.
We propose a lightweight SER architecture that integrates attention-based local feature blocks (ALFBs) to capture high-level relevant feature vectors from speech signals.
We also incorporate a global feature block (GFB) technique to capture sequential, global information and long-term dependencies in speech signals.
arXiv Detail & Related papers (2024-12-13T09:55:03Z) - TOPIQ: A Top-down Approach from Semantics to Distortions for Image
Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower level features.
arXiv Detail & Related papers (2023-08-06T09:08:37Z) - EMERSK -- Explainable Multimodal Emotion Recognition with Situational
Knowledge [0.0]
We present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK)
EMERSK is a general system for human emotion recognition and explanation using visual information.
Our system can handle multiple modalities, including facial expressions, posture, and gait in a flexible and modular manner.
arXiv Detail & Related papers (2023-06-14T17:52:37Z) - GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion
Causality for Speech Emotion Recognition [14.700043991797537]
We propose a Gated Multi-scale Temporal Convolutional Network (GM-TCNet) to construct a novel emotional causality representation learning component.
GM-TCNet deploys a novel emotional causality representation learning component to capture the dynamics of emotion across the time domain.
Our model maintains the highest performance in most cases compared to state-of-the-art techniques.
arXiv Detail & Related papers (2022-10-28T02:00:40Z) - Sign Language Recognition via Skeleton-Aware Multi-Model Ensemble [71.97020373520922]
Sign language is commonly used by deaf or mute people to communicate.
We propose a novel Multi-modal Framework with a Global Ensemble Model (GEM) for isolated Sign Language Recognition (SLR).
Our proposed SAM-SLR-v2 framework is exceedingly effective and achieves state-of-the-art performance by significant margins.
arXiv Detail & Related papers (2021-10-12T16:57:18Z) - Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z) - Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention [2.8017924048352576]
We propose a simple yet efficient neural network architecture to exploit both acoustic and lexical information from speech.
The proposed framework uses multi-scale convolutional layers (MSCNN) to obtain both audio and text hidden representations.
Extensive experiments show that the proposed model outperforms previous state-of-the-art methods on the IEMOCAP dataset.
arXiv Detail & Related papers (2021-06-08T06:45:42Z) - The Mind's Eye: Visualizing Class-Agnostic Features of CNNs [92.39082696657874]
We propose an approach to visually interpret CNN features given a set of images by creating corresponding images that depict the most informative features of a specific layer.
Our method uses a dual-objective activation and distance loss, without requiring a generator network nor modifications to the original model.
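A dual-objective activation and distance loss of this general shape can be illustrated with a toy gradient-ascent loop. This is a deliberately simplified sketch: a single linear direction `w` stands in for a real CNN unit, and the weight `lam` and step size are illustrative assumptions, not the paper's configuration.

```python
# Toy activation maximization with a distance penalty:
# maximize  w·x - lam * ||x - ref||^2  by gradient ascent on x.
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(64)    # stand-in "feature direction" of one unit
ref = rng.standard_normal(64)  # reference image (flattened), anchors x

x = np.zeros(64)               # synthesized image, updated in place
lam, lr = 0.1, 0.5
for _ in range(200):
    grad = w - 2 * lam * (x - ref)   # gradient of the dual objective
    x += lr * grad

# For this quadratic objective the optimum is available in closed form,
# so we can check that the loop converged: x* = ref + w / (2 * lam).
closed_form = ref + w / (2 * lam)
print(np.allclose(x, closed_form, atol=1e-3))  # True
```

With a real network the gradient of the activation term comes from backpropagation rather than a fixed vector, but the trade-off between driving the unit's activation up and staying near the reference is the same.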
arXiv Detail & Related papers (2021-01-29T07:46:39Z) - EmotiCon: Context-Aware Multimodal Emotion Recognition using Frege's
Principle [71.47160118286226]
We present EmotiCon, a learning-based algorithm for context-aware perceived human emotion recognition from videos and images.
Motivated by Frege's Context Principle from psychology, our approach combines three interpretations of context for emotion recognition.
We report an Average Precision (AP) score of 35.48 across 26 classes, an improvement of 7-8 points over prior methods.
arXiv Detail & Related papers (2020-03-14T19:55:21Z) - An End-to-End Visual-Audio Attention Network for Emotion Recognition in
User-Generated Videos [64.91614454412257]
We propose to recognize video emotions in an end-to-end manner based on convolutional neural networks (CNNs)
Specifically, we develop a deep Visual-Audio Attention Network (VAANet), a novel architecture that integrates spatial, channel-wise, and temporal attentions into a visual 3D CNN and temporal attentions into an audio 2D CNN.
arXiv Detail & Related papers (2020-02-12T15:33:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.