Enhanced Speech Emotion Recognition with Efficient Channel Attention Guided Deep CNN-BiLSTM Framework
- URL: http://arxiv.org/abs/2412.10011v1
- Date: Fri, 13 Dec 2024 09:55:03 GMT
- Title: Enhanced Speech Emotion Recognition with Efficient Channel Attention Guided Deep CNN-BiLSTM Framework
- Authors: Niloy Kumar Kundu, Sarah Kobir, Md. Rayhan Ahmed, Tahmina Aktar, Niloya Roy
- Abstract summary: Speech emotion recognition (SER) is crucial for enhancing affective computing and enriching the domain of human-computer interaction.
We propose a lightweight SER architecture that integrates attention-based local feature blocks (ALFBs) to capture high-level relevant feature vectors from speech signals.
We also incorporate a global feature block (GFB) technique to capture sequential, global information and long-term dependencies in speech signals.
- Score: 0.7864304771129751
- License:
- Abstract: Speech emotion recognition (SER) is crucial for enhancing affective computing and enriching the domain of human-computer interaction. However, the main challenge in SER lies in selecting relevant feature representations from speech signals with lower computational costs. In this paper, we propose a lightweight SER architecture that integrates attention-based local feature blocks (ALFBs) to capture high-level relevant feature vectors from speech signals. We also incorporate a global feature block (GFB) technique to capture sequential, global information and long-term dependencies in speech signals. By aggregating attention-based local and global contextual feature vectors, our model effectively captures the internal correlation between salient features that reflect complex human emotional cues. To evaluate our approach, we extracted four types of spectral features from speech audio samples: mel-frequency cepstral coefficients, mel-spectrogram, root mean square value, and zero-crossing rate. Through a 5-fold cross-validation strategy, we tested the proposed method on five multi-lingual standard benchmark datasets: TESS, RAVDESS, BanglaSER, SUBESCO, and Emo-DB, and obtained a mean accuracy of 99.65%, 94.88%, 98.12%, 97.94%, and 97.19%, respectively. The results indicate that our model achieves state-of-the-art (SOTA) performance compared to most existing methods.
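As a concrete illustration of the front end described in the abstract, the four spectral features can be extracted with librosa along the following lines. This is a minimal sketch: the sampling rate, MFCC count, and the mean-over-time pooling are assumptions for illustration, not values taken from the paper.

```python
import numpy as np
import librosa

def extract_features(path, sr=22050, n_mfcc=40):
    """Extract the four features named in the abstract: MFCCs,
    mel-spectrogram, RMS energy, and zero-crossing rate."""
    y, sr = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, T)
    mel = librosa.feature.melspectrogram(y=y, sr=sr)         # (128, T)
    mel_db = librosa.power_to_db(mel, ref=np.max)            # log-mel
    rms = librosa.feature.rms(y=y)                           # (1, T)
    zcr = librosa.feature.zero_crossing_rate(y)              # (1, T)
    # Pool each feature over time and concatenate into one fixed-length
    # vector (an assumed aggregation, not necessarily the paper's scheme).
    return np.hstack([mfcc.mean(axis=1), mel_db.mean(axis=1),
                      rms.mean(axis=1), zcr.mean(axis=1)])
```

Vectors produced this way can then be evaluated under the paper's 5-fold protocol, e.g. with sklearn.model_selection.StratifiedKFold.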
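The "efficient channel attention" of the title presumably refers to an ECA-Net-style block: global average pooling, a cheap 1-D convolution across channels, and a sigmoid gate that rescales each channel. Below is a minimal PyTorch sketch of such a block inside a toy CNN-BiLSTM that mirrors the local (ALFB) plus global (GFB) split; all layer sizes and the exact wiring are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention (ECA-Net style): global average pool,
    1-D conv across channels, sigmoid gate. k is the cross-channel kernel."""
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                       # x: (batch, channels, time)
        w = x.mean(dim=-1, keepdim=True)        # global average pool -> (B, C, 1)
        w = self.conv(w.transpose(-1, -2))      # conv across channels -> (B, 1, C)
        w = torch.sigmoid(w).transpose(-1, -2)  # back to (B, C, 1)
        return x * w                            # rescale each channel

class CNNBiLSTM(nn.Module):
    """Toy local CNN with ECA (ALFB-like) followed by a BiLSTM for
    sequential, global context (GFB-like). Sizes are illustrative."""
    def __init__(self, n_feats=40, n_classes=7):
        super().__init__()
        self.local = nn.Sequential(
            nn.Conv1d(n_feats, 64, kernel_size=5, padding=2),
            nn.BatchNorm1d(64), nn.ReLU(), ECA(k=3), nn.MaxPool1d(2),
        )
        self.global_ = nn.LSTM(64, 64, batch_first=True, bidirectional=True)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                       # x: (batch, n_feats, time)
        h = self.local(x)                       # (B, 64, T/2)
        h, _ = self.global_(h.transpose(1, 2))  # (B, T/2, 128)
        return self.head(h.mean(dim=1))         # average over time -> logits
```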
Related papers
- SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition [17.568724398229232]
Speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals.
This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER.
It exploits the capabilities of wavelets for effective localization in both time and frequency domains (a brief decomposition sketch follows this entry).
arXiv Detail & Related papers (2025-02-01T04:18:06Z)
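For intuition about the wavelet machinery the SigWavNet entry invokes, a fixed multiresolution decomposition can be sketched with PyWavelets; SigWavNet learns its filters end-to-end, so the "db4" family and the depth below are arbitrary stand-ins.

```python
import numpy as np
import pywt

# Decompose a 1-D speech signal into multiresolution wavelet bands.
# A fixed decomposition only illustrates the time-frequency
# localization idea; it is not the paper's learned filter bank.
rng = np.random.default_rng(0)
signal = rng.standard_normal(16000)               # 1 s of fake 16 kHz audio
coeffs = pywt.wavedec(signal, wavelet="db4", level=4)
for i, c in enumerate(coeffs):
    band = "approx" if i == 0 else f"detail {i}"
    print(f"{band:>9}: {len(c)} coefficients")
```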
- Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.
We propose a novel approach HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.
We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75% (a cross-attention fusion sketch follows this entry).
arXiv Detail & Related papers (2025-01-06T14:31:25Z)
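The cross-attention fusion at the heart of the HuMP-CAT entry can be sketched with torch.nn.MultiheadAttention. Which stream serves as query versus key/value, and all dimensions, are assumptions here.

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """One stream (e.g. HuBERT frames) queries another (e.g. MFCC/prosodic
    frames); dimensions and the query/key assignment are illustrative."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q_stream, kv_stream):   # both: (batch, time, dim)
        fused, _ = self.attn(q_stream, kv_stream, kv_stream)
        return self.norm(q_stream + fused)    # residual + norm

# Usage with dummy tensors:
fuse = CrossAttentionFusion()
hubert = torch.randn(2, 100, 256)             # stand-in for HuBERT features
mfcc_pros = torch.randn(2, 120, 256)          # stand-in for MFCC + prosody
out = fuse(hubert, mfcc_pros)                 # (2, 100, 256)
```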
- Noise-Resilient Unsupervised Graph Representation Learning via Multi-Hop Feature Quality Estimation [53.91958614666386]
This paper addresses unsupervised graph representation learning (UGRL) based on graph neural networks (GNNs).
We propose a novel UGRL method based on Multi-hop feature Quality Estimation (MQE).
arXiv Detail & Related papers (2024-07-29T12:24:28Z)
- OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion [88.59397418187226]
We propose a novel unified open-vocabulary detection method called OV-DINO.
It is pre-trained on diverse large-scale datasets with language-aware selective fusion in a unified framework.
We evaluate the performance of the proposed OV-DINO on popular open-vocabulary detection benchmarks.
arXiv Detail & Related papers (2024-07-10T17:05:49Z)
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals (a toy CNN sketch follows this entry).
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
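The emotion-identification half of the EmoDiarize entry is a CNN classifier applied per diarized segment; a toy version of that shape might look as follows, with every layer width and the seven-class label set being illustrative assumptions.

```python
import torch
import torch.nn as nn

class EmotionCNN(nn.Module):
    """2-D CNN over an MFCC patch per diarized segment; depth, widths,
    and the 7 emotion classes are illustrative assumptions."""
    def __init__(self, n_classes=7):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, n_classes),
        )

    def forward(self, x):        # x: (batch, 1, n_mfcc, frames)
        return self.net(x)       # emotion logits per speaker segment
```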
- TOPIQ: A Top-down Approach from Semantics to Distortions for Image Quality Assessment [53.72721476803585]
Image Quality Assessment (IQA) is a fundamental task in computer vision that has witnessed remarkable progress with deep neural networks.
We propose a top-down approach that uses high-level semantics to guide the IQA network to focus on semantically important local distortion regions.
A key component of our approach is the proposed cross-scale attention mechanism, which calculates attention maps for lower-level features (a simplified gating sketch follows this entry).
arXiv Detail & Related papers (2023-08-06T09:08:37Z)
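The TOPIQ entry's cross-scale attention, where high-level semantics steer attention over lower-level features, can be caricatured as a learned spatial gate; the real method computes richer attention maps, so treat this as a simplified sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopDownGate(nn.Module):
    """High-level (semantic) features produce a spatial map that reweights
    low-level features -- a simplified stand-in for cross-scale attention."""
    def __init__(self, high_ch=256, low_ch=64):
        super().__init__()
        self.to_map = nn.Conv2d(high_ch, 1, kernel_size=1)

    def forward(self, low, high):   # low: (B,64,H,W), high: (B,256,h,w)
        attn = torch.sigmoid(self.to_map(high))          # (B,1,h,w)
        attn = F.interpolate(attn, size=low.shape[-2:],
                             mode="bilinear", align_corners=False)
        return low * attn           # emphasize semantically salient regions
```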
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
- Learning Speech Emotion Representations in the Quaternion Domain [16.596137913051212]
RH-emo is a novel semi-supervised architecture aimed at extracting quaternion embeddings from real-valued monoaural spectrograms.
RH-emo is a hybrid real/quaternion autoencoder network that consists of a real-valued encoder in parallel to a real-valued emotion classifier and a quaternion-valued decoder.
We test our approach on speech emotion recognition tasks using four popular datasets: IEMOCAP, RAVDESS, Emo-DB, and TESS.
arXiv Detail & Related papers (2022-04-05T17:45:09Z)
- Speech Emotion Recognition Using Quaternion Convolutional Neural Networks [1.776746672434207]
This paper proposes a quaternion convolutional neural network (QCNN) based speech emotion recognition model.
Mel-spectrogram features of speech signals are encoded in an RGB quaternion domain.
The model achieves an accuracy of 77.87%, 70.46%, and 88.78% for the RAVDESS, IEMOCAP, and EMO-DB datasets, respectively (a quaternion-encoding sketch follows this entry).
arXiv Detail & Related papers (2021-10-31T04:06:07Z)
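One common reading of the QCNN entry's "RGB quaternion domain" is: render the mel-spectrogram through a colormap to RGB, then use the three color channels as the imaginary parts of a quaternion with a zero real part. The colormap and the zero-real-part convention below are assumptions for illustration, not necessarily the paper's exact mapping.

```python
import numpy as np
import librosa
import matplotlib.cm as cm

def mel_to_quaternion(y, sr):
    """Mel-spectrogram -> RGB via a colormap -> 4-channel quaternion array
    (zero real part + R, G, B as imaginary parts); an assumed encoding."""
    mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr))
    mel01 = (mel - mel.min()) / (mel.max() - mel.min() + 1e-8)  # scale to [0,1]
    rgb = cm.viridis(mel01)[..., :3]               # (n_mels, T, 3), drop alpha
    zero = np.zeros(rgb.shape[:2] + (1,))          # real part = 0
    return np.concatenate([zero, rgb], axis=-1)    # (n_mels, T, 4)
```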
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation (a minimal masking sketch follows this entry).
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
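Spectrogram augmentation of the kind this last entry combines with transfer learning is commonly done SpecAugment-style, by zeroing random frequency bands and time spans. A minimal NumPy sketch, with arbitrary mask counts and widths:

```python
import numpy as np

def augment_spectrogram(spec, rng, n_freq_masks=2, n_time_masks=2,
                        max_f=8, max_t=20):
    """SpecAugment-style masking: zero random frequency bands and time
    spans of a (freq, time) spectrogram. Widths are illustrative."""
    spec = spec.copy()
    n_freq, n_time = spec.shape
    for _ in range(n_freq_masks):
        f = rng.integers(0, max_f)                 # band height
        f0 = rng.integers(0, max(1, n_freq - f))   # band start
        spec[f0:f0 + f, :] = 0.0
    for _ in range(n_time_masks):
        t = rng.integers(0, max_t)                 # span width
        t0 = rng.integers(0, max(1, n_time - t))   # span start
        spec[:, t0:t0 + t] = 0.0
    return spec

rng = np.random.default_rng(0)
mel = rng.random((128, 300))                  # stand-in mel-spectrogram
aug = augment_spectrogram(mel, rng)           # masked copy for training
```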