SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2502.00310v1
- Date: Sat, 01 Feb 2025 04:18:06 GMT
- Title: SigWavNet: Learning Multiresolution Signal Wavelet Network for Speech Emotion Recognition
- Authors: Alaa Nfissi, Wassim Bouachir, Nizar Bouguila, Brian Mishara,
- Abstract summary: Speech emotion recognition (SER) plays an important role in emotional states from deciphering speech signals.
This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER.
It exploits the capabilities of wavelets for effective localization in both time and frequency domains.
- Score: 17.568724398229232
- License:
- Abstract: In the field of human-computer interaction and psychological assessment, speech emotion recognition (SER) plays an important role in deciphering emotional states from speech signals. Despite advancements, challenges persist due to system complexity, feature distinctiveness issues, and noise interference. This paper introduces a new end-to-end (E2E) deep learning multi-resolution framework for SER, addressing these limitations by extracting meaningful representations directly from raw waveform speech signals. By leveraging the properties of the fast discrete wavelet transform (FDWT), including the cascade algorithm, conjugate quadrature filter, and coefficient denoising, our approach introduces a learnable model for both wavelet bases and denoising through deep learning techniques. The framework incorporates an activation function for learnable asymmetric hard thresholding of wavelet coefficients. Our approach exploits the capabilities of wavelets for effective localization in both time and frequency domains. We then combine one-dimensional dilated convolutional neural networks (1D dilated CNN) with a spatial attention layer and bidirectional gated recurrent units (Bi-GRU) with a temporal attention layer to efficiently capture the nuanced spatial and temporal characteristics of emotional features. By handling variable-length speech without segmentation and eliminating the need for pre or post-processing, the proposed model outperformed state-of-the-art methods on IEMOCAP and EMO-DB datasets. The source code of this paper is shared on the Github repository: https://github.com/alaaNfissi/SigWavNet-Learning-Multiresolution-Signal-Wavelet-Network-for-Speech-E motion-Recognition.
Related papers
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech
Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN)
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
arXiv Detail & Related papers (2023-10-19T16:02:53Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - End-to-End Active Speaker Detection [58.7097258722291]
We propose an end-to-end training network where feature learning and contextual predictions are jointly learned.
We also introduce intertemporal graph neural network (iGNN) blocks, which split the message passing according to the main sources of context in the ASD problem.
Experiments show that the aggregated features from the iGNN blocks are more suitable for ASD, resulting in state-of-the art performance.
arXiv Detail & Related papers (2022-03-27T08:55:28Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Speech recognition for air traffic control via feature learning and
end-to-end training [8.755785876395363]
We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems.
The proposed model integrates the feature learning block, recurrent neural network (RNN), and connectionist temporal classification loss.
Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner.
arXiv Detail & Related papers (2021-11-04T06:38:21Z) - Improved Speech Emotion Recognition using Transfer Learning and
Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z) - Efficient Speech Emotion Recognition Using Multi-Scale CNN and Attention [2.8017924048352576]
We propose a simple yet efficient neural network architecture to exploit both acoustic and lexical informationfrom speech.
The proposed framework using multi-scale con-volutional layers (MSCNN) to obtain both audio and text hid-den representations.
Extensive experiments show that the proposed modeloutperforms previous state-of-the-art methods on IEMOCAPdataset.
arXiv Detail & Related papers (2021-06-08T06:45:42Z) - Attention Driven Fusion for Multi-Modal Emotion Recognition [39.295892047505816]
We present a deep learning-based approach to exploit and fuse text and acoustic data for emotion classification.
We use a SincNet layer, based on parameterized sinc functions with band-pass filters, to extract acoustic features from raw audio followed by a DCNN.
For text processing, we use two branches (a DCNN and a Bi-direction RNN followed by a DCNN) in parallel where cross attention is introduced to infer the N-gram level correlations.
arXiv Detail & Related papers (2020-09-23T08:07:58Z) - Attentive WaveBlock: Complementarity-enhanced Mutual Networks for
Unsupervised Domain Adaptation in Person Re-identification and Beyond [97.25179345878443]
This paper proposes a novel light-weight module, the Attentive WaveBlock (AWB)
AWB can be integrated into the dual networks of mutual learning to enhance the complementarity and further depress noise in the pseudo-labels.
Experiments demonstrate that the proposed method achieves state-of-the-art performance with significant improvements on multiple UDA person re-identification tasks.
arXiv Detail & Related papers (2020-06-11T15:40:40Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.