Improvement and Implementation of a Speech Emotion Recognition Model Based on Dual-Layer LSTM
- URL: http://arxiv.org/abs/2411.09189v2
- Date: Thu, 28 Nov 2024 19:28:50 GMT
- Title: Improvement and Implementation of a Speech Emotion Recognition Model Based on Dual-Layer LSTM
- Authors: Xiaoran Yang, Shuhan Yu, Wenxi Xu
- Abstract summary: This paper builds upon an existing speech emotion recognition model by adding an additional LSTM layer.
By capturing the long-term dependencies within audio sequences through a dual-layer LSTM network, the model can recognize and classify complex emotional patterns more accurately.
- Abstract: This paper builds upon an existing speech emotion recognition model by adding an additional LSTM layer to improve the accuracy and processing efficiency of emotion recognition from audio data. By capturing the long-term dependencies within audio sequences through a dual-layer LSTM network, the model can recognize and classify complex emotional patterns more accurately. Experiments conducted on the RAVDESS dataset validated this approach, showing that the modified dual-layer LSTM model improves accuracy by 2% compared to the single-layer LSTM while significantly reducing recognition latency, thereby enhancing real-time performance. These results indicate that the dual-layer LSTM architecture is highly suitable for handling emotional features with long-term dependencies, providing a viable optimization for speech emotion recognition systems. This research provides a reference for practical applications in fields such as intelligent customer service, sentiment analysis, and human-computer interaction.
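As a concrete illustration of the architecture the abstract describes, here is a minimal PyTorch sketch of a dual-layer LSTM emotion classifier. The 40-dimensional MFCC input, hidden size of 128, and the eight RAVDESS emotion classes are illustrative assumptions, not hyperparameters taken from the paper.

```python
import torch
import torch.nn as nn

class DualLayerLSTMSER(nn.Module):
    """Sketch of a dual-layer LSTM speech emotion classifier."""
    def __init__(self, n_features=40, hidden=128, n_classes=8):
        super().__init__()
        # num_layers=2 stacks two LSTM layers, mirroring the paper's
        # dual-layer design; num_layers=1 would be the baseline model.
        self.lstm = nn.LSTM(input_size=n_features, hidden_size=hidden,
                            num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):                 # x: (batch, time, n_features)
        out, _ = self.lstm(x)             # out: (batch, time, hidden)
        return self.head(out[:, -1])      # classify from the final frame

model = DualLayerLSTMSER()
mfcc = torch.randn(4, 300, 40)            # 4 utterances, 300 frames, 40 MFCCs
logits = model(mfcc)                      # (4, 8) emotion logits
```

Setting num_layers=1 recovers the single-layer baseline that the reported 2% accuracy gain is measured against.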
Related papers
- BiLSTM and Attention-Based Modulation Classification of Realistic Wireless Signals [2.0650230600617534]
The proposed model exploits multiple representations of the wireless signal as inputs to the network.
An attention layer is used after the BiLSTM layer to emphasize the important temporal features.
The experimental results on the recent and realistic RML22 dataset demonstrate the superior performance of the proposed model, with an accuracy of up to approximately 99%.
arXiv Detail & Related papers (2024-08-14T01:17:19Z)
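A minimal sketch of the attention-after-BiLSTM pattern summarized in the entry above. The input dimension and 11-class output are placeholders, and the paper's multiple signal representations are collapsed into a single feature stream for brevity.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """Attention pooling over BiLSTM outputs (dimensions are illustrative)."""
    def __init__(self, n_features=64, hidden=128, n_classes=11):
        super().__init__()
        self.bilstm = nn.LSTM(n_features, hidden, batch_first=True,
                              bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)    # one score per time step
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                        # x: (batch, time, n_features)
        h, _ = self.bilstm(x)                    # (batch, time, 2*hidden)
        w = torch.softmax(self.attn(h), dim=1)   # weights over the time axis
        context = (w * h).sum(dim=1)             # emphasize important frames
        return self.head(context)
```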
- RLEEGNet: Integrating Brain-Computer Interfaces with Adaptive AI for Intuitive Responsiveness and High-Accuracy Motor Imagery Classification [0.0]
We introduce a framework that leverages Reinforcement Learning with Deep Q-Networks (DQN) for classification tasks.
We present a preprocessing technique for multiclass motor imagery (MI) classification in a One-Versus-The-Rest (OVR) manner.
The integration of DQN with a 1D-CNN-LSTM architecture optimizes the decision-making process in real time.
arXiv Detail & Related papers (2024-02-09T02:03:13Z)
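As a rough illustration of the 1D-CNN-LSTM decision network mentioned above, this is a hedged sketch of a Q-network in that style; the EEG channel count, kernel size, and four-way action space (one action per MI class in an OVR setup) are placeholders, not RLEEGNet's actual configuration.

```python
import torch
import torch.nn as nn

class CNNLSTMQNet(nn.Module):
    """1D-CNN front end + LSTM, emitting one Q-value per class as in a DQN."""
    def __init__(self, n_channels=22, n_actions=4, hidden=64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_channels, 32, kernel_size=7, padding=3),
            nn.ReLU())
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.q_head = nn.Linear(hidden, n_actions)

    def forward(self, x):                    # x: (batch, channels, time)
        f = self.conv(x)                     # (batch, 32, time)
        h, _ = self.lstm(f.transpose(1, 2))  # LSTM expects (batch, time, feat)
        return self.q_head(h[:, -1])         # Q-values from the final state

q_values = CNNLSTMQNet()(torch.randn(1, 22, 500))  # (1, 4)
```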
- A Comparative Study of Data Augmentation Techniques for Deep Learning Based Emotion Recognition [11.928873764689458]
We conduct a comprehensive evaluation of popular deep learning approaches for emotion recognition.
We show that long-range dependencies in the speech signal are critical for emotion recognition.
Speed/rate augmentation offers the most robust performance gain across models.
arXiv Detail & Related papers (2022-11-09T17:27:03Z)
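The speed/rate augmentation highlighted above is straightforward to sketch; the following NumPy snippet implements Kaldi-style speed perturbation by linear resampling (which shifts both tempo and pitch). The 0.9/1.1 factors are the conventional choices, not values taken from this paper.

```python
import numpy as np

def speed_perturb(wave: np.ndarray, factor: float) -> np.ndarray:
    """Resample a waveform by `factor` (>1 = faster and shorter)."""
    n_out = int(len(wave) / factor)
    # Fractional sample positions in the original signal.
    idx = np.linspace(0, len(wave) - 1, num=n_out)
    return np.interp(idx, np.arange(len(wave)), wave)

rng = np.random.default_rng(0)
wave = rng.standard_normal(16000)      # 1 s of placeholder 16 kHz audio
faster = speed_perturb(wave, 1.1)      # ~0.91 s
slower = speed_perturb(wave, 0.9)      # ~1.11 s
```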
- Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs), represented by long short-term memory recurrent neural networks (LSTM-RNNs) and Transformers, are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
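The full Bayesian learning framework of the paper above is not reproduced here; as a generic flavor of modeling weight uncertainty, the sketch below implements a Bayes-by-backprop-style variational linear layer using the reparameterization trick. This is a stand-in technique for illustration, not the paper's exact scheme.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesLinear(nn.Module):
    """Linear layer with a Gaussian posterior over its weights."""
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = nn.Parameter(torch.full((d_out, d_in), -5.0))  # softplus(rho) = std

    def forward(self, x):
        std = F.softplus(self.rho)
        # Reparameterization: sample weights while keeping mu/rho trainable.
        w = self.mu + std * torch.randn_like(std)
        return x @ w.t()

layer = BayesLinear(16, 4)
samples = torch.stack([layer(torch.ones(1, 16)) for _ in range(8)])
print(samples.var(dim=0))  # predictive variance reflects weight uncertainty
```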
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
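A schematic of the late-fusion topology described above. The speaker-recognition and BERT encoders are replaced with linear stubs (their input and output sizes are arbitrary), since only the point of fusion is being illustrated; the four-class output follows common IEMOCAP setups but is likewise an assumption.

```python
import torch
import torch.nn as nn

class LateFusionSER(nn.Module):
    """Independent speech/text encoders whose embeddings are fused late."""
    def __init__(self, d_speech=192, d_text=768, n_classes=4):
        super().__init__()
        self.speech_enc = nn.Linear(40, d_speech)  # stand-in for a speaker model
        self.text_enc = nn.Linear(300, d_text)     # stand-in for BERT
        self.fusion = nn.Linear(d_speech + d_text, n_classes)

    def forward(self, speech_feats, text_feats):
        s = self.speech_enc(speech_feats)          # utterance embedding
        t = self.text_enc(text_feats)              # transcript embedding
        return self.fusion(torch.cat([s, t], dim=-1))  # fuse late, then classify
```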
- A Mixture of Expert Based Deep Neural Network for Improved ASR [4.993304210475779]
MixNet is a novel deep learning architecture for acoustic modeling in the context of Automatic Speech Recognition (ASR).
In natural speech, overlap in distribution across different acoustic classes is inevitable, which leads to inter-class mis-classification.
Experiments conducted on a large-vocabulary ASR task show that the proposed architecture provides 13.6% and 10.0% relative reductions in word error rate.
arXiv Detail & Related papers (2021-12-02T07:26:34Z)
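A minimal sketch of the soft mixture-of-experts pattern underlying MixNet-style acoustic models: a gating network produces per-input weights over several experts, which softens hard class boundaries. The expert count and layer sizes are illustrative, not MixNet's actual architecture.

```python
import torch
import torch.nn as nn

class MixtureOfExperts(nn.Module):
    """Gate-weighted combination of expert sub-networks."""
    def __init__(self, d_in=40, d_out=32, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_in, d_out), nn.ReLU())
            for _ in range(n_experts))
        self.gate = nn.Linear(d_in, n_experts)

    def forward(self, x):                           # x: (batch, d_in)
        w = torch.softmax(self.gate(x), dim=-1)     # (batch, n_experts)
        y = torch.stack([e(x) for e in self.experts], dim=1)
        return (w.unsqueeze(-1) * y).sum(dim=1)     # weighted expert mix
```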
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
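The spectrogram augmentation mentioned above is commonly done SpecAugment-style; this NumPy sketch applies one random frequency mask and one random time mask. The mask widths are common defaults, not necessarily this paper's policy.

```python
import numpy as np

def spec_augment(spec: np.ndarray, max_f: int = 8, max_t: int = 20,
                 rng=np.random.default_rng()) -> np.ndarray:
    """Zero one random frequency band and one time span of a (freq, frames) array."""
    spec = spec.copy()
    n_f, n_t = spec.shape
    f = rng.integers(0, max_f + 1)       # frequency mask width
    f0 = rng.integers(0, n_f - f + 1)
    spec[f0:f0 + f, :] = 0.0
    t = rng.integers(0, max_t + 1)       # time mask width
    t0 = rng.integers(0, n_t - t + 1)
    spec[:, t0:t0 + t] = 0.0
    return spec

mel = np.random.rand(80, 300)            # placeholder 80-bin mel spectrogram
augmented = spec_augment(mel)
```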
- Improving Deep Learning for HAR with shallow LSTMs [70.94062293989832]
We propose to alter the DeepConvLSTM to employ a 1-layered instead of a 2-layered LSTM.
Our results stand in contrast to the belief that one needs at least a 2-layered LSTM when dealing with sequential data.
arXiv Detail & Related papers (2021-08-02T08:14:59Z)
- Multi-Perspective LSTM for Joint Visual Representation Learning [81.21490913108835]
We present a novel LSTM cell architecture capable of learning both intra- and inter-perspective relationships available in visual sequences captured from multiple perspectives.
Our architecture adopts a novel recurrent joint learning strategy that uses additional gates and memories at the cell level.
We show that by using the proposed cell to create a network, more effective and richer visual representations are learned for recognition tasks.
arXiv Detail & Related papers (2021-05-06T16:44:40Z)
- Multi-view Frequency LSTM: An Efficient Frontend for Automatic Speech Recognition [4.753402561130792]
We present a simple and efficient modification by combining the outputs of multiple FLSTM stacks with different views.
We show that the multi-view FLSTM acoustic model provides relative Word Error Rate (WER) improvements of 3-7% for different speaker and acoustic environment scenarios.
arXiv Detail & Related papers (2020-06-30T22:19:53Z)
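A rough sketch of the multi-view idea summarized above: several frequency-LSTM stacks each scan the frequency axis of a frame under a different block-size "view", and their outputs are concatenated. The bin count, block sizes, and hidden size are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class MultiViewFLSTM(nn.Module):
    """Frequency-LSTM stacks reading different block views of the spectrum."""
    def __init__(self, hidden=32, block_sizes=(8, 16)):
        super().__init__()
        self.block_sizes = block_sizes
        self.views = nn.ModuleList(
            nn.LSTM(b, hidden, batch_first=True) for b in block_sizes)

    def forward(self, x):                            # x: (batch, time, n_freq)
        b, t, f = x.shape
        outs = []
        for blk, lstm in zip(self.block_sizes, self.views):
            seq = x.reshape(b * t, f // blk, blk)    # frequency blocks as a sequence
            h, _ = lstm(seq)
            outs.append(h[:, -1].reshape(b, t, -1))  # last state per frame
        return torch.cat(outs, dim=-1)               # concatenated multi-view features

feats = MultiViewFLSTM()(torch.randn(2, 100, 64))    # -> (2, 100, 64)
```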
- High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model [46.34788932277904]
We improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition.
To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks.
We further improve the training strategy with sequence-level teacher-student learning.
arXiv Detail & Related papers (2020-03-17T00:52:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.