Real-time Speech Emotion Recognition Based on Syllable-Level Feature
Extraction
- URL: http://arxiv.org/abs/2204.11382v2
- Date: Tue, 26 Apr 2022 12:01:12 GMT
- Title: Real-time Speech Emotion Recognition Based on Syllable-Level Feature
Extraction
- Authors: Abdul Rehman, Zhen-Tao Liu, Min Wu, Wei-Hua Cao, and Cheng-Shan Jiang
- Abstract summary: We present a speech emotion recognition system based on a reductionist approach of decomposing and analyzing syllable-level features.
A set of syllable-level formant features is extracted and fed into a single hidden layer neural network that makes predictions for each syllable.
Experiments show that the method achieves real-time latency while predicting with state-of-the-art cross-corpus unweighted accuracy of 47.6% for IE to MI and 56.2% for MI to IE.
- Score: 7.0019575386261375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech emotion recognition systems suffer from high prediction latency, owing to the heavy computational requirements of deep learning models, and from low generalizability, mainly because of the poor reliability of emotional measurements across multiple corpora. To solve these problems, we present a
speech emotion recognition system based on a reductionist approach of
decomposing and analyzing syllable-level features. Mel-spectrogram of an audio
stream is decomposed into syllable-level components, which are then analyzed to
extract statistical features. The proposed method uses formant attention, noise-gate filtering, and rolling normalization contexts to increase feature-processing speed and tolerance to adverse acoustic conditions. A set of syllable-level formant
features is extracted and fed into a single hidden layer neural network that
makes predictions for each syllable as opposed to the conventional approach of
using a sophisticated deep learner to make sentence-wide predictions. The
syllable-level predictions help achieve real-time latency and lower the aggregated error in utterance-level cross-corpus predictions. The experiments
on IEMOCAP (IE), MSP-Improv (MI), and RAVDESS (RA) databases show that the
method achieves real-time latency while predicting with state-of-the-art
cross-corpus unweighted accuracy of 47.6% for IE to MI and 56.2% for MI to IE.
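For illustration, below is a minimal Python sketch of the kind of syllable-level pipeline the abstract describes: the mel-spectrogram is split into energy-gated segments, each segment is reduced to a small statistical feature vector under a rolling normalization context, and a single-hidden-layer network classifies every syllable, with utterance-level predictions obtained by averaging syllable probabilities. The energy-based segmentation, window sizes, feature set, 64-unit hidden layer, and averaging rule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a syllable-level SER pipeline. Segmentation heuristic,
# window sizes, features, and hidden-layer width are illustrative assumptions.
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier


def mel_frames(y, sr, n_mels=40, hop_length=160):
    """Log mel-spectrogram with frames along axis 0."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    return librosa.power_to_db(mel, ref=np.max).T  # (frames, n_mels)


def rolling_normalize(frames, context=300):
    """Normalize each frame with statistics of a trailing context window
    (a stand-in for the rolling normalization contexts in the abstract)."""
    out = np.empty_like(frames)
    for t in range(len(frames)):
        ctx = frames[max(0, t - context):t + 1]
        out[t] = (frames[t] - ctx.mean(axis=0)) / (ctx.std(axis=0) + 1e-6)
    return out


def split_syllables(frames, gate_db=-35.0, min_len=5):
    """Crude energy-based syllable decomposition with a noise gate:
    contiguous runs of frames above the gate are treated as syllables."""
    energy = frames.mean(axis=1)  # frames are already in dB (max = 0 dB)
    voiced = energy > gate_db
    segments, start = [], None
    for t, v in enumerate(voiced):
        if v and start is None:
            start = t
        elif not v and start is not None:
            if t - start >= min_len:
                segments.append((start, t))
            start = None
    if start is not None and len(frames) - start >= min_len:
        segments.append((start, len(frames)))
    return segments


def syllable_features(frames, seg):
    """Statistical features of one syllable: per-band mean/std plus the
    dominant band (a rough formant-attention proxy) and the duration."""
    s = frames[seg[0]:seg[1]]
    return np.concatenate([s.mean(axis=0), s.std(axis=0),
                           [float(s.mean(axis=0).argmax()), seg[1] - seg[0]]])


def utterance_features(y, sr):
    """Feature matrix with one row per detected syllable."""
    frames = mel_frames(y, sr)
    segs = split_syllables(frames)
    normed = rolling_normalize(frames)
    return np.stack([syllable_features(normed, s) for s in segs]) if segs else None


# Single-hidden-layer network predicting an emotion per syllable; utterance-level
# labels are obtained by averaging syllable probabilities (assumed aggregation).
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
# clf.fit(train_syllable_features, train_syllable_labels)  # hypothetical training arrays
# utt_probs = clf.predict_proba(utterance_features(y, sr)).mean(axis=0)
```

Training data for such a classifier would pair each syllable's feature vector with its utterance's emotion label; averaging per-syllable probabilities at the end is one plausible reading of the utterance-level aggregation mentioned in the abstract.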
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300ms prior to the actual EOU.
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- YOLO-Stutter: End-to-end Region-Wise Speech Dysfluency Detection [5.42845980208244]
YOLO-Stutter is the first end-to-end method that detects dysfluencies in a time-accurate manner.
VCTK-Stutter and VCTK-TTS simulate natural spoken dysfluencies including repetition, block, missing, replacement, and prolongation.
arXiv Detail & Related papers (2024-08-27T11:31:12Z)
- Statistics-aware Audio-visual Deepfake Detector [11.671275975119089]
Methods in audio-visual deepfake detection mostly assess the synchronization between audio and visual features.
We propose a statistical feature loss to enhance the discrimination capability of the model.
Experiments on the DFDC and FakeAVCeleb datasets demonstrate the relevance of the proposed method.
arXiv Detail & Related papers (2024-07-16T12:15:41Z)
- EmoDiarize: Speaker Diarization and Emotion Identification from Speech Signals using Convolutional Neural Networks [0.0]
This research explores the integration of deep learning techniques in speech emotion recognition.
It introduces a framework that combines a pre-existing speaker diarization pipeline and an emotion identification model built on a Convolutional Neural Network (CNN).
The proposed model yields an unweighted accuracy of 63%, demonstrating remarkable efficiency in accurately identifying emotional states within speech signals.
arXiv Detail & Related papers (2023-10-19T16:02:53Z)
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)