Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems
- URL: http://arxiv.org/abs/2104.08806v2
- Date: Thu, 31 Aug 2023 18:26:56 GMT
- Title: Best Practices for Noise-Based Augmentation to Improve the Performance of Deployable Speech-Based Emotion Recognition Systems
- Authors: Mimansa Jaiswal, Emily Mower Provost
- Abstract summary: Speech emotion recognition is an important component of any human-centered system.
Noise augmentation makes one important assumption: that the prediction label should remain the same in the presence or absence of noise.
We validate through crowdsourcing that the presence of noise does change the annotation label and hence may alter the original ground-truth label.
- Score: 15.013423048411493
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech emotion recognition is an important component of any human-centered system. But the speech characteristics produced and perceived by a person can be influenced by a multitude of factors, both desirable, such as emotion, and undesirable, such as noise. To train robust emotion recognition models, we need a large yet realistic data distribution, but emotion datasets are often small and hence are augmented with noise. Noise augmentation typically makes one important assumption: that the prediction label should remain the same in the presence or absence of noise, which is true for automatic speech recognition but not necessarily true for perception-based tasks. In this paper we make three novel contributions. We validate through crowdsourcing that the presence of noise does change the annotation label and hence may alter the original ground-truth label. We then show how disregarding this knowledge and assuming consistency in ground-truth labels propagates to downstream evaluation of ML models, both for performance evaluation and robustness testing. We end the paper with a set of recommendations for noise augmentation in speech emotion recognition datasets.
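For context, the augmentation recipe the abstract critiques is commonly implemented by mixing a noise recording into the clean utterance at a chosen signal-to-noise ratio and copying the clean utterance's label onto the noisy copy. The following is a minimal NumPy sketch of that recipe under illustrative assumptions (the function name, SNR value, and random stand-in signals are not from the paper); the last line is exactly the label-consistency assumption the crowdsourcing study questions.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the mixture has the requested signal-to-noise ratio (dB)."""
    # Tile or trim the noise to match the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + gain * noise

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)    # stand-in for a 1 s utterance at 16 kHz
babble = rng.standard_normal(16000)   # stand-in for a noise recording
augmented = mix_at_snr(clean, babble, snr_db=5.0)
label = "happy"  # the clean label is copied to the noisy copy; the paper questions this step
```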
Related papers
- Disentangle Identity, Cooperate Emotion: Correlation-Aware Emotional Talking Portrait Generation [63.94836524433559]
DICE-Talk is a framework that disentangles identity from emotion and exploits correlations among emotions with similar characteristics.
First, we develop a disentangled emotion embedder that jointly models audio-visual emotional cues through cross-modal attention.
Second, we introduce a correlation-enhanced emotion conditioning module with learnable Emotion Banks.
Third, we design an emotion discrimination objective that enforces affective consistency during the diffusion process.
arXiv Detail & Related papers (2025-04-25T05:28:21Z)
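The emotion embedder in the entry above is described as jointly modeling audio-visual cues through cross-modal attention. Below is a minimal, generic sketch of that attention pattern; the dimensions, module layout, and the use of `nn.MultiheadAttention` are illustrative assumptions, not DICE-Talk's actual architecture.

```python
import torch
import torch.nn as nn

# Generic cross-modal attention: audio features query visual features,
# producing a fused representation of the emotional cues in both streams.
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
audio_feats = torch.randn(2, 100, 256)   # (batch, audio frames, feature dim)
visual_feats = torch.randn(2, 30, 256)   # (batch, video frames, feature dim)
fused, _ = attn(query=audio_feats, key=visual_feats, value=visual_feats)
print(fused.shape)  # torch.Size([2, 100, 256])
```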
- Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions [4.507408840040573]
We demonstrate that using the probability density function of the emotion grades as targets provides better performance on benchmark evaluation sets.
We show that saliency-driven selection of foundation model (FM) representations helps to train a state-of-the-art speech emotion model.
We demonstrate that performance evaluation across multiple test sets and performance analysis across gender and speakers are helpful in assessing the usefulness of emotion models.
arXiv Detail & Related papers (2025-03-24T06:13:27Z)
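The entry above advocates training against the distribution of annotator grades rather than a single hard label. A toy sketch of that idea follows, with random stand-ins for model logits and annotator distributions; the KL-divergence loss is one common way to fit distribution targets, not necessarily the paper's exact objective.

```python
import torch
import torch.nn.functional as F

# Stand-ins: 8 utterances, 5 emotion grades. `grade_pdf` plays the role of the
# per-utterance distribution of annotator grades (each row sums to 1).
logits = torch.randn(8, 5, requires_grad=True)
grade_pdf = torch.softmax(torch.randn(8, 5), dim=-1)

# Fit the full label distribution instead of the majority-vote class.
loss = F.kl_div(F.log_softmax(logits, dim=-1), grade_pdf, reduction="batchmean")
loss.backward()
```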
- Prompting Audios Using Acoustic Properties For Emotion Representation [36.275219004598874]
We propose the use of natural language descriptions (or prompts) to better represent emotions.
We use acoustic properties that are correlated with emotion, such as pitch, intensity, speech rate, and articulation rate, to automatically generate prompts.
Our results show that the acoustic prompts significantly improve the model's performance in various Precision@K metrics.
arXiv Detail & Related papers (2023-10-03T13:06:58Z)
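The prompting idea above turns a handful of measured acoustic properties into a natural-language description of an utterance. A rough sketch of that step using librosa is given below; the thresholds, the prompt wording, the restriction to pitch and loudness, and the file path are all illustrative, not the papers' actual procedure.

```python
import librosa
import numpy as np

def acoustic_prompt(path):
    """Build a simple natural-language prompt from measured acoustic properties."""
    y, sr = librosa.load(path, sr=16000)
    f0, _, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)   # fundamental frequency track
    rms = librosa.feature.rms(y=y)[0]                       # frame-level energy
    pitch = "high" if np.nanmean(f0) > 180 else "low"       # illustrative threshold
    loudness = "loud" if rms.mean() > 0.05 else "soft"      # illustrative threshold
    return f"This is a {loudness} utterance spoken in a {pitch} pitch."

# prompt = acoustic_prompt("utterance.wav")  # placeholder path
```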
- Describing emotions with acoustic property prompts for speech emotion recognition [30.990720176317463]
We devise a method to automatically create a description for a given audio clip by computing acoustic properties, such as pitch, loudness, speech rate, and articulation rate.
We train a neural network model using these audio-text pairs and evaluate the model on an additional dataset.
We investigate how the model can learn to associate the audio with the descriptions, resulting in improved performance on Speech Emotion Recognition and Speech Audio Retrieval.
arXiv Detail & Related papers (2022-11-14T20:29:37Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
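The switching step described above pairs each clean utterance with its noisy copy and asks each branch to predict the other's quantized representation. The sketch below shows only that target swap, using random stand-in tensors and a plain cosine loss; the actual method uses wav2vec 2.0's masked contrastive objective with distractors rather than this simplified loss.

```python
import torch
import torch.nn.functional as F

# Random stand-ins: (batch, frames, dim) context vectors and quantized targets.
c_clean = torch.randn(4, 50, 256, requires_grad=True)   # context from clean input
c_noisy = torch.randn(4, 50, 256, requires_grad=True)   # context from noisy input
q_clean = torch.randn(4, 50, 256)                        # quantized targets, clean
q_noisy = torch.randn(4, 50, 256)                        # quantized targets, noisy

# Swapped targets: the clean branch predicts the noisy quantization and vice versa.
swap_loss = (1 - F.cosine_similarity(c_clean, q_noisy, dim=-1)).mean() \
          + (1 - F.cosine_similarity(c_noisy, q_clean, dim=-1)).mean()
swap_loss.backward()
```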
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset with an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Investigations on Audiovisual Emotion Recognition in Noisy Conditions [43.40644186593322]
We present an investigation of two emotion datasets with superimposed noise at different signal-to-noise ratios.
The results show a significant performance decrease when a model trained on clean audio is applied to noisy data.
arXiv Detail & Related papers (2021-03-02T17:45:16Z)
- Facial Emotion Recognition with Noisy Multi-task Annotations [88.42023952684052]
We introduce a new problem of facial emotion recognition with noisy multi-task annotations.
For this new problem, we suggest a formulation based on matching joint distributions.
We develop a new method that enables both emotion prediction and joint distribution learning.
arXiv Detail & Related papers (2020-10-19T20:39:37Z)
- Embedded Emotions -- A Data Driven Approach to Learn Transferable Feature Representations from Raw Speech Input for Emotion Recognition [1.4556324908347602]
We investigate the applicability of transferring knowledge learned from large text and audio corpora to the task of automatic emotion recognition.
Our results show that the learned feature representations can be effectively applied for classifying emotions from spoken language.
arXiv Detail & Related papers (2020-09-30T09:18:31Z)
- x-vectors meet emotions: A study on dependencies between emotion and speaker recognition [38.181055783134006]
We show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning.
For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features extracted from pre-trained models.
We present results on the effect of emotion on speaker verification.
arXiv Detail & Related papers (2020-02-12T15:13:07Z)
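The claim above, that a simple linear model over features from a pre-trained speaker model already performs well for emotion recognition, corresponds to a setup like the following sketch; the random matrix stands in for x-vector-style embeddings, and the embedding dimension and class count are arbitrary choices, not the paper's.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 512))   # stand-in for 512-dim x-vector embeddings
y = rng.integers(0, 4, size=200)      # stand-in labels for 4 emotion classes

# A plain linear classifier on top of frozen embeddings.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", clf.score(X, y))
```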
This list is automatically generated from the titles and abstracts of the papers on this site.