Speech Emotion: Investigating Model Representations, Multi-Task Learning
and Knowledge Distillation
- URL: http://arxiv.org/abs/2207.03334v1
- Date: Sat, 2 Jul 2022 17:34:44 GMT
- Title: Speech Emotion: Investigating Model Representations, Multi-Task Learning
and Knowledge Distillation
- Authors: Vikramjit Mitra, Hsiang-Yun Sherry Chien, Vasudha Kowtha, Joseph Yitan
Cheng, Erdrin Azemi
- Abstract summary: Estimating dimensional emotions from acoustic speech signals is challenging.
We show that pre-trained acoustic models can improve valence estimation from speech.
We report new state-of-the-art "text-free" acoustic-only dimensional emotion estimation.
- Score: 6.382013662443799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Estimating dimensional emotions, such as activation, valence and dominance,
from acoustic speech signals has been widely explored over the past few years.
While accurate estimation of activation and dominance from speech seems to be
possible, estimating valence remains challenging. Previous research has shown
that the use of lexical information can improve valence estimation performance.
Such lexical information can be obtained implicitly from pre-trained acoustic
models, whose learned representations can improve valence estimation from
speech. We investigate the use of pre-trained model representations to improve
valence estimation from the acoustic speech signal. We also explore fusion of
representations to improve emotion estimation across all three emotion
dimensions: activation, valence and dominance. Additionally, we investigate
whether representations from pre-trained models can be distilled into models
trained with low-level features, resulting in models with fewer parameters. We
show that fusion of pre-trained model embeddings results in a 79% relative
improvement in concordance correlation coefficient (CCC) on valence estimation
compared to a standard acoustic feature baseline (mel-filterbank energies),
while distillation from pre-trained model embeddings to lower-dimensional
representations yields a 12% relative improvement. These performance gains were
observed on two evaluation sets, indicating that the proposed architecture
generalizes across them. We report new state-of-the-art "text-free",
acoustic-only dimensional emotion estimation CCC values on two MSP-Podcast
evaluation sets.
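The metric reported throughout is the concordance correlation coefficient (CCC),
which rewards both correlation and agreement in mean and scale between predicted
and reference emotion ratings. For reference, here is a minimal NumPy sketch of
Lin's CCC; the function name and the dummy valence arrays are illustrative, not
from the paper:

```python
import numpy as np

def ccc(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Concordance correlation coefficient (Lin, 1989):
    2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2.0 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

# Dummy continuous valence ratings in [-1, 1], purely for illustration.
rng = np.random.default_rng(0)
reference = rng.uniform(-1.0, 1.0, size=500)
predicted = 0.7 * reference + 0.3 * rng.normal(size=500)
print(f"CCC = {ccc(reference, predicted):.3f}")
```

Unlike Pearson correlation, CCC penalizes systematic bias and scale mismatch in
the predictions, which is why it is the standard metric for dimensional emotion
estimation.

The abstract also describes distilling pre-trained embeddings into a smaller
model trained on low-level features, but does not specify the student
architecture or objective. The following PyTorch sketch only illustrates the
general idea of regressing a mel-filterbank student onto frozen teacher
embeddings; all dimensions, names, and the MSE objective are assumptions:

```python
import torch
import torch.nn as nn

class Student(nn.Module):
    """Small model on low-level features, projected into the teacher's embedding space."""
    def __init__(self, n_mels: int = 80, teacher_dim: int = 768):
        super().__init__()
        self.encoder = nn.GRU(n_mels, 128, batch_first=True)
        self.proj = nn.Linear(128, teacher_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        hidden, _ = self.encoder(mel)   # (batch, frames, 128)
        return self.proj(hidden)        # (batch, frames, teacher_dim)

student = Student()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

mel = torch.randn(4, 200, 80)           # dummy mel-filterbank batch (batch, frames, mels)
teacher_emb = torch.randn(4, 200, 768)  # stand-in for frozen pre-trained model outputs

loss = nn.functional.mse_loss(student(mel), teacher_emb)  # distillation loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```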
Related papers
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- End-to-End Speech Recognition and Disfluency Removal with Acoustic Language Model Pretraining [0.0]
We revisit the performance comparison between two-stage and end-to-end models.
We find that audio based language models pretrained using weak self-supervised objectives match or exceed the performance of similarly trained two-stage models.
arXiv Detail & Related papers (2023-09-08T17:12:14Z)
- A study on the impact of Self-Supervised Learning on automatic dysarthric speech assessment [6.284142286798582]
We show that HuBERT is the most versatile feature extractor across dysarthria classification, word recognition, and intelligibility classification, achieving +24.7%, +61%, and +7.2% accuracy, respectively, compared to classical acoustic features.
arXiv Detail & Related papers (2023-06-07T11:04:02Z)
- Pre-trained Model Representations and their Robustness against Noise for Speech Emotion Analysis [6.382013662443799]
We used multi-modal fusion representations from pre-trained models to achieve state-of-the-art speech emotion estimation.
We discovered that lexical representations are more robust to distortions compared to acoustic representations.
arXiv Detail & Related papers (2023-03-03T18:22:32Z)
- Ensemble knowledge distillation of self-supervised speech models [84.69577440755457]
Distilled self-supervised models have shown competitive performance and efficiency in recent years.
We performed Ensemble Knowledge Distillation (EKD) on various self-supervised speech models such as HuBERT, RobustHuBERT, and WavLM.
Our method improves the performance of the distilled models on four downstream speech processing tasks.
arXiv Detail & Related papers (2023-02-24T17:15:39Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- An Exploration of Self-Supervised Pretrained Representations for End-to-End Speech Recognition [98.70304981174748]
We focus on the general application of pretrained speech representations to advanced end-to-end automatic speech recognition (E2E-ASR) models.
We select several pretrained speech representations and present the experimental results on various open-source and publicly available corpora for E2E-ASR.
arXiv Detail & Related papers (2021-10-09T15:06:09Z)
- Avoiding Inference Heuristics in Few-shot Prompt-based Finetuning [57.4036085386653]
We show that prompt-based models for sentence pair classification tasks still suffer from a common pitfall of adopting inferences based on lexical overlap.
We then show that adding a regularization that preserves pretraining weights is effective in mitigating this destructive tendency of few-shot finetuning.
arXiv Detail & Related papers (2021-09-09T10:10:29Z)
- Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
However, little has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z)
- Unsupervised low-rank representations for speech emotion recognition [78.38221758430244]
We examine the use of linear and non-linear dimensionality reduction algorithms for extracting low-rank feature representations for speech emotion recognition.
We report speech emotion recognition (SER) results for learned representations on two databases using different classification methods.
arXiv Detail & Related papers (2021-04-14T18:30:58Z)
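For context on this last entry, a low-rank projection of frame-level acoustic features is straightforward to reproduce with scikit-learn. This linear (PCA) sketch uses dummy data and illustrative dimensions; the cited paper also studies non-linear reducers:

```python
import numpy as np
from sklearn.decomposition import PCA

# Dummy frame-level acoustic features (frames x feature dim); values are illustrative.
rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 384))

pca = PCA(n_components=32)              # linear low-rank projection
low_rank = pca.fit_transform(features)  # compact features for a downstream SER classifier
print(low_rank.shape, round(pca.explained_variance_ratio_.sum(), 3))
```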