Unsupervised Personalization of an Emotion Recognition System: The
Unique Properties of the Externalization of Valence in Speech
- URL: http://arxiv.org/abs/2201.07876v1
- Date: Wed, 19 Jan 2022 22:14:49 GMT
- Title: Unsupervised Personalization of an Emotion Recognition System: The
Unique Properties of the Externalization of Valence in Speech
- Authors: Kusha Sridhar and Carlos Busso
- Abstract summary: Adapting a speech emotion recognition system to a particular speaker is a hard problem, especially with deep neural networks (DNNs).
This study proposes an unsupervised approach to address this problem by searching for speakers in the train set with similar acoustic patterns as the speaker in the test set.
We propose three alternative adaptation strategies: unique speaker, oversampling, and weighting approaches.
- Score: 37.6839508524855
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: The prediction of valence from speech is an important, but challenging
problem. The externalization of valence in speech involves speaker-dependent
cues, which lead to prediction performance that is often significantly lower
than for other emotional attributes such as arousal and dominance. A
practical approach to improve valence prediction from speech is to adapt the
models to the target speakers in the test set. Adapting a speech emotion
recognition (SER) system to a particular speaker is a hard problem, especially
with deep neural networks (DNNs), since it requires optimizing millions of
parameters. This study proposes an unsupervised approach to this problem that
searches the train set for speakers whose acoustic patterns are similar to
those of the speaker in the test set. Speech samples from the selected
speakers are used to create the adaptation set. This approach leverages
transfer learning using pre-trained models, which are adapted with these speech
samples. We propose three alternative adaptation strategies: unique speaker,
oversampling, and weighting approaches. These methods differ in how the
adaptation set is used to personalize the valence models. The results
demonstrate that a valence prediction model can be efficiently personalized
with these unsupervised approaches, leading to relative improvements as high as
13.52%.
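To make the pipeline concrete, here is a minimal sketch of the two ingredients the abstract names: an unsupervised search for train-set speakers acoustically similar to the test speaker, and the three strategies (unique speaker, oversampling, weighting) for using the resulting adaptation set. The sketch is illustrative only: the mean-pooled speaker embeddings, cosine similarity, the choice of k, and the linear rank-decay weights are assumptions on top of the abstract, not the authors' implementation.

```python
import numpy as np

def select_similar_speakers(test_embeddings, train_speakers, k=5):
    """Rank train-set speakers by cosine similarity between their mean
    acoustic embedding and the test speaker's mean embedding (unsupervised:
    no emotion labels are used). Returns the k closest speaker ids."""
    query = test_embeddings.mean(axis=0)
    query /= np.linalg.norm(query)
    similarity = {}
    for speaker_id, embeddings in train_speakers.items():
        centroid = embeddings.mean(axis=0)
        centroid /= np.linalg.norm(centroid)
        similarity[speaker_id] = float(query @ centroid)  # cosine similarity
    return sorted(similarity, key=similarity.get, reverse=True)[:k]

def adaptation_weights(ranked_speakers, strategy="weighting"):
    """Map the ranked speakers to per-speaker weights, mirroring the three
    strategies named in the abstract:
      - "unique":       adapt only on the single most similar speaker
      - "oversampling": duplicate utterances in proportion to the weight
      - "weighting":    scale each utterance's loss by the weight
    The linear rank decay below is an illustrative assumption."""
    if strategy == "unique":
        return {ranked_speakers[0]: 1.0}
    k = len(ranked_speakers)
    raw = {spk: k - rank for rank, spk in enumerate(ranked_speakers)}
    total = sum(raw.values())
    return {spk: weight / total for spk, weight in raw.items()}
```

Under this reading, oversampling and weighting consume the same ranked speaker list but apply it at different points: the former when assembling the adaptation set (duplicating utterances from closer speakers), the latter inside the loss when fine-tuning the pre-trained valence model via transfer learning.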
Related papers
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Factorised Speaker-environment Adaptive Training of Conformer Speech Recognition Systems [31.813788489512394]
This paper proposes a novel factorised speaker-environment adaptive training and test time adaptation approach for Conformer ASR models.
Experiments on the 300-hr WHAM noise corrupted Switchboard data suggest that factorised adaptation consistently outperforms the baseline.
Further analysis shows the proposed method offers potential for rapid adaptation to unseen speaker-environment conditions.
arXiv Detail & Related papers (2023-06-26T11:32:05Z)
- Pre-Finetuning for Few-Shot Emotional Speech Recognition [20.894029832911617]
We view speaker adaptation as a few-shot learning problem.
We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives.
arXiv Detail & Related papers (2023-02-24T22:38:54Z)
- Supervised Acoustic Embeddings And Their Transferability Across Languages [2.28438857884398]
In speech recognition, it is essential to model the phonetic content of the input signal while discarding irrelevant factors such as speaker variations and noise.
Self-supervised pre-training has been proposed as a way to improve both supervised and unsupervised speech recognition.
arXiv Detail & Related papers (2023-01-03T09:37:24Z)
- Zero-Shot Personalized Speech Enhancement through Speaker-Informed Model Selection [25.05285328404576]
Optimizing speech enhancement towards a particular test-time speaker can improve performance and reduce run-time complexity.
We propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers.
Grouping the training set speakers into non-overlapping semantically similar groups is non-trivial and ill-defined.
arXiv Detail & Related papers (2021-05-08T00:15:57Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam [100.95498268200777]
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)