Pre-trained Model Representations and their Robustness against Noise for
Speech Emotion Analysis
- URL: http://arxiv.org/abs/2303.03177v1
- Date: Fri, 3 Mar 2023 18:22:32 GMT
- Title: Pre-trained Model Representations and their Robustness against Noise for
Speech Emotion Analysis
- Authors: Vikramjit Mitra, Vasudha Kowtha, Hsiang-Yun Sherry Chien, Erdrin
Azemi, Carlos Avendano
- Abstract summary: We used multi-modal fusion representations from pre-trained models to achieve state-of-the-art speech emotion estimation.
We discovered that lexical representations are more robust to distortions compared to acoustic representations.
- Score: 6.382013662443799
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pre-trained model representations have demonstrated state-of-the-art
performance in speech recognition, natural language processing, and other
applications. Models such as Bidirectional Encoder Representations from
Transformers (BERT) and Hidden-unit BERT (HuBERT) have enabled generating
lexical and acoustic representations that benefit speech recognition
applications. We investigated the use of pre-trained model representations for
estimating dimensional emotions, such as activation, valence, and dominance,
from speech. We observed that while valence may rely heavily on lexical
representations, activation and dominance rely mostly on acoustic information.
In this work, we used multi-modal fusion representations from pre-trained
models to achieve state-of-the-art speech emotion estimation, and we showed a
100% and 30% relative improvement in concordance correlation coefficient (CCC)
on valence estimation compared to standard acoustic and lexical baselines.
Finally, we investigated the robustness of pre-trained model representations
against noise and reverberation degradation and noticed that lexical and
acoustic representations are impacted differently. We discovered that lexical
representations are more robust to distortions compared to acoustic
representations, and demonstrated that knowledge distillation from a
multi-modal model helps to improve the noise-robustness of acoustic-based
models.
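The reported gains are in the concordance correlation coefficient (CCC), which jointly penalizes differences in correlation, scale, and mean between predicted and reference emotion ratings. The sketch below is a minimal NumPy illustration of the metric; the function name and example values are illustrative, not taken from the paper.

    import numpy as np

    def concordance_cc(pred, target):
        """Concordance correlation coefficient between predicted and
        reference dimensional emotion ratings (e.g., valence)."""
        pred = np.asarray(pred, dtype=float)
        target = np.asarray(target, dtype=float)
        mu_p, mu_t = pred.mean(), target.mean()
        var_p, var_t = pred.var(), target.var()
        cov = ((pred - mu_p) * (target - mu_t)).mean()
        # CCC = 2*cov / (var_p + var_t + (mu_p - mu_t)^2); 1.0 is perfect agreement.
        return 2.0 * cov / (var_p + var_t + (mu_p - mu_t) ** 2)

    # Illustrative only: a 100% relative improvement means the CCC doubles,
    # e.g., a baseline valence CCC of 0.3 rising to 0.6.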
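The abstract also notes that knowledge distillation from a multi-modal model improves the noise robustness of acoustic-only models. Below is a minimal sketch of what such a distillation objective could look like; the MSE losses, the weighting factor alpha, and the model names are assumptions for illustration, not the paper's exact training recipe.

    import torch
    import torch.nn as nn

    def distillation_loss(student_pred, teacher_pred, labels, alpha=0.5):
        """Combine the ground-truth regression loss with a term that pulls the
        acoustic-only student toward the multi-modal teacher's predictions.
        The losses and the 50/50 weighting are illustrative assumptions."""
        task_loss = nn.functional.mse_loss(student_pred, labels)
        distill_loss = nn.functional.mse_loss(student_pred, teacher_pred)
        return alpha * task_loss + (1.0 - alpha) * distill_loss

    # Hypothetical usage: the multi-modal teacher sees clean audio plus the
    # transcript and stays frozen, while the student sees noisy audio only.
    # with torch.no_grad():
    #     teacher_pred = multimodal_teacher(clean_audio, transcript)
    # student_pred = acoustic_student(noisy_audio)
    # loss = distillation_loss(student_pred, teacher_pred, labels)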
Related papers
- Incorporating Talker Identity Aids With Improving Speech Recognition in Adversarial Environments [0.2916558661202724]
We develop a transformer-based model that jointly performs speech recognition and speaker identification.
We show that the joint model performs comparably to Whisper under clean conditions.
Our results suggest that integrating voice representations with speech recognition can lead to more robust models under adversarial conditions.
arXiv Detail & Related papers (2024-10-07T18:39:59Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Probing self-supervised speech models for phonetic and phonemic information: a case study in aspiration [17.94683764469626]
We evaluate the extent to which these models' learned representations align with basic representational distinctions made by humans.
We find that robust representations of both phonetic and phonemic distinctions emerge in early layers of these models' architectures.
Our findings show that speech-trained HuBERT derives a low-noise and low-dimensional subspace corresponding to abstract phonological distinctions.
arXiv Detail & Related papers (2023-06-09T20:07:22Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Towards Disentangled Speech Representations [65.7834494783044]
We construct a representation learning task based on joint modeling of ASR and TTS.
We seek to learn a representation of audio that disentangles the part of the speech signal that is relevant to transcription from the part that is not.
We show that enforcing these properties during training improves WER by 24.5% relative on average for our joint modeling task.
arXiv Detail & Related papers (2022-08-28T10:03:55Z)
- Speech Emotion: Investigating Model Representations, Multi-Task Learning and Knowledge Distillation [6.382013662443799]
Estimating dimensional emotions from acoustic speech signals is challenging.
We show that pre-trained acoustic models can improve valence estimation from speech.
We report new state-of-the-art "text-free" acoustic-only dimensional emotion estimation.
arXiv Detail & Related papers (2022-07-02T17:34:44Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- Fine-grained Noise Control for Multispeaker Speech Synthesis [3.449700218265025]
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations.
Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors.
arXiv Detail & Related papers (2022-04-11T13:13:55Z)
- Conditional Diffusion Probabilistic Model for Speech Enhancement [101.4893074984667]
We propose a novel speech enhancement algorithm that incorporates characteristics of the observed noisy speech signal into the diffusion and reverse processes.
In our experiments, we demonstrate strong performance of the proposed approach compared to representative generative models.
arXiv Detail & Related papers (2022-02-10T18:58:01Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.