Affect Models Have Weak Generalizability to Atypical Speech
- URL: http://arxiv.org/abs/2504.16283v1
- Date: Tue, 22 Apr 2025 21:40:17 GMT
- Title: Affect Models Have Weak Generalizability to Atypical Speech
- Authors: Jaya Narain, Amrit Romana, Vikramjit Mitra, Colin Lea, Shirley Ren
- Abstract summary: We evaluate models for recognizing categorical and dimensional affect from speech on a dataset of atypical speech. We find that the output of affect models is significantly impacted by the presence and degree of speech atypicalities.
- Score: 6.392336908224424
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Speech and voice conditions can alter the acoustic properties of speech, which could impact the performance of paralinguistic affect models for people with atypical speech. We evaluate publicly available models for recognizing categorical and dimensional affect from speech on a dataset of atypical speech, comparing results to datasets of typical speech. We investigate three dimensions of speech atypicality: intelligibility, which is related to pronunciation; monopitch, which is related to prosody; and harshness, which is related to voice quality. We look at (1) distributional trends of categorical affect predictions within the dataset, (2) distributional comparisons of categorical affect predictions to similar datasets of typical speech, and (3) correlation strengths between text- and speech-based predictions of valence and arousal for spontaneous speech. We find that the output of affect models is significantly impacted by the presence and degree of speech atypicalities. For instance, the percentage of speech predicted as sad is significantly higher for all types and grades of atypical speech when compared to similar typical speech datasets. In a preliminary investigation into improving robustness for atypical speech, we find that fine-tuning models on pseudo-labeled atypical speech data improves performance on atypical speech without impacting performance on typical speech. Our results emphasize the need for broader training and evaluation datasets for speech emotion models, and for modeling approaches that are robust to voice and speech differences.
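The three analyses listed above reduce to simple statistics over per-utterance model outputs, and the robustness experiment relies on confidence-filtered pseudo-labels. The following Python sketch illustrates both patterns under an assumed, hypothetical dataframe schema; it is not the authors' code or data format.

```python
# Minimal sketch of the analysis pattern described in the abstract, not the
# authors' code. The dataframe schema and column names are hypothetical:
#   utterance_id     -- unique id of the utterance
#   group            -- "atypical" or "typical" (dataset or severity grade)
#   categorical_pred -- affect label from a public speech emotion model
#   pred_confidence  -- that model's confidence in the predicted label
#   speech_valence   -- dimensional valence predicted from the audio
#   text_valence     -- dimensional valence predicted from the transcript
import pandas as pd
from scipy.stats import chi2_contingency, spearmanr


def sad_rate_comparison(df: pd.DataFrame) -> float:
    """Test whether the share of utterances predicted as 'sad' differs
    between groups, using a chi-square test on a label-by-group table."""
    table = pd.crosstab(df["group"], df["categorical_pred"] == "sad")
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


def text_speech_agreement(df: pd.DataFrame) -> pd.Series:
    """Per-group Spearman correlation between text- and speech-based valence
    predictions; a weaker correlation for atypical speech would suggest the
    speech model reacts to acoustic atypicalities rather than affect."""
    return df.groupby("group").apply(
        lambda g: spearmanr(g["text_valence"], g["speech_valence"]).correlation
    )


def make_pseudo_labels(df: pd.DataFrame, threshold: float = 0.8) -> pd.DataFrame:
    """Keep atypical-speech utterances whose prediction confidence exceeds a
    threshold and reuse the predicted label as a fine-tuning target (the
    pseudo-labeling idea behind the preliminary robustness experiment)."""
    mask = (df["group"] == "atypical") & (df["pred_confidence"] >= threshold)
    return df.loc[mask, ["utterance_id", "categorical_pred"]].rename(
        columns={"categorical_pred": "pseudo_label"}
    )
```

Under this sketch, a significantly higher sad rate for atypical groups, or a weaker text-speech valence correlation, would mirror the effects the abstract reports; the same agreement statistic can be computed for arousal.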
Related papers
- Modeling speech emotion with label variance and analyzing performance across speakers and unseen acoustic conditions [4.507408840040573]
We demonstrate that using the probability density function of the emotion grades as targets provides better performance on benchmark evaluation sets.
We show that a saliency-driven foundation model (FM) representation selection helps to train a state-of-the-art speech emotion model.
We demonstrate that performance evaluation across multiple test sets and performance analysis across gender and speakers are useful in assessing the usefulness of emotion models.
arXiv Detail & Related papers (2025-03-24T06:13:27Z)
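As a minimal illustration of the soft-target idea in the label-variance entry above, the sketch below converts per-annotator emotion grades into an empirical distribution and trains against it with a cross-entropy loss. The binning, tensor shapes, and loss choice are illustrative assumptions, not that paper's implementation.

```python
# Hypothetical sketch of using the distribution of annotator grades as a
# soft training target instead of a single averaged label.
import torch
import torch.nn.functional as F


def grades_to_soft_target(grades: torch.Tensor, num_bins: int = 5) -> torch.Tensor:
    """Convert per-annotator grades in [0, 1] (shape [num_raters]) into a
    normalized histogram over num_bins, i.e. an empirical probability
    density over emotion grades."""
    bins = torch.clamp((grades * num_bins).long(), max=num_bins - 1)
    hist = torch.bincount(bins, minlength=num_bins).float()
    return hist / hist.sum()


def soft_label_loss(logits: torch.Tensor, soft_target: torch.Tensor) -> torch.Tensor:
    """Cross-entropy between the model's predicted distribution and the soft
    target (equivalent to KL divergence up to a constant)."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_target * log_probs).sum(dim=-1).mean()
```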
- Enhancing Indonesian Automatic Speech Recognition: Evaluating Multilingual Models with Diverse Speech Variabilities [9.473861847584843]
We present our research on state-of-the-art speech recognition models, namely Massively Multilingual Speech (MMS) and Whisper.
We investigate the models' predictive ability to transcribe Indonesian speech data across different variability groups.
arXiv Detail & Related papers (2024-10-11T14:07:07Z)
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- Interpreting Pretrained Speech Models for Automatic Speech Assessment of Voice Disorders [0.8796261172196743]
We train and compare two configurations of Audio Spectrogram Transformer in the context of Voice Disorder Detection.
We apply the attention rollout method to produce model relevance maps, i.e., the computed relevance of the spectrogram regions when the model makes predictions.
We use these maps to analyse how models make predictions in different conditions and to show that the spread of attention is reduced as a model is finetuned.
arXiv Detail & Related papers (2024-06-29T21:14:48Z)
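The attention rollout method named in this entry has a standard published formulation (Abnar & Zuidema, 2020): average the heads in each layer, add the identity to account for residual connections, renormalize, and compose the per-layer matrices. The sketch below is a generic implementation of that formulation under assumed tensor shapes, not code from the paper.

```python
# Generic attention rollout sketch (assumed shapes, not the paper's code).
import torch


def attention_rollout(attentions: list[torch.Tensor]) -> torch.Tensor:
    """attentions: one [num_heads, seq_len, seq_len] matrix per layer, e.g.
    from a spectrogram-patch transformer. Returns a [seq_len, seq_len] map of
    how much each input position contributes to each output position."""
    seq_len = attentions[0].size(-1)
    rollout = torch.eye(seq_len)
    for attn in attentions:
        attn_mean = attn.mean(dim=0)                            # average heads
        attn_aug = 0.5 * attn_mean + 0.5 * torch.eye(seq_len)   # residual path
        attn_aug = attn_aug / attn_aug.sum(dim=-1, keepdim=True)
        rollout = attn_aug @ rollout                            # compose layers
    return rollout
```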
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) is proposed for modeling spoken dialogue.
The model takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- LLMs and Finetuning: Benchmarking cross-domain performance for hate speech detection [10.014248704653]
This study investigates the effectiveness and adaptability of pre-trained and fine-tuned Large Language Models (LLMs) in identifying hate speech.
LLMs offer a huge advantage over the state-of-the-art even without pretraining.
We conclude with a vision for the future of hate speech detection, emphasizing cross-domain generalizability and appropriate benchmarking practices.
arXiv Detail & Related papers (2023-10-29T10:07:32Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Pre-Finetuning for Few-Shot Emotional Speech Recognition [20.894029832911617]
We view speaker adaptation as a few-shot learning problem.
We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives.
arXiv Detail & Related papers (2023-02-24T22:38:54Z)
- Predicting non-native speech perception using the Perceptual Assimilation Model and state-of-the-art acoustic models [9.858745856649998]
We present a new, open dataset of French- and English-speaking participants' speech perception behaviour for 61 vowel sounds.
We show that phoneme assimilation is a better predictor than fine-grained phonetic modelling for the discrimination behaviour as a whole.
We also show that wav2vec 2.0, while not good at capturing the effects of native language on speech perception, is complementary to information about native phoneme assimilation.
arXiv Detail & Related papers (2022-05-31T14:25:59Z)
- Statistical Analysis of Perspective Scores on Hate Speech Detection [7.447951461558536]
State-of-the-art hate speech classifiers are effective only when tested on data with the same feature distribution as the training data.
Given such diverse data distributions, relying on low-level features is the main cause of deficiency, due to natural bias in the data.
We show that different hate speech datasets are very similar when it comes to extracting their Perspective Scores.
arXiv Detail & Related papers (2021-06-22T17:17:35Z)
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.