Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and Affect
- URL: http://arxiv.org/abs/2505.21809v1
- Date: Tue, 27 May 2025 22:30:56 GMT
- Title: Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and Affect
- Authors: Jaya Narain, Vasudha Kowtha, Colin Lea, Lauren Tooley, Dianna Yee, Vikramjit Mitra, Zifang Huang, Miquel Espi Marques, Jon Huang, Carlos Avendano, Shirley Ren
- Abstract summary: Perceptual voice quality dimensions describe key characteristics of atypical speech and other speech modulations. We develop and evaluate voice quality models for seven voice and speech dimensions.
- Score: 6.284447200986156
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Perceptual voice quality dimensions describe key characteristics of atypical speech and other speech modulations. Here we develop and evaluate voice quality models for seven voice and speech dimensions (intelligibility, imprecise consonants, harsh voice, naturalness, monoloudness, monopitch, and breathiness). Probes were trained on the public Speech Accessibility (SAP) project dataset with 11,184 samples from 434 speakers, using embeddings from frozen pre-trained models as features. We found that our probes had both strong performance and strong generalization across speech elicitation categories in the SAP dataset. We further validated zero-shot performance on additional datasets, encompassing unseen languages and tasks: Italian atypical speech, English atypical speech, and affective speech. The strong zero-shot performance and the interpretability of results across an array of evaluations suggest the utility of voice quality dimensions in speaking-style-related tasks.
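The abstract's core recipe — probes trained on embeddings from frozen pre-trained models — is a standard probing setup. A minimal sketch of that setup, assuming wav2vec 2.0 as the frozen encoder, mean-pooled hidden states as utterance embeddings, and ridge regression as the probe (the abstract names none of these specifics; file paths and label structure here are hypothetical):

```python
# Minimal probing sketch (illustrative, not the paper's exact pipeline):
# a frozen wav2vec 2.0 encoder supplies utterance embeddings, and one
# ridge-regression probe is fit per voice quality dimension.
import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
from sklearn.linear_model import Ridge

DIMENSIONS = ["intelligibility", "imprecise_consonants", "harsh_voice",
              "naturalness", "monoloudness", "monopitch", "breathiness"]

extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")
encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base").eval()

@torch.no_grad()
def embed(wav_path: str) -> torch.Tensor:
    """Mean-pool the frozen encoder's final hidden states into one vector."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    hidden = encoder(**inputs).last_hidden_state   # (1, frames, dim)
    return hidden.mean(dim=1).squeeze(0)           # (dim,)

def train_probes(wav_paths, ratings):
    """ratings: dict mapping each dimension name to per-file perceptual scores."""
    X = torch.stack([embed(p) for p in wav_paths]).numpy()
    return {d: Ridge(alpha=1.0).fit(X, ratings[d]) for d in DIMENSIONS}
```

Zero-shot transfer as described in the abstract would then amount to applying the fitted probes to embeddings of unseen corpora (e.g., Italian atypical speech) without refitting.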
Related papers
- SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents [52.29009595100625]
Role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance.
Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios.
We construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations.
arXiv Detail & Related papers (2025-08-04T03:18:36Z)
- Affect Models Have Weak Generalizability to Atypical Speech [6.392336908224424]
We evaluate models for recognizing categorical and dimensional affect from speech on a dataset of atypical speech.
We find that the output of affect models is significantly impacted by the presence and degree of speech atypicalities.
arXiv Detail & Related papers (2025-04-22T21:40:17Z)
- QualiSpeech: A Speech Quality Assessment Dataset with Natural Language Reasoning and Descriptions [45.34333059156364]
We introduce QualiSpeech, a comprehensive low-level speech quality assessment dataset.
We also propose the QualiSpeech Benchmark to evaluate the low-level speech understanding capabilities of auditory large language models.
arXiv Detail & Related papers (2025-03-26T07:32:20Z)
- Classification of Spontaneous and Scripted Speech for Multilingual Audio [9.925703861731506]
Distinguishing scripted from spontaneous speech is an essential tool for better understanding how speech styles influence speech processing research.
This paper addresses the challenge of building a classifier that generalises well across different formats and languages.
We systematically evaluate models ranging from traditional, handcrafted acoustic and prosodic features to advanced audio transformers.
arXiv Detail & Related papers (2024-12-16T15:45:10Z)
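The entry above contrasts handcrafted acoustic/prosodic features with audio transformers. A minimal sketch of what the handcrafted-feature baseline could look like, assuming librosa for feature extraction and logistic regression as the classifier (the paper's actual feature set and model are not specified here):

```python
# Handcrafted prosodic baseline sketch for scripted-vs-spontaneous
# classification: a few pitch/energy statistics fed to logistic regression.
import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression

def prosodic_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16_000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    rms = librosa.feature.rms(y=y)[0]
    return np.array([
        np.nanmean(f0), np.nanstd(f0),  # pitch level and variability
        voiced.mean(),                  # voiced fraction (pausing proxy)
        rms.mean(), rms.std(),          # loudness level and variability
    ])

def train_classifier(wav_paths, labels):
    """labels: 0 = scripted, 1 = spontaneous, one per file."""
    X = np.stack([prosodic_features(p) for p in wav_paths])
    return LogisticRegression(max_iter=1000).fit(X, labels)
```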
- Speechworthy Instruction-tuned Language Models [71.8586707840169]
We show that both prompting and preference learning increase the speech-suitability of popular instruction-tuned LLMs.
We share lexical, syntactical, and qualitative analyses to showcase how each method contributes to improving the speech-suitability of generated responses.
arXiv Detail & Related papers (2024-09-23T02:34:42Z)
- EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation [83.29199726650899]
The EARS dataset comprises 107 speakers from diverse backgrounds, totaling 100 hours of clean, anechoic speech data.
The dataset covers a large range of different speaking styles, including emotional speech, different reading styles, non-verbal sounds, and conversational freeform speech.
We benchmark various methods for speech enhancement and dereverberation on the dataset and evaluate their performance through a set of instrumental metrics.
arXiv Detail & Related papers (2024-06-10T11:28:29Z)
- Paralinguistics-Enhanced Large Language Modeling of Spoken Dialogue [71.15186328127409]
The Paralinguistics-enhanced Generative Pretrained Transformer (ParalinGPT) takes the conversational context of text, speech embeddings, and paralinguistic attributes as input prompts within a serialized multitasking framework.
We utilize the Switchboard-1 corpus, including its sentiment labels as the paralinguistic attribute, as our spoken dialogue dataset.
arXiv Detail & Related papers (2023-12-23T18:14:56Z)
- Pre-Finetuning for Few-Shot Emotional Speech Recognition [20.894029832911617]
We view speaker adaptation as a few-shot learning problem.
We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives.
arXiv Detail & Related papers (2023-02-24T22:38:54Z)
- ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations [12.20522794248598]
We propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
We develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style.
Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its representation.
arXiv Detail & Related papers (2023-02-16T08:10:41Z)
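The ACE-VC summary above describes decomposing an utterance into linguistic content, speaker characteristics, and speaking style via a multi-task model. A generic sketch of that factorization idea — not the authors' architecture; all layer sizes and head names are invented for illustration:

```python
# Generic multi-task factorization sketch: one shared encoder with a
# frame-level content head and utterance-level speaker and style heads.
# Illustrates the decomposition idea only; it is not ACE-VC itself.
import torch
import torch.nn as nn

class FactorizedSpeechEncoder(nn.Module):
    def __init__(self, feat_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.shared = nn.GRU(feat_dim, hidden, batch_first=True)
        self.content_head = nn.Linear(hidden, 128)  # linguistic content
        self.speaker_head = nn.Linear(hidden, 64)   # speaker characteristics
        self.style_head = nn.Linear(hidden, 32)     # speaking style

    def forward(self, feats: torch.Tensor) -> dict:
        # feats: (batch, frames, feat_dim), e.g. log-mel features
        frames, _ = self.shared(feats)              # (batch, frames, hidden)
        pooled = frames.mean(dim=1)                 # utterance-level summary
        return {"content": self.content_head(frames),
                "speaker": self.speaker_head(pooled),
                "style": self.style_head(pooled)}
```

In such a setup, each head is trained against its own objective (e.g., phoneme prediction for content, speaker classification for the speaker head), which is what encourages the disentanglement.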
- The Ability of Self-Supervised Speech Models for Audio Representations [53.19715501273934]
Self-supervised learning (SSL) speech models have achieved unprecedented success in speech representation learning.
We conduct extensive experiments on abundant speech and non-speech audio datasets to evaluate the representation ability of state-of-the-art SSL speech models.
Results show that SSL speech models can extract meaningful features from a wide range of non-speech audio, though they may also fail on certain types of datasets.
arXiv Detail & Related papers (2022-09-26T15:21:06Z)
- Prediction of Listener Perception of Argumentative Speech in a Crowdsourced Data Using (Psycho-)Linguistic and Fluency Features [24.14001104126045]
We aim to predict TED talk-style affective ratings in a crowdsourced dataset of argumentative speech.
We present an effective approach to the classification task of predicting these categories through fine-tuning a model pre-trained on a large dataset of TED talk public speeches.
arXiv Detail & Related papers (2021-11-13T15:07:13Z)
- AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K speaker-specific parameters.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
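The small per-speaker parameter count in the AdaSpeech entry above reflects parameter-efficient adaptation; the original AdaSpeech paper achieves this with conditional layer normalization, where only a small speaker embedding conditions the normalization. A minimal sketch of that mechanism, with illustrative dimensions:

```python
# Conditional layer normalization sketch: scale and shift are generated
# from a small speaker embedding, so only that embedding (and optionally
# the two projections) needs to be adapted per speaker.
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    def __init__(self, hidden: int = 256, speaker_dim: int = 64):
        super().__init__()
        self.ln = nn.LayerNorm(hidden, elementwise_affine=False)
        self.to_scale = nn.Linear(speaker_dim, hidden)
        self.to_shift = nn.Linear(speaker_dim, hidden)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden); spk: (batch, speaker_dim)
        scale = self.to_scale(spk).unsqueeze(1)  # broadcast over time
        shift = self.to_shift(spk).unsqueeze(1)
        return self.ln(x) * scale + shift
```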