Speaker Embeddings as Individuality Proxy for Voice Stress Detection
- URL: http://arxiv.org/abs/2306.05915v1
- Date: Fri, 9 Jun 2023 14:11:07 GMT
- Title: Speaker Embeddings as Individuality Proxy for Voice Stress Detection
- Authors: Zihan Wu, Neil Scheidwasser-Clow, Karl El Hajal, Milos Cernak
- Abstract summary: Since the mental states of the speaker modulate speech, stress introduced by cognitive or physical loads could be detected in the voice.
The existing voice stress detection benchmark has shown that the audio embeddings extracted from the Hybrid BYOL-S self-supervised model perform well.
This paper presents the design and development of a voice stress detection system trained on more than 100 speakers from nine language groups and five different types of stress.
- Score: 14.332772222772668
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Since the mental states of the speaker modulate speech, stress introduced by
cognitive or physical loads could be detected in the voice. The existing voice
stress detection benchmark has shown that the audio embeddings extracted from
the Hybrid BYOL-S self-supervised model perform well. However, the benchmark
evaluates performance only on each dataset separately; it does not assess
performance across different types of stress or across languages.
Moreover, previous studies found strong individual differences in stress
susceptibility. This paper presents the design and development of a voice
stress detection system trained on more than 100 speakers from nine language
groups and five different types of stress. We address individual variability
in voice stress analysis by adding speaker embeddings to the Hybrid BYOL-S
features. The
proposed method significantly improves voice stress detection performance with
an input audio length of only 3-5 seconds.
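As a rough illustration of the fusion idea described in the abstract, the sketch below concatenates a per-utterance speaker embedding with a Hybrid BYOL-S audio embedding and trains a small binary stress classifier. The extractor functions, embedding sizes, and training loop are placeholders for illustration only, not the paper's actual pipeline.

```python
# Illustrative sketch (not the authors' code): fuse a per-utterance speaker
# embedding with a Hybrid BYOL-S audio embedding and train a small classifier.
# The embedding extractors are hypothetical stand-ins that return random
# vectors with plausible dimensionalities.
import numpy as np
import torch
import torch.nn as nn

BYOLS_DIM = 2048      # assumed Hybrid BYOL-S embedding size
SPEAKER_DIM = 192     # assumed speaker-embedding size (x-vector-like model)

def extract_byols_embedding(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for a Hybrid BYOL-S feature extractor (hypothetical)."""
    return np.random.randn(BYOLS_DIM).astype(np.float32)

def extract_speaker_embedding(waveform: np.ndarray) -> np.ndarray:
    """Stand-in for a pretrained speaker-verification embedder (hypothetical)."""
    return np.random.randn(SPEAKER_DIM).astype(np.float32)

class StressClassifier(nn.Module):
    """Small MLP over the concatenated [audio ; speaker] embedding."""
    def __init__(self, in_dim: int = BYOLS_DIM + SPEAKER_DIM, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(hidden, 2),     # binary: stressed vs. neutral
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Toy training loop on fake 3-5 s utterances (16 kHz) with random labels.
model = StressClassifier()
optim = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(10):
    wav = np.random.randn(16000 * 4)                        # ~4 s of audio
    feats = np.concatenate([extract_byols_embedding(wav),
                            extract_speaker_embedding(wav)])
    x = torch.from_numpy(feats).unsqueeze(0)
    y = torch.randint(0, 2, (1,))
    loss = loss_fn(model(x), y)
    optim.zero_grad()
    loss.backward()
    optim.step()
```

The key design point is simply that the classifier's input carries both the stress-sensitive audio representation and a proxy for speaker identity, so it can learn speaker-dependent baselines.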
Related papers
- SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models [60.72029578488467]
SpeechR is a unified benchmark for evaluating reasoning over speech in large audio-language models. It evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities.
arXiv Detail & Related papers (2025-08-04T03:28:04Z)
- AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation [55.607230723223346]
This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on …
arXiv Detail & Related papers (2025-07-17T00:39:18Z)
- Incorporating Linguistic Constraints from External Knowledge Source for Audio-Visual Target Speech Extraction [87.49303116989708]
We explore the potential of pre-trained speech-language models (PSLMs) and pre-trained language models (PLMs) as auxiliary knowledge sources for AV-TSE. In this study, we propose incorporating the linguistic constraints from PSLMs or PLMs for the AV-TSE model as additional supervision signals. Without any extra computational cost during inference, the proposed approach consistently improves speech quality and intelligibility.
arXiv Detail & Related papers (2025-06-11T14:36:26Z)
- StressTest: Can YOUR Speech LM Handle the Stress? [20.802090523583196]
Sentence stress refers to emphasis placed on specific words within a spoken utterance to highlight or contrast an idea, or to introduce new information. Recent advances in speech-aware language models (SLMs) have enabled direct processing of audio. Despite the crucial role of sentence stress in shaping meaning and speaker intent, it remains largely overlooked in evaluation and development of such models.
arXiv Detail & Related papers (2025-05-28T18:32:56Z)
- Voice Quality Dimensions as Interpretable Primitives for Speaking Style for Atypical Speech and Affect [6.284447200986156]
Perceptual voice quality dimensions describe key characteristics of atypical speech and other speech modulations. We develop and evaluate voice quality models for seven voice and speech dimensions.
arXiv Detail & Related papers (2025-05-27T22:30:56Z)
- Fine-Tuning Whisper for Inclusive Prosodic Stress Analysis [2.818750423530918]
This study explores fine-tuning OpenAI's Whisper large-v2 ASR model to recognize phrasal, lexical, and contrastive stress in speech.
Using a dataset of 66 native English speakers, we assess the model's ability to generalize stress patterns and classify speakers by neurotype and gender.
arXiv Detail & Related papers (2025-03-03T16:48:31Z)
- Detecting Syllable-Level Pronunciation Stress with A Self-Attention Model [0.0]
Knowing the stress level for each syllable of spoken English is important for English speakers and learners.
This paper presents a self-attention model to identify the stress level for each syllable of spoken English.
arXiv Detail & Related papers (2023-11-01T05:05:49Z)
- Emotional Listener Portrait: Realistic Listener Motion Simulation in Conversation [50.35367785674921]
Listener head generation centers on generating non-verbal behaviors of a listener in reference to the information delivered by a speaker.
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation.
We propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords.
Our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude.
arXiv Detail & Related papers (2023-09-29T18:18:32Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Language-Guided Audio-Visual Source Separation via Trimodal Consistency [64.0580750128049]
A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform.
We adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions.
We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets.
arXiv Detail & Related papers (2023-03-28T22:45:40Z)
- SPADE: Self-supervised Pretraining for Acoustic DisEntanglement [2.294014185517203]
We introduce a self-supervised approach to disentangle room acoustics from speech.
Our results demonstrate that our proposed approach significantly improves performance over a baseline when labeled training data is scarce.
arXiv Detail & Related papers (2023-02-03T01:36:38Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate within 1% of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Hybrid Handcrafted and Learnable Audio Representation for Analysis of Speech Under Cognitive and Physical Load [17.394964035035866]
We introduce a set of five datasets for task load detection in speech.
The voice recordings were collected as either cognitive or physical stress was induced in the cohort of volunteers.
We used the datasets to design and evaluate a novel self-supervised audio representation.
arXiv Detail & Related papers (2022-03-30T19:43:21Z)
- Unsupervised Personalization of an Emotion Recognition System: The Unique Properties of the Externalization of Valence in Speech [37.6839508524855]
Adapting a speech emotion recognition system to a particular speaker is a hard problem, especially with deep neural networks (DNNs).
This study proposes an unsupervised approach to address this problem by searching for speakers in the train set with similar acoustic patterns as the speaker in the test set.
We propose three alternative adaptation strategies: unique speaker, oversampling and weighting approaches.
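A minimal sketch of the speaker-matching idea summarized in this entry, under assumed feature shapes and names (not the paper's implementation): rank training speakers by cosine similarity of their mean acoustic feature vectors to the test speaker, then reweight the most similar speakers' data during adaptation.

```python
# Illustrative only: select/weight training speakers most similar to the test
# speaker via cosine similarity of mean acoustic feature vectors.
import numpy as np

def mean_speaker_vector(utterance_feats: list[np.ndarray]) -> np.ndarray:
    """Average frame-level features (e.g. MFCCs) over all of a speaker's utterances."""
    return np.mean(np.vstack(utterance_feats), axis=0)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def similarity_weights(train_speaker_means: dict[str, np.ndarray],
                       test_mean: np.ndarray,
                       top_k: int = 5) -> dict[str, float]:
    """Give non-zero weight only to the top-k most similar training speakers."""
    sims = {spk: cosine(vec, test_mean) for spk, vec in train_speaker_means.items()}
    top = sorted(sims, key=sims.get, reverse=True)[:top_k]
    total = sum(max(sims[s], 0.0) for s in top) or 1.0
    return {s: max(sims[s], 0.0) / total for s in top}

# Toy usage with random 39-dimensional feature vectors
rng = np.random.default_rng(0)
train_means = {f"spk{i}": rng.normal(size=39) for i in range(20)}
test_mean = rng.normal(size=39)
print(similarity_weights(train_means, test_mean))
```

The returned weights could drive either oversampling or loss weighting of the selected speakers, loosely corresponding to the adaptation strategies named above.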
arXiv Detail & Related papers (2022-01-19T22:14:49Z)
- Leveraging Pre-trained Language Model for Speech Sentiment Analysis [58.78839114092951]
We explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis.
We propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach.
arXiv Detail & Related papers (2021-06-11T20:15:21Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z)
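To make the self-adaptation idea in the last entry concrete, here is a hedged sketch in which a speaker representation is pooled from the noisy test utterance itself and concatenated to every spectral frame of a mask-estimating network. The module names, shapes, and the LSTM backbone are assumptions for illustration; the paper itself uses multi-head self-attention.

```python
# Illustrative sketch of speaker-aware self-adaptation for speech enhancement
# (hypothetical shapes, not the paper's architecture): a speaker embedding
# computed from the noisy test utterance conditions a mask estimator.
import torch
import torch.nn as nn

class SpeakerConditionedEnhancer(nn.Module):
    def __init__(self, n_freq: int = 257, spk_dim: int = 128, hidden: int = 400):
        super().__init__()
        # crude "speaker encoder": mean-pool frames, then project (stand-in)
        self.spk_proj = nn.Linear(n_freq, spk_dim)
        self.rnn = nn.LSTM(n_freq + spk_dim, hidden, batch_first=True)
        self.mask = nn.Sequential(nn.Linear(hidden, n_freq), nn.Sigmoid())

    def forward(self, noisy_mag: torch.Tensor) -> torch.Tensor:
        # noisy_mag: (batch, frames, n_freq) magnitude spectrogram
        spk = self.spk_proj(noisy_mag.mean(dim=1))              # (batch, spk_dim)
        spk = spk.unsqueeze(1).expand(-1, noisy_mag.size(1), -1)
        h, _ = self.rnn(torch.cat([noisy_mag, spk], dim=-1))
        return noisy_mag * self.mask(h)                         # masked spectrogram

# Toy usage: enhance a batch of 2 utterances, 300 frames each
model = SpeakerConditionedEnhancer()
enhanced = model(torch.rand(2, 300, 257))
print(enhanced.shape)  # torch.Size([2, 300, 257])
```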