Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias
- URL: http://arxiv.org/abs/2509.22061v1
- Date: Fri, 26 Sep 2025 08:43:25 GMT
- Title: Speak Your Mind: The Speech Continuation Task as a Probe of Voice-Based Model Bias
- Authors: Shree Harsha Bokkahalli Satish, Harm Lameris, Olivier Perrotin, Gustav Eje Henter, Éva Székely
- Abstract summary: Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving semantic context and speaker identity. We present the first systematic evaluation of bias in SC, investigating how gender and phonation type affect continuation behaviour.
- Score: 24.932603485660323
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech Continuation (SC) is the task of generating a coherent extension of a spoken prompt while preserving both semantic context and speaker identity. Because SC is constrained to a single audio stream, it offers a more direct setting for probing biases in speech foundation models than dialogue does. In this work we present the first systematic evaluation of bias in SC, investigating how gender and phonation type (breathy, creaky, end-creak) affect continuation behaviour. We evaluate three recent models: SpiritLM (base and expressive), VAE-GSLM, and SpeechGPT across speaker similarity, voice quality preservation, and text-based bias metrics. Results show that while both speaker similarity and coherence remain a challenge, textual evaluations reveal significant model and gender interactions: once coherence is sufficiently high (for VAE-GSLM), gender effects emerge on text-metrics such as agency and sentence polarity. In addition, continuations revert toward modal phonation more strongly for female prompts than for male ones, revealing a systematic voice-quality bias. These findings highlight SC as a controlled probe of socially relevant representational biases in speech foundation models, and suggest that it will become an increasingly informative diagnostic as continuation quality improves.
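The abstract reports speaker similarity between prompt and continuation as one of its evaluation axes. A common way to measure this is cosine similarity between speaker embeddings of the two audio segments; the sketch below illustrates that idea. It is a minimal, hypothetical example assuming an ECAPA-TDNN speaker encoder from SpeechBrain, which may differ from the embedding model actually used in the paper.

```python
# Hypothetical sketch: prompt-continuation speaker similarity via cosine
# similarity of speaker embeddings. The specific embedding model is an
# assumption (ECAPA-TDNN via SpeechBrain), chosen purely for illustration.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

# Load a pretrained speaker-verification encoder (downloads on first use)
encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_ecapa",
)

def speaker_similarity(prompt_wav: str, continuation_wav: str) -> float:
    """Cosine similarity between speaker embeddings of two audio files."""
    embeddings = []
    for path in (prompt_wav, continuation_wav):
        signal, sample_rate = torchaudio.load(path)
        # The encoder expects 16 kHz input; resample if necessary
        if sample_rate != 16000:
            signal = torchaudio.functional.resample(signal, sample_rate, 16000)
        embeddings.append(encoder.encode_batch(signal).squeeze())
    return torch.nn.functional.cosine_similarity(
        embeddings[0], embeddings[1], dim=0
    ).item()

# Usage: a value near 1.0 suggests the continuation preserves speaker identity,
# while lower values indicate speaker drift.
# sim = speaker_similarity("prompt.wav", "continuation.wav")
```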
Related papers
- On the Fallacy of Global Token Perplexity in Spoken Language Model Evaluation [88.77441715819366]
Generative spoken language models pretrained on large-scale raw audio can continue a speech prompt with appropriate content. We propose a variety of likelihood- and generative-based evaluation methods that serve in place of naive global token perplexity.
arXiv Detail & Related papers (2026-01-09T22:01:56Z) - Lost in Phonation: Voice Quality Variation as an Evaluation Dimension for Speech Foundation Models [22.710371114925763]
Speech foundation models (SFMs) have enabled the direct processing of spoken language from raw audio, bypassing intermediate textual representations. This capability allows SFMs to be exposed to, and potentially respond to, rich paralinguistic variations embedded in the input speech signal. We introduce a new parallel dataset featuring synthesized modifications to voice quality, designed to evaluate SFM responses to creaky and breathy voice.
arXiv Detail & Related papers (2025-10-29T14:44:44Z) - Chronological Thinking in Full-Duplex Spoken Dialogue Language Models [66.84843878538207]
Chronological Thinking aims to improve response quality in full-duplex SDLMs. It adds no additional latency: once the user stops speaking, the agent halts thinking and begins speaking without further delay. Experiments demonstrate the effectiveness of chronological thinking through both objective metrics and human evaluations.
arXiv Detail & Related papers (2025-10-02T10:28:11Z) - Acoustic-based Gender Differentiation in Speech-aware Language Models [3.9845890275228277]
Speech-aware Language Models (SpeechLMs) have fundamentally transformed human-AI interaction by enabling voice-based communication. This paper proposes a new dataset that enables systematic analysis of this phenomenon, containing 9,208 speech samples across three categories: Gender-Independent, Gender-Stereotypical, and Gender-Dependent.
arXiv Detail & Related papers (2025-09-25T13:15:01Z) - Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM [4.12691471378072]
This study proposes a methodology leveraging speaker assignment as an analytic tool for bias investigation. We evaluate Bark, a Text-to-Speech (TTS) model, analyzing its default speaker assignments for textual prompts. If Bark's speaker selection systematically aligns with gendered associations, it may reveal patterns in its training data or model design.
arXiv Detail & Related papers (2025-08-19T08:10:55Z) - SCDF: A Speaker Characteristics DeepFake Speech Dataset for Bias Analysis [1.2499537119440245]
The Speaker Characteristics DeepFake (SCDF) dataset contains over 237,000 utterances with a balanced representation of male and female speakers. We show that speaker characteristics significantly influence detection performance, revealing disparities across sex, language, age, and synthesizer type. These findings highlight the need for bias-aware development and provide a foundation for building non-discriminatory deepfake detection systems.
arXiv Detail & Related papers (2025-08-11T12:58:37Z) - SpeechR: A Benchmark for Speech Reasoning in Large Audio-Language Models [60.72029578488467]
SpeechR is a unified benchmark for evaluating reasoning over speech in large audio-language models. It evaluates models along three key dimensions: factual retrieval, procedural inference, and normative judgment. Evaluations on eleven state-of-the-art LALMs reveal that high transcription accuracy does not translate into strong reasoning capabilities.
arXiv Detail & Related papers (2025-08-04T03:28:04Z) - SpeechRole: A Large-Scale Dataset and Benchmark for Evaluating Speech Role-Playing Agents [72.79816494079833]
Role-playing agents have emerged as a promising paradigm for achieving personalized interaction and emotional resonance. Existing research primarily focuses on the textual modality, neglecting the critical dimension of speech in realistic interactive scenarios. We construct SpeechRole-Data, a large-scale, high-quality dataset that comprises 98 diverse roles and 112k speech-based single-turn and multi-turn conversations.
arXiv Detail & Related papers (2025-08-04T03:18:36Z) - Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z) - Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models [38.64792118903994]
We evaluate gender bias in SILLMs across four semantic-related tasks.
Our analysis reveals that bias levels are language-dependent and vary with different evaluation methods.
arXiv Detail & Related papers (2024-07-09T15:35:43Z) - TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separated encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z) - Time out of Mind: Generating Rate of Speech conditioned on emotion and speaker [0.0]
We train a GAN conditioned on emotion to generate word lengths for a given input text. These word lengths are relative to neutral speech and can be provided to a text-to-speech system to generate more expressive speech. We achieve better performance on objective measures for neutral speech, and better time alignment for happy speech, compared to an out-of-the-box model.
arXiv Detail & Related papers (2023-01-29T02:58:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.