Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations
- URL: http://arxiv.org/abs/2510.02352v1
- Date: Sat, 27 Sep 2025 16:21:22 GMT
- Title: Evaluating Bias in Spoken Dialogue LLMs for Real-World Decisions and Recommendations
- Authors: Yihao Wu, Tianrui Wang, Yizhou Peng, Yi-Wen Chao, Xuyi Zhuang, Xinsheng Wang, Shunshun Yin, Ziyang Ma,
- Abstract summary: Paralinguistic features, such as age, gender, and accent, can affect model outputs. Open-source models are more sensitive to age and gender, and recommendation tasks tend to amplify cross-group disparities. This work provides the first systematic study of biases in end-to-end spoken dialogue models.
- Score: 18.706521321659995
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While biases in large language models (LLMs), such as stereotypes and cultural tendencies in outputs, have been examined and identified, their presence and characteristics in spoken dialogue models (SDMs) with audio input and output remain largely unexplored. Paralinguistic features, such as age, gender, and accent, can affect model outputs; when compounded by multi-turn conversations, these effects may exacerbate biases, with potential implications for fairness in decision-making and recommendation tasks. In this paper, we systematically evaluate biases in speech LLMs and study the impact of multi-turn dialogues with repeated negative feedback. Bias is measured using Group Unfairness Score (GUS) for decisions and similarity-based normalized statistics rate (SNSR) for recommendations, across both open-source models like Qwen2.5-Omni and GLM-4-Voice, as well as closed-source APIs such as GPT-4o Audio and Gemini-2.5-Flash. Our analysis reveals that closed-source models generally exhibit lower bias, while open-source models are more sensitive to age and gender, and recommendation tasks tend to amplify cross-group disparities. We found that biased decisions may persist in multi-turn conversations. This work provides the first systematic study of biases in end-to-end spoken dialogue models, offering insights towards fair and reliable audio-based interactive systems. To facilitate further research, we release the FairDialogue dataset and evaluation code.
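The abstract's Group Unfairness Score (GUS) is not defined in this listing, but a common way to quantify the kind of cross-group decision disparity it describes is the largest gap in favorable-decision rates across demographic groups. The sketch below is illustrative only, not the paper's exact formulation; the group names and data are hypothetical.

```python
def group_unfairness(decisions: dict[str, list[int]]) -> float:
    """Illustrative group-unfairness measure (not the paper's exact GUS).

    decisions maps a demographic group name to a list of binary model
    decisions (1 = favorable outcome, e.g. loan approved). Returns the
    gap between the highest and lowest favorable-decision rates.
    """
    rates = {group: sum(d) / len(d) for group, d in decisions.items()}
    return max(rates.values()) - min(rates.values())


# Hypothetical loan-approval decisions for three accent groups.
decisions = {
    "accent_a": [1, 1, 1, 0],  # 75% favorable
    "accent_b": [1, 0, 0, 0],  # 25% favorable
    "accent_c": [1, 1, 0, 0],  # 50% favorable
}
print(group_unfairness(decisions))  # 0.5
```

A score of 0 would mean all groups receive favorable decisions at the same rate; larger values indicate greater cross-group disparity, matching the paper's finding that recommendation tasks tend to amplify such gaps.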
Related papers
- Predicting Biased Human Decision-Making with Large Language Models in Conversational Settings [0.48439699124726004]
We show that large language models (LLMs) can predict biased decision-making in conversational settings. We also show that their predictions capture not only human cognitive biases but also how those effects change under cognitive load.
arXiv Detail & Related papers (2026-01-16T07:30:21Z) - A Comprehensive Study of Implicit and Explicit Biases in Large Language Models [1.0555164678638427]
This study highlights the need to address biases in Large Language Models amid growing generative AI. We studied bias-specific benchmarks such as StereoSet and CrowS-Pairs to evaluate the existence of various biases in multiple generative models such as BERT and GPT-3.5. Results indicated fine-tuned models struggle with gender biases but excelled at identifying and avoiding racial biases.
arXiv Detail & Related papers (2025-11-18T05:27:17Z) - VOGUE: A Multimodal Dataset for Conversational Recommendation in Fashion [18.017186369021154]
VOGUE is a novel dataset of 60 human-human dialogues in realistic fashion shopping scenarios. Each dialogue is paired with a shared visual catalogue, item metadata, user fashion profiles and histories, and post-conversation ratings from both Seekers and Assistants. Our initial analyses of VOGUE reveal distinctive dynamics of visually grounded dialogue.
arXiv Detail & Related papers (2025-10-24T04:45:29Z) - BiasFreeBench: a Benchmark for Mitigating Bias in Large Language Model Responses [32.58830706120845]
Existing studies on bias mitigation methods for large language models (LLMs) use diverse baselines and metrics to evaluate debiasing performance. We introduce BiasFreeBench, an empirical benchmark that comprehensively compares eight mainstream bias mitigation techniques. We will publicly release our benchmark, aiming to establish a unified testbed for bias mitigation research.
arXiv Detail & Related papers (2025-09-30T19:56:54Z) - AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z) - BIPOLAR: Polarization-based granular framework for LLM bias evaluation [0.0]
This study proposes a reusable, granular, and topic-agnostic framework to evaluate polarisation-related biases in large language models. Our approach combines polarisation-sensitive sentiment metrics with a synthetically generated balanced dataset of conflict-related statements. As a case study, we created a synthetic dataset that focusses on the Russia-Ukraine war, and we evaluated the bias in several LLMs.
arXiv Detail & Related papers (2025-08-14T20:44:19Z) - CO-VADA: A Confidence-Oriented Voice Augmentation Debiasing Approach for Fair Speech Emotion Recognition [49.27067541740956]
We present CO-VADA, a Confidence-Oriented Voice Augmentation Debiasing Approach that mitigates bias without modifying model architecture or relying on demographic information. CO-VADA identifies training samples that reflect bias patterns present in the training data and then applies voice conversion to alter irrelevant attributes and generate samples. Our framework is compatible with various SER models and voice conversion tools, making it a scalable and practical solution for improving fairness in SER systems.
arXiv Detail & Related papers (2025-06-06T13:25:56Z) - $C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction [80.57232374640911]
We propose a model-agnostic strategy called Mask-And-Recover (MAR). MAR integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. To better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model.
arXiv Detail & Related papers (2025-04-01T13:01:30Z) - Bias Similarity Measurement: A Black-Box Audit of Fairness Across LLMs [35.72288339965426]
Bias Similarity Measurement (BSM) treats fairness as a relational property between models. Open-weight models can match or exceed proprietary systems. BSM offers an auditing workflow for procurement, regression testing, and lineage screening.
arXiv Detail & Related papers (2024-10-15T19:21:14Z) - Covert Bias: The Severity of Social Views' Unalignment in Language Models Towards Implicit and Explicit Opinion [0.40964539027092917]
We evaluate the severity of bias toward a view by using a biased model in edge cases of excessive bias scenarios.
Our findings reveal a discrepancy in LLM performance in identifying implicit and explicit opinions, with a general tendency of bias toward explicit opinions of opposing stances.
The direct, incautious responses of the unaligned models suggest a need for further refinement of decisiveness.
arXiv Detail & Related papers (2024-08-15T15:23:00Z) - Spoken Stereoset: On Evaluating Social Bias Toward Speaker in Speech Large Language Models [50.40276881893513]
This study introduces Spoken Stereoset, a dataset specifically designed to evaluate social biases in Speech Large Language Models (SLLMs).
By examining how different models respond to speech from diverse demographic groups, we aim to identify these biases.
The findings indicate that while most models show minimal bias, some still exhibit slightly stereotypical or anti-stereotypical tendencies.
arXiv Detail & Related papers (2024-08-14T16:55:06Z) - Listen and Speak Fairly: A Study on Semantic Gender Bias in Speech Integrated Large Language Models [38.64792118903994]
We evaluate gender bias in SILLMs across four semantic-related tasks.
Our analysis reveals that bias levels are language-dependent and vary with different evaluation methods.
arXiv Detail & Related papers (2024-07-09T15:35:43Z) - GPTBIAS: A Comprehensive Framework for Evaluating Bias in Large Language Models [83.30078426829627]
Large language models (LLMs) have gained popularity and are being widely adopted by a large user community.
The existing evaluation methods have many constraints, and their results exhibit a limited degree of interpretability.
We propose a bias evaluation framework named GPTBIAS that leverages the high performance of LLMs to assess bias in models.
arXiv Detail & Related papers (2023-12-11T12:02:14Z) - Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis [86.49858739347412]
Large Language Models (LLMs) have sparked intense debate regarding the prevalence of bias in these models and its mitigation.
We propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the decision process.
We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment.
arXiv Detail & Related papers (2023-11-15T00:02:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences of its use.