Related papers: Hearing the Order: Investigating Selection Bias in Large Audio-Language Models

URL: http://arxiv.org/abs/2510.00628v1
Date: Wed, 01 Oct 2025 08:00:58 GMT
Title: Hearing the Order: Investigating Selection Bias in Large Audio-Language Models
Authors: Yu-Xiang Lin, Chen-An Li, Sheng-Lun Wei, Po-Chun Chen, Hsin-Hsi Chen, Hung-yi Lee
Abstract summary: Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. In this paper, we identify and analyze this problem in LALMs.
Score: 51.69003519291754
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large audio-language models (LALMs) are often used in tasks that involve reasoning over ordered options. An open question is whether their predictions are influenced by the order of answer choices, which would indicate a form of selection bias and undermine their reliability. In this paper, we identify and analyze this problem in LALMs. We demonstrate that no model is immune to this bias through extensive experiments on six LALMs across three widely used benchmarks and their spoken counterparts. Shuffling the order of answer options can cause performance fluctuations of up to 24% and even change model rankings, raising concerns about the reliability of current evaluation practices. We also study permutation-based strategies and show that they can mitigate bias in most cases. Our work represents the first systematic investigation of this issue in LALMs, and we hope it raises awareness and motivates further research in this direction.
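
The abstract does not spell out which permutation-based strategies were studied, so the following is only a minimal sketch of one common variant: majority voting over all orderings of the answer options. The toy model, the `ITEMS` data, and every function name here are hypothetical and serve only to show how accuracy can swing with option order and how voting across permutations can damp that swing.

```python
from collections import Counter
from itertools import permutations
from typing import Callable, Dict, List, Sequence, Tuple

# A "model" is any callable mapping (question, options) -> index of chosen option.
Model = Callable[[str, Sequence[str]], int]
Item = Tuple[str, List[str], str]  # (question, options, correct option text)

# Hypothetical toy data: four questions with four options each.
ITEMS: List[Item] = [
    ("q1", ["cat", "dog", "bird", "fish"], "cat"),
    ("q2", ["red", "blue", "green", "gray"], "red"),
    ("q3", ["one", "two", "three", "four"], "two"),
    ("q4", ["north", "south", "east", "west"], "west"),
]


def make_biased_model(answer_key: Dict[str, str]) -> Model:
    """Toy stand-in for a position-biased model (an assumption, not a real LALM).

    It answers correctly only when the correct option sits in one of the
    first two positions; otherwise it falls back to the first option.
    """
    def model(question: str, options: Sequence[str]) -> int:
        pos = list(options).index(answer_key[question])
        return pos if pos < 2 else 0
    return model


def accuracy_under_order(model: Model, items: Sequence[Item],
                         order: Sequence[int]) -> float:
    """Accuracy when every item's options are presented in the given order."""
    correct = 0
    for question, options, answer in items:
        shown = [options[i] for i in order]
        correct += shown[model(question, shown)] == answer
    return correct / len(items)


def permutation_vote(model: Model, question: str,
                     options: Sequence[str]) -> str:
    """Query the model under every option ordering and return the most-voted option text."""
    votes: Counter = Counter()
    for order in permutations(range(len(options))):
        shown = [options[i] for i in order]
        votes[shown[model(question, shown)]] += 1
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    model = make_biased_model({q: a for q, _, a in ITEMS})
    print("original order:", accuracy_under_order(model, ITEMS, (0, 1, 2, 3)))
    print("reversed order:", accuracy_under_order(model, ITEMS, (3, 2, 1, 0)))
    voted = sum(permutation_vote(model, q, opts) == a for q, opts, a in ITEMS)
    print("permutation vote:", voted / len(ITEMS))
```

Voting over all orderings costs one model call per permutation (24 for four options), so in practice a subset such as cyclic shifts is often used instead; whether that matches the paper's setup is not stated in the abstract.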

Related papers

Exploiting Primacy Effect To Improve Large Language Models [1.03590082373586]
This study focuses on primacy bias in fine-tuned Large Language Models (LLMs). We first show that fine-tuning amplifies this bias, probably due to exposure to human-like patterns. We strategically leverage this effect by reordering response options based on semantic similarity to the query, without requiring knowledge of the correct answer.
arXiv Detail & Related papers (2025-07-18T14:18:18Z)
Self-ensemble: Mitigating Confidence Mis-calibration for Large Language Models [67.62810111789338]
Large Language Models exhibit a confidence distortion problem on multi-choice question-answering. We propose Self-ensemble to solve this problem. Experimental results on three LLMs and datasets demonstrate that Self-ensemble comprehensively addresses the confidence distortion problem.
arXiv Detail & Related papers (2025-06-02T17:59:29Z)
Judging with Many Minds: Do More Perspectives Mean Less Prejudice? On Bias Amplifications and Resistance in Multi-Agent Based LLM-as-Judge [70.89799989428367]
We conduct a systematic analysis of four diverse bias types: position bias, verbosity bias, chain-of-thought bias, and bandwagon bias. We evaluate these biases across two widely adopted multi-agent LLM-as-Judge frameworks: Multi-Agent-Debate and LLM-as-Meta-Judge.
arXiv Detail & Related papers (2025-05-26T03:56:41Z)
Diverging Preferences: When do Annotators Disagree and do Models Know? [92.24651142187989]
We develop a taxonomy of disagreement sources spanning 10 categories across four high-level classes. We find that the majority of disagreements are in opposition with standard reward modeling approaches. We develop methods for identifying diverging preferences to mitigate their influence on evaluation and training.
arXiv Detail & Related papers (2024-10-18T17:32:22Z)
Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models [16.252597615544317]
Video Language Models (VLMs) are designed to answer complex video-focused questions. Current benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias. This study is the first focused investigation of selection bias in video-to-text LLM-powered models.
arXiv Detail & Related papers (2024-10-18T07:52:22Z)
Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge [84.34545223897578]
Despite their excellence in many domains, potential issues are under-explored, undermining their reliability and the scope of their utility. We identify 12 key potential biases and propose a new automated bias quantification framework, CALM, which quantifies and analyzes each type of bias in LLM-as-a-Judge. Our work highlights the need for stakeholders to address these issues and remind users to exercise caution in LLM-as-a-Judge applications.
arXiv Detail & Related papers (2024-10-03T17:53:30Z)
Evaluating Nuanced Bias in Large Language Model Free Response Answers [8.775925011558995]
We identify several kinds of nuanced bias in free text that cannot be identified by multiple choice tests. We present a semi-automated pipeline for detecting these types of bias by first eliminating answers that can be automatically classified as unbiased.
arXiv Detail & Related papers (2024-07-11T19:58:13Z)
Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored. This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, Stag Hunt and Prisoner's Dilemma. Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one of the following systematic biases.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs). This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias". We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
Large Language Models Sensitivity to The Order of Options in Multiple-Choice Questions [5.187383020960245]
Large Language Models (LLMs) have demonstrated remarkable capabilities in various NLP tasks. Previous works have shown these models are sensitive towards prompt wording, and few-shot demonstrations and their order. This paper investigates sensitivity of LLMs towards the order of options in multiple-choice questions.
arXiv Detail & Related papers (2023-08-22T14:54:59Z)
