ABCD: All Biases Come Disguised
- URL: http://arxiv.org/abs/2602.17445v1
- Date: Thu, 19 Feb 2026 15:12:33 GMT
- Title: ABCD: All Biases Come Disguised
- Authors: Mateusz Nowak, Xavier Cadet, Peter Chin
- Abstract summary: Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice.
We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels.
We show that this protocol substantially improves robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in mean model performance.
- Score: 4.603755953026689
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice for measuring LLMs' ability to reason and answer knowledge-based questions. Through a synthetic NonsenseQA benchmark, we observe that different LLMs exhibit varying degrees of label, position, and few-shot-prompt bias, where the model uses the answer position, the label in front of the answer, the distribution of correct answers in the few-shot prompt, or a combination of these to answer each question. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels and prompts the LLM to respond with the whole answer presented. With a simple sentence-similarity model, we demonstrate improved robustness and lower standard deviation across different permutations of answers with a minimal drop in the LLM's performance, exposing the LLM's capabilities under reduced evaluation artifacts, without any help from the prompt examples or the option labels. Across multiple benchmarks and models, this protocol substantially improves robustness to answer permutations, reducing mean accuracy variance $3\times$ with only a minimal decrease in mean model performance. Through ablation studies on various embedding models and similarity functions, we show that the method is more robust than standard label-based protocols.
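The scoring step of a protocol like the one described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: options are shuffled and shown with uniform bullets instead of ordered A/B/C/D labels, the model answers in free text, and the answer is mapped back to an option by similarity. A bag-of-words cosine similarity stands in for the sentence-similarity embedding model; all function names are illustrative.

```python
# Sketch of a bias-reduced MCQ evaluation step: unordered, unlabeled options,
# free-text answers, and similarity-based answer matching.
from collections import Counter
import math
import random

def cosine_sim(a: str, b: str) -> float:
    """Cosine similarity over token counts (stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

def format_prompt(question: str, options: list[str], rng: random.Random) -> str:
    """Present options shuffled and with uniform bullets (no A/B/C/D labels)."""
    shuffled = options[:]
    rng.shuffle(shuffled)
    bullets = "\n".join(f"- {opt}" for opt in shuffled)
    return f"{question}\nAnswer with the full text of one option:\n{bullets}"

def score_answer(model_output: str, options: list[str], correct: str) -> bool:
    """Map the free-form output to the most similar option, then check it."""
    best = max(options, key=lambda opt: cosine_sim(model_output, opt))
    return best == correct
```

Because the model never sees option labels or a fixed order, label and position biases have no signal to latch onto; robustness can then be measured as the variance of accuracy across repeated shuffles.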
Related papers
- Quantifying and Mitigating Selection Bias in LLMs: A Transferable LoRA Fine-Tuning and Efficient Majority Voting Approach [13.829059542429876]
Multiple Choice Question (MCQ) answering is a widely used method for evaluating the performance of Large Language Models (LLMs).
LLMs often exhibit selection bias in MCQ tasks, where their choices are influenced by factors like answer position or option symbols rather than the content.
arXiv Detail & Related papers (2025-11-17T21:31:37Z)
- Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models [2.393011821499345]
We investigate the presence and nature of selection bias in Large Vision-Language Models (LVLMs).
We propose an inference-time logit-level debiasing method that estimates an ensemble bias vector from general and contextual prompts.
Our method mitigates bias without retraining and is compatible with frozen LVLMs.
arXiv Detail & Related papers (2025-09-20T20:45:47Z)
- The Non-Determinism of Small LLMs: Evidence of Low Answer Consistency in Repetition Trials of Standard Multiple-Choice Benchmarks [0.013048920509133805]
We present a study of known, open-source LLMs responding to 10 repetitions of questions from the MMLU-Redux and MedQA benchmarks.
Results show that the number of questions that can be answered consistently varies considerably among models.
Results for medium-sized models seem to indicate much higher levels of answer consistency.
arXiv Detail & Related papers (2025-09-05T17:31:14Z)
- CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward [50.97588334916863]
We develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward.
It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, with the capability to process various answer types.
We introduce the VerifierBench benchmark, comprising model outputs collected from multiple data sources and augmented through manual analysis of meta-error patterns to enhance CompassVerifier.
arXiv Detail & Related papers (2025-08-05T17:55:24Z)
- Threading the Needle: Reweaving Chain-of-Thought Reasoning to Explain Human Label Variation [60.18907916989796]
Large Language Models (LLMs) generate chains of thought (CoTs) before giving the final answer.
We propose a novel pipeline enriched with linguistically-grounded discourse segmenters to extract supporting and opposing statements for each answer option.
We also propose a rank-based human label variation (HLV) evaluation framework that prioritizes the ranking of answers over exact scores.
arXiv Detail & Related papers (2025-05-29T11:47:18Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs).
We show that multiple factors can significantly impact the reported performance of LLMs.
We analyze whether existing answer extraction methods are aligned with human judgment.
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- LLMs Can Generate a Better Answer by Aggregating Their Own Responses [83.69632759174405]
Large Language Models (LLMs) have shown remarkable capabilities across tasks, yet they often require additional prompting techniques when facing complex problems.
We argue this limitation stems from the fact that common LLM post-training procedures lack explicit supervision for discriminative judgment tasks.
We propose Generative Self-Aggregation (GSA), a novel prompting method that improves answer quality without requiring the model's discriminative capabilities.
arXiv Detail & Related papers (2025-03-06T05:25:43Z)
- Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models [16.252597615544317]
Video Language Models (VLMs) are designed to answer complex video-focused questions.
Current benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias.
This study is the first focused investigation of selection bias in video-to-text LLM-powered models.
arXiv Detail & Related papers (2024-10-18T07:52:22Z)
- LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs [61.57691505683534]
Non-Factoid (NF) Question Answering (QA) is challenging to evaluate due to its diverse potential answers and the lack of an objective criterion.
Large Language Models (LLMs) have been resorted to for NFQA evaluation due to their compelling performance on various NLP tasks.
We propose a novel listwise NFQA evaluation approach that utilizes LLMs to rank candidate answers within a list of reference answers sorted by descending quality.
arXiv Detail & Related papers (2024-09-23T06:42:21Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
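The label-free debiasing idea of the last entry can be sketched as follows. This is an illustration in the spirit of PriDe, not its implementation: the model's prior over option positions is estimated by averaging its position probabilities over content permutations, then divided out of the observed prediction. The `get_probs` hook is a hypothetical stand-in for querying a model's option probabilities.

```python
# Sketch of inference-time debiasing: separate the model's prior bias over
# option positions from the overall prediction distribution, then remove it.
import itertools

def estimate_prior(get_probs, contents: list[str]) -> list[float]:
    """Average position probabilities over all content permutations.

    `get_probs(contents)` is a hypothetical hook returning the model's
    probability for each option position under that content ordering.
    """
    n = len(contents)
    totals = [0.0] * n
    perms = list(itertools.permutations(contents))
    for perm in perms:
        probs = get_probs(list(perm))
        for i in range(n):
            totals[i] += probs[i]
    return [t / len(perms) for t in totals]

def debias(get_probs, contents: list[str]) -> list[float]:
    """Divide observed probabilities by the position prior and renormalize."""
    prior = estimate_prior(get_probs, contents)
    observed = get_probs(contents)
    scores = [o / p if p > 0 else 0.0 for o, p in zip(observed, prior)]
    z = sum(scores)
    return [s / z for s in scores] if z else scores
```

Enumerating all permutations costs n! model calls, so practical variants estimate the prior from a small sample of questions or permutations instead.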
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.