Unexplored flaws in multiple-choice VQA evaluations
- URL: http://arxiv.org/abs/2511.22341v1
- Date: Thu, 27 Nov 2025 11:25:38 GMT
- Title: Unexplored flaws in multiple-choice VQA evaluations
- Authors: Fabio Rosenthal, Sebastian Schmidt, Thorsten Graf, Thorsten Bagodonat, Stephan Günnemann, Leo Schwinn
- Abstract summary: Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). We highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations.
- Score: 42.62741466222976
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) demonstrate strong capabilities in handling image-text inputs. A common way to assess this ability is through multiple-choice Visual Question Answering (VQA). Earlier works have already revealed that these benchmarks are sensitive to answer choice order, a limitation that can be mitigated through careful design. Yet, we highlight additional, unexplored biases in prompt formatting that question the reliability of current MLLM evaluations. Specifically, we identify three key variation factors in prompt formatting and analyze their impact through a large-scale study involving seven MLLMs and five VQA datasets, spanning 48 distinct prompt format variations. Our findings reveal that multiple-choice VQA is highly sensitive to minor prompt format changes, even when these changes are semantically neutral. We further demonstrate that these biases persist independently of known order biases or the MLLM's confidence in the correct answer. Finally, we show that existing bias mitigation strategies fail to address these newly identified biases.
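As a rough illustration of the kind of study described above, the sketch below sweeps a grid of prompt format variants and measures per-format accuracy. The three formatting factors (option label style, separator, instruction phrasing) are illustrative assumptions rather than the paper's exact factors, and `ask_mllm` is a hypothetical model call.

```python
# Hedged sketch of a prompt-format sensitivity sweep. The three factors
# below (label style, separator, instruction) are illustrative assumptions,
# not necessarily the ones the paper studies; `ask_mllm` is a hypothetical
# call that takes (image, prompt) and returns the model's answer text.
from itertools import product

LABEL_STYLES = {"paren": "({}) {}", "dot": "{}. {}", "colon": "{}: {}"}
SEPARATORS = {"newline": "\n", "semicolon": "; "}
INSTRUCTIONS = [
    "Answer with the letter of the correct option.",
    "Select one option and reply with its letter only.",
]

def format_prompt(question, options, style, sep, instruction):
    lines = [LABEL_STYLES[style].format(label, opt)
             for label, opt in zip("ABCD", options)]
    return f"{question}{SEPARATORS[sep]}{SEPARATORS[sep].join(lines)}\n{instruction}"

def accuracy_per_format(dataset, ask_mllm):
    """dataset: list of (image, question, options, gold_letter) tuples."""
    results = {}
    for style, sep, instr in product(LABEL_STYLES, SEPARATORS, INSTRUCTIONS):
        correct = sum(
            ask_mllm(img, format_prompt(q, opts, style, sep, instr))
            .strip().upper().startswith(gold)
            for img, q, opts, gold in dataset
        )
        results[(style, sep, instr)] = correct / len(dataset)
    return results  # a wide spread across keys signals format sensitivity
```

The quantity of interest is the spread (max minus min accuracy) across variants that are semantically neutral.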
Related papers
- ABCD: All Biases Come Disguised [4.603755953026689]
Multiple-choice question (MCQ) benchmarks have been a standard evaluation practice. We propose a simple bias-reduced evaluation protocol that replaces the labels of each question with uniform, unordered labels. We show that this protocol substantially improves robustness to answer permutations, reducing mean accuracy variance 3× with only a minimal decrease in mean model performance.
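A minimal sketch of what such a protocol could look like, assuming "uniform, unordered labels" means every option is rendered with the same neutral marker and the model answers with option text; `ask_model` is a hypothetical free-text model call.

```python
# Hedged sketch of a uniform-label MCQ protocol: every option gets the same
# neutral bullet (no ordinal A/B/C/D), so no option ID can carry position
# information. `ask_model` is a hypothetical text-in/text-out model call.
def uniform_label_prompt(question, options):
    body = "\n".join(f"- {opt}" for opt in options)
    return f"{question}\n{body}\nReply with the exact text of one option."

def accuracy_per_rotation(samples, ask_model):
    """samples: list of (question, options, gold_text); all questions are
    assumed to have the same number of options for this illustration."""
    n = len(samples[0][1])
    accs = []
    for shift in range(n):  # cyclic permutations of the option order
        correct = 0
        for question, options, gold in samples:
            rotated = options[shift:] + options[:shift]
            answer = ask_model(uniform_label_prompt(question, rotated))
            correct += int(answer.strip() == gold)
        accs.append(correct / len(samples))
    return accs  # low variance across rotations = permutation robustness
```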
arXiv Detail & Related papers (2026-02-19T15:12:33Z)
- Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models [2.393011821499345]
We investigate the presence and nature of selection bias in Large Vision-Language Models (LVLMs). We propose an inference-time logit-level debiasing method that estimates an ensemble bias vector from general and contextual prompts. Our method mitigates bias without retraining and is compatible with frozen LVLMs.
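A sketch of how such logit-level debiasing could look, assuming the "ensemble bias vector" is an average of option-ID logits collected from content-free ("general") and image-conditioned ("contextual") probe prompts; `option_logits` is a hypothetical call returning one logit per option ID from the frozen LVLM.

```python
# Hedged sketch of inference-time logit debiasing for a frozen LVLM.
# `option_logits(model, prompt)` is a hypothetical call returning one
# logit per option ID (A-D); the probe-prompt design is an assumption.
import numpy as np

def estimate_bias_vector(model, general_prompts, contextual_prompts, option_logits):
    probes = general_prompts + contextual_prompts  # ensemble of probe prompts
    logits = [option_logits(model, p) for p in probes]
    return np.mean(logits, axis=0)  # per-option-ID bias estimate

def debiased_answer(model, prompt, bias, option_logits):
    corrected = np.asarray(option_logits(model, prompt)) - bias
    return "ABCD"[int(np.argmax(corrected))]  # no retraining required
```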
arXiv Detail & Related papers (2025-09-20T20:45:47Z)
- Revisiting LLM Value Probing Strategies: Are They Robust and Expressive? [81.49470136653665]
We evaluate the robustness and expressiveness of value representations across three widely used probing strategies. We show that demographic context has little effect on free-text generation, and the models' values only weakly correlate with their preference for value-based actions.
arXiv Detail & Related papers (2025-07-17T18:56:41Z)
- Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering [78.89231943329885]
Multiple-Choice Question Answering (MCQA) is widely used to evaluate Large Language Models (LLMs). We show that multiple factors can significantly impact the reported performance of LLMs. We analyze whether existing answer extraction methods are aligned with human judgment.
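The extraction question is easy to see concretely: two common heuristics can disagree on the same output and thus shift reported accuracy. The patterns below are illustrative, not the paper's.

```python
# Two illustrative answer-extraction heuristics that can disagree.
import re

def extract_first_letter(output: str):
    # Naive: take the first standalone A-D token anywhere in the output.
    m = re.search(r"\b([A-D])\b", output)
    return m.group(1) if m else None

def extract_verdict(output: str):
    # Prefer explicit verdict phrasing like "the answer is (C)".
    m = re.search(r"answer\s+is\s*:?\s*\(?([A-D])\)?", output, re.IGNORECASE)
    return m.group(1) if m else extract_first_letter(output)

output = "A careful reading shows the answer is C."
print(extract_first_letter(output), extract_verdict(output))  # A C
```

Whether a grader uses the first or the second heuristic decides if this response counts as correct.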
arXiv Detail & Related papers (2025-03-19T08:45:03Z)
- Answer, Assemble, Ace: Understanding How LMs Answer Multiple Choice Questions [103.20281438405111]
Multiple-choice question answering (MCQA) is a key competence of performant transformer language models. We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information for predicting the correct answer. We show that subsequent layers increase the probability of the predicted answer symbol in vocabulary space, and that this probability increase is associated with a sparse set of attention heads with unique roles.
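"Vocabulary projection" can be sketched with a logit-lens-style probe: project each layer's hidden state at the final position through the unembedding matrix and watch the probability of the answer symbol grow across layers. The sketch below uses GPT-2 as a stand-in model; the paper's models and prompts will differ.

```python
# Logit-lens-style vocabulary projection sketch (GPT-2 as a stand-in).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt = "Question: What color is the sky?\nA. blue\nB. green\nAnswer:"
target_id = tok.encode(" A")[0]  # the answer symbol we track

with torch.no_grad():
    out = model(**tok(prompt, return_tensors="pt"), output_hidden_states=True)

W_U = model.get_output_embeddings().weight  # (vocab, hidden) unembedding
for layer, h in enumerate(out.hidden_states):
    # Apply the final layer norm before projecting, as in the usual logit lens.
    state = model.transformer.ln_f(h[0, -1])
    prob = torch.softmax(state @ W_U.T, dim=-1)[target_id].item()
    print(f"layer {layer:2d}: P(' A') = {prob:.4f}")
```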
arXiv Detail & Related papers (2024-07-21T00:10:23Z)
- FSM: A Finite State Machine Based Zero-Shot Prompting Paradigm for Multi-Hop Question Answering [26.398873686905063]
Large Language Models (LLMs) with chain-of-thought (CoT) prompting have demonstrated impressive abilities on simple natural language inference tasks.
We propose a prompting method, Finite State Machine (FSM), to enhance the reasoning capabilities of LLMs on complex tasks.
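A heavily hedged reconstruction of FSM-style prompting: instead of one free-form chain of thought, the model is driven through explicit states (decompose, answer sub-question, finalize). The state set and prompt wording are illustrative assumptions, not the paper's exact design; `llm` is a hypothetical text-in/text-out call.

```python
# Illustrative FSM-style prompting loop for multi-hop QA. States and
# prompts are assumptions; `llm` is a hypothetical model call.
def fsm_answer(question, llm, max_steps=5):
    context = []  # answered sub-questions accumulate here
    state = "DECOMPOSE"
    for _ in range(max_steps):
        if state == "DECOMPOSE":
            sub_q = llm(f"Question: {question}\nKnown: {context}\n"
                        "State the next sub-question to solve, or say DONE.")
            state = "FINALIZE" if "DONE" in sub_q else "ANSWER_SUB"
        elif state == "ANSWER_SUB":
            sub_a = llm(f"Answer concisely: {sub_q}")
            context.append((sub_q, sub_a))
            state = "DECOMPOSE"
        elif state == "FINALIZE":
            break
    return llm(f"Question: {question}\nFacts: {context}\nFinal answer:")
```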
arXiv Detail & Related papers (2024-07-03T10:01:01Z)
- Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
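A sketch of the PriDe idea under stated assumptions: the observed distribution over option IDs is treated as (prior over IDs) × (preference over contents), the prior is estimated by averaging predictions over cyclic permutations of the option contents on a few samples, and it is divided out at test time. `option_probs` is a hypothetical call returning the model's probability for each option ID.

```python
# Hedged sketch of PriDe-style label debiasing; permutation scheme and
# combination rule are simplified. `option_probs(question, options)` is a
# hypothetical call returning probabilities over option IDs (A-D).
import numpy as np

def estimate_prior(samples, option_probs, n_options=4):
    obs = []
    for question, options in samples:
        for shift in range(n_options):  # cyclic permutations of contents
            permuted = options[shift:] + options[:shift]
            obs.append(option_probs(question, permuted))
    prior = np.mean(obs, axis=0)
    return prior / prior.sum()

def debiased_choice(question, options, prior, option_probs):
    p = np.asarray(option_probs(question, options))
    scores = p / prior  # divide out the estimated option-ID prior
    return int(np.argmax(scores))
```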
arXiv Detail & Related papers (2023-09-07T17:44:56Z)
- Forward-Backward Reasoning in Large Language Models for Mathematical Verification [65.9495774606273]
Self-Consistency samples diverse reasoning chains with answers and chooses the final answer by majority voting.
We introduce backward reasoning to verify candidate answers.
FOrward and BAckward Reasoning (FOBAR) for verification achieves state-of-the-art performance.
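A hedged sketch of how forward and backward scores might combine: the forward score is the self-consistency vote share of each candidate answer, and the backward score checks how often the model recovers a masked number from the question when told the candidate. The masking scheme, output parsing, and multiplicative combination are illustrative; `llm_sample` is a hypothetical stochastic model call.

```python
# Illustrative forward-backward verification; parsing and combination
# are simplifications. `llm_sample(prompt)` returns one sampled completion.
from collections import Counter

def forward_scores(question, llm_sample, k=10):
    outs = [llm_sample(f"{question}\nThink step by step, end with 'Answer: <value>'.")
            for _ in range(k)]
    finals = [o.rsplit("Answer:", 1)[-1].strip() for o in outs]
    return {a: c / k for a, c in Counter(finals).items()}  # vote shares

def backward_score(question, masked_number, candidate, llm_sample, k=5):
    masked_q = question.replace(masked_number, "x", 1)
    prompt = (f"{masked_q}\nIf the answer is {candidate}, what is x? "
              "End with 'x = <value>'.")
    hits = sum(llm_sample(prompt).rsplit("x =", 1)[-1].strip() == masked_number
               for _ in range(k))
    return hits / k  # recovery rate of the masked number

def fobar_choice(question, masked_number, llm_sample):
    fwd = forward_scores(question, llm_sample)
    scored = {a: fwd[a] * backward_score(question, masked_number, a, llm_sample)
              for a in fwd}
    return max(scored, key=scored.get)
```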
arXiv Detail & Related papers (2023-08-15T13:19:59Z)
- Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy [60.18632773935895]
Spreading probability mass across multiple surface forms with identical meaning, known as surface form competition (SFC), is thought to cause an underestimation of a model's true performance.
We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time.
We identify a simple method for reducing it: increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example.
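Point (a) is easy to probe directly: score the total probability mass the model places on the given choices with and without listing them in the prompt. The sketch below uses GPT-2 as a stand-in scoring model; the templates are illustrative.

```python
# Measuring probability mass on the given answer choices, with and without
# the choices shown in the prompt (GPT-2 as a stand-in scoring model).
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def seq_logprob(prompt, continuation):
    """Log-probability of `continuation` given `prompt`. The continuation
    should start with a space so BPE boundaries line up at the join."""
    ids = tok(prompt + continuation, return_tensors="pt").input_ids
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    with torch.no_grad():
        logps = torch.log_softmax(model(ids).logits[0, :-1], dim=-1)
    cont = ids[0, n_prompt:]
    rows = torch.arange(n_prompt - 1, ids.shape[1] - 1)
    return logps[rows, cont].sum().item()

question = "What color is the sky on a clear day?"
choices = [" blue", " green"]
bare = f"Q: {question}\nA:"
listed = f"Q: {question}\nChoices: blue, green\nA:"
for label, prompt in [("without choices", bare), ("with choices", listed)]:
    mass = sum(math.exp(seq_logprob(prompt, c)) for c in choices)
    print(f"{label}: mass on choices = {mass:.4f}")
```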
arXiv Detail & Related papers (2023-05-24T00:27:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.