Mitigating Easy Option Bias in Multiple-Choice Question Answering
- URL: http://arxiv.org/abs/2508.13428v1
- Date: Tue, 19 Aug 2025 01:03:45 GMT
- Title: Mitigating Easy Option Bias in Multiple-Choice Question Answering
- Authors: Hao Zhang, Chen Li, Basura Fernando
- Abstract summary: We observe an Easy-Options Bias (EOB) issue in some multiple-choice Visual Question Answering (VQA) benchmarks. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs. We introduce GroundAttack, a toolkit that automatically generates hard negative options that are as visually plausible as the correct answer.
- Score: 19.102900548627638
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this early study, we observe an Easy-Options Bias (EOB) issue in some multiple-choice Visual Question Answering (VQA) benchmarks such as MMStar, RealWorldQA, SEED-Bench, NExT-QA, STAR and Video-MME. This bias allows vision-language models (VLMs) to select the correct answer using only the vision (V) and options (O) as inputs, without the need for the question (Q). Through grounding experiments, we attribute the bias to an imbalance in visual relevance: the correct answer typically aligns more closely with the visual contents than the negative options do in feature space, creating a shortcut for VLMs to infer the answer via simple vision-option similarity matching. To fix this, we introduce GroundAttack, a toolkit that automatically generates hard negative options that are as visually plausible as the correct answer. We apply it to the NExT-QA and MMStar datasets, creating new EOB-free annotations. On these EOB-free annotations, current VLMs' accuracy approaches chance under the (V+O) setting and drops to non-saturated levels under the (V+Q+O) setting, providing a more realistic evaluation of VLMs' QA ability. Code and new annotations will be released soon.
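The vision-option shortcut described above can be illustrated with a minimal sketch: given a CLIP-style image embedding and embeddings of the candidate options, a question-blind "model" simply picks the option whose embedding is most similar to the image. All names, shapes, and the cosine-similarity choice here are illustrative assumptions, not the paper's released code.

```python
import numpy as np

def answer_by_similarity(image_feat: np.ndarray, option_feats: np.ndarray) -> int:
    """Question-blind EOB shortcut: pick the option whose embedding has the
    highest cosine similarity to the image embedding, ignoring the question."""
    img = image_feat / np.linalg.norm(image_feat)
    opts = option_feats / np.linalg.norm(option_feats, axis=1, keepdims=True)
    # Dot products of unit vectors = cosine similarities; argmax is the "answer".
    return int(np.argmax(opts @ img))
```

On an EOB-afflicted benchmark, such a baseline can score well above chance without ever seeing the question; hard negatives that are equally visually plausible are meant to make this argmax uninformative.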
Related papers
- Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models [2.393011821499345]
We investigate the presence and nature of selection bias in Large Vision-Language Models (LVLMs). We propose an inference-time logit-level debiasing method that estimates an ensemble bias vector from general and contextual prompts. Our method mitigates bias without retraining and is compatible with frozen LVLMs.
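The logit-level debiasing idea above can be sketched in a few lines; here the estimation "from general and contextual prompts" is reduced to averaging pre-collected bias logits, and every name below is a hypothetical stand-in rather than the authors' API.

```python
import numpy as np

def debias_logits(option_logits: np.ndarray, bias_samples: np.ndarray) -> np.ndarray:
    """Subtract an ensemble bias vector (the mean of logits gathered from
    content-free / general prompts) from the model's option logits."""
    bias = bias_samples.mean(axis=0)
    return option_logits - bias
```

Because the correction is a simple subtraction at inference time, it needs no gradient updates and therefore works with a frozen model, matching the abstract's claim.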
arXiv Detail & Related papers (2025-09-20T20:45:47Z) - Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation [69.81654421834989]
We introduce Auto, an agentic framework that automatically converts open-ended questions into multiple-choice format. Our experiments demonstrate that Auto can generate correct and challenging multiple-choice questions, with similar or lower model accuracy compared to human-created ones. We comprehensively evaluate 33 state-of-the-art vision language models (VLMs) on VMCBench, setting a new standard for scalable, consistent, and reproducible VLM evaluation.
arXiv Detail & Related papers (2025-01-06T18:57:31Z) - Addressing Blind Guessing: Calibration of Selection Bias in Multiple-Choice Question Answering by Video Language Models [16.252597615544317]
Video Language Models (VLMs) are designed to answer complex video-focused questions. Current benchmarks fail to capture the full reasoning capabilities of VLMs due to selection bias. This study is the first focused investigation of selection bias in video-to-text LLM-powered models.
arXiv Detail & Related papers (2024-10-18T07:52:22Z) - Trust but Verify: Programmatic VLM Evaluation in the Wild [62.14071929143684]
Programmatic VLM Evaluation (PROVE) is a new benchmarking paradigm for evaluating VLM responses to open-ended queries.
We benchmark the helpfulness-truthfulness trade-offs of a range of VLMs on PROVE, finding that very few are in fact able to achieve a good balance between the two.
arXiv Detail & Related papers (2024-10-17T01:19:18Z) - Multimodal Rationales for Explainable Visual Question Answering [12.893224628061516]
Visual Question Answering (VQA) is a challenging task of predicting the answer to a question about the content of an image. We propose a novel model termed MRVQA, which provides visual and textual rationales to support its predicted answers. MRVQA achieves new state-of-the-art results through additional rationale generation, enhancing the trustworthiness of the model.
arXiv Detail & Related papers (2024-02-06T11:07:05Z) - VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z) - Large Language Models Are Not Robust Multiple Choice Selectors [117.72712117510953]
Multiple choice questions (MCQs) serve as a common yet important task format in the evaluation of large language models (LLMs).
This work shows that modern LLMs are vulnerable to option position changes due to their inherent "selection bias".
We propose a label-free, inference-time debiasing method, called PriDe, which separates the model's prior bias for option IDs from the overall prediction distribution.
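PriDe's core arithmetic, separating a prior over option IDs from the observed prediction, can be sketched as dividing the answer distribution by the estimated ID prior and renormalizing. This is a simplification of the method with made-up variable names, not the paper's implementation.

```python
import numpy as np

def debias_with_prior(pred_probs: np.ndarray, id_prior: np.ndarray) -> np.ndarray:
    """Remove the model's prior preference for option IDs (A/B/C/D)
    from its predicted answer distribution, then renormalize."""
    adjusted = pred_probs / id_prior
    return adjusted / adjusted.sum()
```

The method is label-free in the sense that the prior can be estimated from the model's own predictions (e.g., by permuting option contents), so no gold answers are needed at debiasing time.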
arXiv Detail & Related papers (2023-09-07T17:44:56Z) - Can I Trust Your Answer? Visually Grounded Video Question Answering [88.11169242115416]
We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video-language understanding.
We construct NExT-GQA -- an extension of NExT-QA with 10.5K temporal grounding labels tied to the original QA pairs.
arXiv Detail & Related papers (2023-09-04T03:06:04Z) - Zero-Shot Video Question Answering via Frozen Bidirectional Language Models [89.71617065426146]
Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training.
Recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs.
We build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA.
arXiv Detail & Related papers (2022-06-16T13:18:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.