Reducing Text Bias in Synthetically Generated MCQAs for VLMs in Autonomous Driving
- URL: http://arxiv.org/abs/2602.17677v1
- Date: Wed, 28 Jan 2026 20:30:26 GMT
- Title: Reducing Text Bias in Synthetically Generated MCQAs for VLMs in Autonomous Driving
- Authors: Sutej Kulgod, Sean Ye, Sanchit Tanwar, Christoffer Heckman
- Abstract summary: Multiple Choice Question Answering (MCQA) benchmarks are an established standard for measuring Vision Language Model (VLM) performance in driving tasks. We show that a VLM fine-tuned on such data can achieve accuracy comparable to human-validated benchmarks even without visual input.
- Score: 1.6039614357284375
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multiple Choice Question Answering (MCQA) benchmarks are an established standard for measuring Vision Language Model (VLM) performance in driving tasks. However, we observe the known phenomenon that synthetically generated MCQAs are highly susceptible to hidden textual cues that allow models to exploit linguistic patterns rather than visual context. Our results show that a VLM fine-tuned on such data can achieve accuracy comparable to human-validated benchmarks even without visual input. Our proposed method reduces blind accuracy from +66.9% above random to +2.9%, eliminating the vast majority of exploitable textual shortcuts. By decoupling the correct answer from linguistic artifacts and employing a curriculum learning strategy, we force the model to rely on visual grounding, ensuring that performance accurately reflects perceptual understanding.
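The blind-accuracy probe behind the headline numbers is straightforward to reproduce. Below is a minimal sketch, not the authors' code: `model_answer` is a hypothetical callable wrapping any VLM, and passing `image=None` makes the evaluation text-only, so any accuracy above the random-guess baseline reflects textual shortcuts.

```python
import random

def blind_accuracy(model_answer, dataset):
    """Estimate MCQA accuracy with images withheld.

    model_answer is any callable (question, options, image) -> option index;
    image=None makes the evaluation "blind" (text-only).
    """
    correct = 0
    for item in dataset:
        pred = model_answer(item["question"], item["options"], image=None)
        correct += int(pred == item["answer_idx"])
    return correct / len(dataset)

def above_random(acc, num_options=4):
    # Report accuracy as percentage points above the random-guess baseline.
    return 100.0 * (acc - 1.0 / num_options)

# Toy usage with a guessing "model": expected to sit near +0.0 above random.
dataset = [{"question": "q", "options": ["a", "b", "c", "d"], "answer_idx": 0}
           for _ in range(1000)]
guesser = lambda q, opts, image: random.randrange(len(opts))
print(f"{above_random(blind_accuracy(guesser, dataset)):+.1f} pts above random")
```

A debiased benchmark should drive this number toward zero, which is exactly the +66.9% to +2.9% movement the abstract reports.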
Related papers
- Evaluating the encoding competence of visual language models using uncommon actions [5.816389980109022]
UAIT is a new evaluation benchmark designed to test the semantic understanding ability of visual language models (VLMs) in uncommon-sense action scenes. We synthesize high-quality uncommon-sense image-text samples using large language models, few-shot prompt engineering, and text-to-image generation. We evaluate multiple state-of-the-art visual language models and compare them with models based on contrastive learning.
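The synthesis pipeline (LLM writes uncommon-action descriptions from few-shot examples, a text-to-image model renders them) can be sketched as below; `llm` and `t2i` are hypothetical callables standing in for the paper's tooling, which is not published here.

```python
FEW_SHOT = """Write one short, plausible-looking but uncommon action scene.
Examples:
- a horse riding a person
- a fish walking a dog on a leash
New scene:"""

def synthesize_samples(llm, t2i, n=5):
    """Generate uncommon-sense image-text pairs: prompt an LLM few-shot for
    scene descriptions, then render each with a text-to-image model."""
    samples = []
    for _ in range(n):
        caption = llm(FEW_SHOT).strip().lstrip("- ")
        samples.append({"caption": caption, "image": t2i(caption)})
    return samples

# Toy usage with canned stand-ins for the two models.
fake_llm = lambda prompt: "- a cow reading a newspaper"
fake_t2i = lambda caption: f"<image rendered from: {caption!r}>"
print(synthesize_samples(fake_llm, fake_t2i, n=2))
```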
arXiv Detail & Related papers (2026-01-12T17:15:45Z)
- Benchmarking and Mitigating MCQA Selection Bias of Large Vision-Language Models [2.393011821499345]
We investigate the presence and nature of selection bias in Large Vision-Language Models (LVLMs). We propose an inference-time logit-level debiasing method that estimates an ensemble bias vector from general and contextual prompts. Our method mitigates bias without retraining and is compatible with frozen LVLMs.
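A rough sketch of logit-level debiasing with an estimated bias vector follows; this is a generic contextual-calibration-style illustration, not the paper's exact ensemble of general and contextual prompts, and `logits_fn` is a hypothetical black-box wrapper around a frozen LVLM.

```python
import numpy as np

def estimate_bias_vector(logits_fn, neutral_prompts, options):
    """Average the option logits a frozen LVLM assigns to content-free
    prompts; any systematic preference (e.g. for option "A") is bias."""
    logits = np.stack([logits_fn(p, options) for p in neutral_prompts])
    bias = logits.mean(axis=0)
    return bias - bias.mean()  # centre so debiasing is zero-sum

def debiased_choice(logits_fn, prompt, options, bias):
    # Subtract the estimated bias from the raw option logits at inference.
    return int(np.argmax(logits_fn(prompt, options) - bias))

# Toy usage: a "model" that always leans toward the first option.
rng = np.random.default_rng(0)
logits_fn = lambda p, opts: rng.normal(size=len(opts)) + np.array([2., 0., 0., 0.])
bias = estimate_bias_vector(logits_fn, ["N/A"] * 256, list("ABCD"))
print(debiased_choice(logits_fn, "Which object is moving?", list("ABCD"), bias))
```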
arXiv Detail & Related papers (2025-09-20T20:45:47Z)
- Quality-Driven Curation of Remote Sensing Vision-Language Data via Learned Scoring Models [11.114790704621427]
Vision-Language Models (VLMs) have demonstrated great potential in interpreting remote sensing (RS) images through language-guided semantics. We propose a novel score model trained on large-scale RS vision-language preference data for automated quality assessment. Our empirical results demonstrate that fine-tuning CLIP or advanced VLMs with the top 30% of data ranked by our score model achieves superior accuracy compared to both full-data fine-tuning and CLIP-score-based ranking approaches.
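Once the scorer exists, the curation step reduces to ranking and keeping a fixed fraction. A toy sketch, with the hypothetical `score_fn` standing in for the trained preference scorer:

```python
def top_fraction_by_score(samples, score_fn, fraction=0.3):
    """Rank image-text pairs with a learned quality scorer and keep the top
    fraction for fine-tuning (the paper reports 30% working best)."""
    ranked = sorted(samples, key=score_fn, reverse=True)
    keep = max(1, int(len(ranked) * fraction))
    return ranked[:keep]

# Toy usage with a stand-in scorer; a real one would embed image and caption.
samples = [{"image": f"img_{i}.png", "caption": f"cap {i}", "q": i % 7}
           for i in range(100)]
curated = top_fraction_by_score(samples, score_fn=lambda s: s["q"], fraction=0.3)
print(len(curated), "of", len(samples), "pairs kept for fine-tuning")
```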
arXiv Detail & Related papers (2025-03-02T05:44:56Z)
- SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
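The second-highest-score observation translates into a small fusion rule; below is a hedged sketch assuming per-label scores from several prompt variants (SPARC's full adaptive fusion is richer than this).

```python
def second_highest_fusion(score_sets):
    """Fuse per-label scores from several prompts by taking the
    second-highest value rather than the max, which the SPARC analysis
    found to be noisy for black-box VLMs."""
    fused = {}
    for label in score_sets[0]:
        vals = sorted((s[label] for s in score_sets), reverse=True)
        fused[label] = vals[1] if len(vals) > 1 else vals[0]
    return fused

# Toy usage: three prompt variants scoring the same three labels.
score_sets = [
    {"car": 0.91, "pedestrian": 0.40, "bicycle": 0.12},
    {"car": 0.55, "pedestrian": 0.48, "bicycle": 0.09},
    {"car": 0.60, "pedestrian": 0.35, "bicycle": 0.30},
]
print(second_highest_fusion(score_sets))  # each label keeps its 2nd-best score
```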
arXiv Detail & Related papers (2025-02-24T07:15:05Z)
- Calibrated Self-Rewarding Vision Language Models [27.686545023186852]
Large Vision-Language Models (LVLMs) have made substantial progress by integrating pre-trained large language models (LLMs) and vision models through instruction tuning.
LVLMs often exhibit the hallucination phenomenon, where generated text responses appear linguistically plausible but contradict the input image.
We propose the Calibrated Self-Rewarding (CSR) approach, which enables the model to self-improve by iteratively generating candidate responses, evaluating the reward for each response, and curating preference data for fine-tuning.
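One CSR-style round can be sketched as sample, score, curate; `generate` and `reward` are hypothetical stand-ins for the model's sampler and its calibrated self-reward, and this omits the paper's calibration details.

```python
import random

def csr_round(generate, reward, image, prompt, k=4):
    """One Calibrated Self-Rewarding-style round: sample k candidate
    responses, score each with the model's own reward, and emit a
    (chosen, rejected) preference pair for fine-tuning."""
    candidates = [generate(image, prompt) for _ in range(k)]
    scored = sorted(candidates, key=lambda r: reward(image, prompt, r))
    return {"prompt": prompt, "chosen": scored[-1], "rejected": scored[0]}

# Toy usage with stand-in generate/reward functions.
gen = lambda img, p: f"response-{random.randrange(100)}"
rew = lambda img, p, r: int(r.split("-")[1])  # pretend a higher id is better
pair = csr_round(gen, rew, image="frame.png", prompt="Describe the scene.")
print(pair)
```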
arXiv Detail & Related papers (2024-05-23T14:30:33Z)
- Multi-Modal Prompt Learning on Blind Image Quality Assessment [65.0676908930946]
Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly.
Traditional methods, hindered by a lack of sufficiently annotated data, have employed the CLIP image-text pretraining model as their backbone to gain semantic awareness.
Recent approaches have attempted to address this mismatch using prompt technology, but these solutions have shortcomings.
This paper introduces an innovative multi-modal prompt-based methodology for IQA.
arXiv Detail & Related papers (2024-04-23T11:45:32Z)
- Evaluating Generative Language Models in Information Extraction as Subjective Question Correction [49.729908337372436]
Inspired by the principles of subjective question correction, we propose a new evaluation method, SQC-Score.
Results on three information extraction tasks show that SQC-Score is more preferred by human annotators than the baseline metrics.
arXiv Detail & Related papers (2024-04-04T15:36:53Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 points on A-OKVQA and VizWiz, respectively.
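A gradient-free RepARe-style loop can be approximated as: caption, fold the details into rephrased questions, and pick by the model's own confidence. In the sketch below, `vlm_caption` and `vlm_answer_conf` are hypothetical wrappers, and the real system's candidate generation and selection are more elaborate.

```python
def repare_rephrase(vlm_caption, vlm_answer_conf, image, question):
    """Extract salient details with the VLM itself, build rephrased
    question candidates, and keep the one answered most confidently."""
    details = vlm_caption(image)
    candidates = [
        question,  # the original question stays in the pool
        f"Given that the image shows {details}, {question}",
        f"{question} Note that the image shows {details}.",
    ]
    return max(candidates, key=lambda q: vlm_answer_conf(image, q))

# Toy usage with stand-ins for the two VLM calls.
cap = lambda img: "a cyclist crossing at a red light"
conf = lambda img, q: len(q)  # pretend longer prompts score higher
print(repare_rephrase(cap, conf, "frame.png", "Who has right of way?"))
```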
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- The Devil is in the Errors: Leveraging Large Language Models for Fine-grained Machine Translation Evaluation [93.01964988474755]
AutoMQM is a prompting technique which asks large language models to identify and categorize errors in translations.
We study the impact of labeled data through in-context learning and finetuning.
We then evaluate AutoMQM with PaLM-2 models, and we find that it improves performance compared to just prompting for scores.
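The AutoMQM recipe (annotate errors, then convert the annotations to a score) fits in a few lines. This sketch assumes a hypothetical `llm` callable and the common minor=1/major=5 MQM severity weighting, which may differ from the paper's exact scheme.

```python
import json

MQM_WEIGHTS = {"minor": 1, "major": 5}

PROMPT = """List the translation errors as JSON:
[{{"span": "...", "category": "...", "severity": "minor" or "major"}}]
Source: {src}
Translation: {hyp}"""

def automqm_score(llm, src, hyp):
    """AutoMQM-style scoring: have an LLM identify and categorize errors,
    then convert the annotations into a (negative) MQM penalty."""
    errors = json.loads(llm(PROMPT.format(src=src, hyp=hyp)))
    return -sum(MQM_WEIGHTS.get(e["severity"], 1) for e in errors)

# Toy usage with a canned "LLM" response.
fake_llm = lambda prompt: (
    '[{"span": "bank", "category": "mistranslation", "severity": "major"}]')
print(automqm_score(fake_llm, "Er ging zur Bank.", "He went to the bank."))  # -5
```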
arXiv Detail & Related papers (2023-08-14T17:17:21Z)
- Revisiting the Role of Language Priors in Vision-Language Models [90.0317841097143]
Vision-language models (VLMs) are applied to a variety of visual understanding tasks in a zero-shot fashion, without any fine-tuning.
We study generative VLMs that are trained for next-word generation given an image.
We explore their zero-shot performance on the illustrative task of image-text retrieval across 8 popular vision-language benchmarks.
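Retrieval with a generative VLM reduces to scoring each caption by its likelihood under next-word generation. A minimal sketch with a hypothetical `loglik_fn`, using length normalization, which is one common way such studies control for language priors favoring short or generic captions:

```python
def retrieval_score(log_probs):
    """Score an image-caption pair by the caption's length-normalized
    log-likelihood under a generative VLM (next-word prediction)."""
    return sum(log_probs) / len(log_probs)

def rank_captions(loglik_fn, image, captions):
    # loglik_fn(image, caption) -> list of per-token log-probabilities
    return sorted(captions,
                  key=lambda c: retrieval_score(loglik_fn(image, c)),
                  reverse=True)

# Toy usage with a stand-in likelihood function.
fake = {"a dog on grass": [-0.2, -0.3], "a cat indoors": [-1.5, -2.0]}
loglik = lambda img, cap: fake[cap]
print(rank_captions(loglik, "photo.jpg", list(fake)))
```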
arXiv Detail & Related papers (2023-06-02T19:19:43Z)