Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
- URL: http://arxiv.org/abs/2602.23351v1
- Date: Thu, 26 Feb 2026 18:54:06 GMT
- Title: Scale Can't Overcome Pragmatics: The Impact of Reporting Bias on Vision-Language Reasoning
- Authors: Amita Kamath, Jack Hessel, Khyathi Chandu, Jena D. Hwang, Kai-Wei Chang, Ranjay Krishna
- Abstract summary: The lack of reasoning capabilities in Vision-Language Models has remained at the forefront of research discourse. We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics.
- Score: 79.95774256444956
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The lack of reasoning capabilities in Vision-Language Models (VLMs) has remained at the forefront of research discourse. We posit that this behavior stems from a reporting bias in their training data. That is, how people communicate about visual content by default omits tacit information needed to supervise some types of reasoning; e.g., "at the game today!" is a more likely caption than "a photo of 37 people standing behind a field". We investigate the data underlying the popular VLMs OpenCLIP, LLaVA-1.5 and Molmo through the lens of theories from pragmatics, and find that reporting bias results in insufficient representation of four reasoning skills (spatial, temporal, negation, and counting), despite the corpora being of web-scale, and/or synthetically generated. With a set of curated benchmarks, we demonstrate that: (i) VLMs perform poorly on the aforementioned types of reasoning suppressed in the training data by reporting bias; (ii) contrary to popular belief, scaling data size, model size, and to multiple languages does not result in emergence of these skills by default; but, promisingly, (iii) incorporating annotations specifically collected to obtain tacit information is effective. Our findings highlight the need for more intentional training data curation methods, rather than counting on scale for emergence of reasoning capabilities.
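The abstract's claim that reporting bias leaves spatial, temporal, negation, and counting supervision under-represented can be made concrete with a simple frequency probe. The sketch below is not the authors' methodology; it is a minimal illustration, assuming made-up cue-word lists and toy captions, of how one might estimate how rarely everyday captions surface these skill types.

```python
# Toy probe (not the paper's actual analysis): estimate how often captions in a
# corpus give explicit evidence for the four skill types named in the abstract
# (spatial, temporal, negation, counting) via cue-word matching. The cue lists
# and sample captions are illustrative assumptions only.
import re
from collections import Counter

CUES = {
    "spatial":  ["left", "right", "above", "below", "behind", "in front of", "next to"],
    "temporal": ["before", "after", "while", "during", "then"],
    "negation": ["no ", "not ", "without", "none", "never"],
    "counting": [r"\b(one|two|three|four|five|\d+)\b"],
}

def cue_hits(caption: str) -> Counter:
    """Count which skill types a single caption gives explicit evidence for."""
    hits = Counter()
    text = caption.lower()
    for skill, patterns in CUES.items():
        if any(re.search(p, text) for p in patterns):
            hits[skill] += 1
    return hits

# Tiny illustrative corpus echoing the abstract's example captions.
captions = [
    "at the game today!",                            # omits counts, space, time
    "a photo of 37 people standing behind a field",  # explicit count + spatial cue
    "sunset over the bay",
]

totals = Counter()
for c in captions:
    totals += cue_hits(c)

for skill in CUES:
    print(f"{skill:9s}: {totals[skill]}/{len(captions)} captions")
```

Run over a real caption corpus, such a count would show which skill types are systematically omitted, which is the kind of gap the paper attributes to pragmatics-driven reporting bias.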
Related papers
- Seeing to Act, Prompting to Specify: A Bayesian Factorization of Vision Language Action Policy [59.44168425139687]
BayesVLA is a Bayesian factorization that decomposes the policy into a visual-action prior, supporting seeing-to-act, and a language-conditioned likelihood, enabling prompt-to-specify. Experiments show superior generalization to unseen instructions, objects, and environments compared to existing methods.
arXiv Detail & Related papers (2025-12-12T01:59:23Z) - Clarification as Supervision: Reinforcement Learning for Vision-Language Interfaces [14.074625212174494]
We propose Adaptive-Clarification Reinforcement Learning (AC-RL), which teaches vision models what information reasoners need through interaction. AC-RL improves average accuracy by 4.4 points over pretrained baselines across seven visual mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-09-30T17:46:46Z) - Analyzing and Mitigating Object Hallucination: A Training Bias Perspective [108.09666587800781]
We introduce a new benchmark, POPEv2, which consists of counterfactual images collected from the training data of LVLMs with certain objects masked. We find that current LVLMs suffer from training bias: they fail to fully leverage their training data and hallucinate more frequently on images seen during training. We propose Obliviate, an efficient and lightweight unlearning method designed to mitigate object hallucination via training bias unlearning.
arXiv Detail & Related papers (2025-08-06T15:51:02Z) - Can Reasoning Help Large Language Models Capture Human Annotator Disagreement? [84.32752330104775]
Variation in human annotation (i.e., disagreements) is common in NLP. We evaluate the influence of different reasoning settings on Large Language Model disagreement modeling. Surprisingly, our results show that RLVR-style reasoning degrades performance in disagreement modeling.
arXiv Detail & Related papers (2025-06-24T09:49:26Z) - Counterfactual reasoning: an analysis of in-context emergence [57.118735341305786]
We show that language models are capable of counterfactual reasoning. We find that self-attention, model depth and pre-training data diversity drive performance. Our findings extend to counterfactual reasoning under SDE dynamics.
arXiv Detail & Related papers (2025-06-05T16:02:07Z) - Seeing Through Deception: Uncovering Misleading Creator Intent in Multimodal News with Vision-Language Models [65.23999399834638]
We introduce DeceptionDecoded, a benchmark of 12,000 image-caption pairs grounded in trustworthy reference articles. The dataset captures both misleading and non-misleading cases, spanning manipulations across visual and textual modalities. It supports three intent-centric tasks: misleading intent detection, misleading source attribution, and creator desire inference.
arXiv Detail & Related papers (2025-05-21T13:14:32Z) - Leveraging VLM-Based Pipelines to Annotate 3D Objects [68.51034848207355]
We propose an alternative algorithm to marginalize over factors such as the viewpoint that affect the VLM's response.
Instead of merging text-only responses, we utilize the VLM's joint image-text likelihoods (a generic sketch of this marginalization idea appears after the list below).
We show how a VLM-based pipeline can be leveraged to produce reliable annotations for 764K objects from the Objaverse dataset.
arXiv Detail & Related papers (2023-11-29T17:54:22Z) - Probing LLMs for hate speech detection: strengths and vulnerabilities [8.626059038321724]
We utilise different prompt variations and input information, and evaluate large language models in a zero-shot setting.
We select three large language models (GPT-3.5, text-davinci and Flan-T5) and three datasets - HateXplain, implicit hate and ToxicSpans.
We find that on average including the target information in the pipeline improves the model performance substantially.
arXiv Detail & Related papers (2023-10-19T16:11:02Z) - Enhance Reasoning Ability of Visual-Language Models via Large Language Models [7.283533791778359]
We propose a method called TReE, which transfers the reasoning ability of a large language model to a visual language model in zero-shot scenarios.
TReE contains three stages: observation, thinking, and re-thinking.
arXiv Detail & Related papers (2023-05-22T17:33:44Z) - Do Language Embeddings Capture Scales? [54.1633257459927]
We show that pretrained language models capture a significant amount of information about the scalar magnitudes of objects.
We identify contextual information in pre-training and numeracy as two key factors affecting their performance.
arXiv Detail & Related papers (2020-10-11T21:11:09Z)
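As a companion to the viewpoint-marginalization idea mentioned for the 3D-annotation pipeline above, the following is a generic sketch rather than that paper's actual code: it assumes a placeholder matrix of per-view joint image-text likelihoods and simply averages them across rendered viewpoints before choosing a candidate annotation.

```python
# Generic illustration (not the "Leveraging VLM-Based Pipelines" implementation):
# choose the best candidate annotation for a 3D object by marginalizing per-view
# image-text likelihoods over several rendered viewpoints, instead of merging
# per-view text answers. The likelihood values below are placeholder data.
import numpy as np

# scores[v, c] ~ VLM joint image-text likelihood of candidate caption c
# given the render from viewpoint v (made-up numbers for illustration).
scores = np.array([
    [0.70, 0.20, 0.10],   # viewpoint 0
    [0.15, 0.60, 0.25],   # viewpoint 1 (occluded / misleading view)
    [0.55, 0.30, 0.15],   # viewpoint 2
])
candidates = ["a wooden chair", "a small table", "a metal stool"]

# Marginalize over viewpoints: average the per-view likelihoods per candidate,
# so no single (possibly misleading) view dominates the final annotation.
marginal = scores.mean(axis=0)
best = candidates[int(np.argmax(marginal))]
print(dict(zip(candidates, marginal.round(3))), "->", best)
```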
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.