Can VLMs Recall Factual Associations From Visual References?
- URL: http://arxiv.org/abs/2508.18297v1
- Date: Fri, 22 Aug 2025 16:47:37 GMT
- Title: Can VLMs Recall Factual Associations From Visual References?
- Authors: Dhananjay Ashok, Ashutosh Chaubey, Hirona J. Arai, Jonathan May, Jesse Thomason
- Abstract summary: We identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge. We show that such linking failures are correlated with the expression of distinct patterns in model internal states.
- Score: 30.821053378797007
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Through a controlled study, we identify a systematic deficiency in the multimodal grounding of Vision Language Models (VLMs). While VLMs can recall factual associations when given a textual reference to an entity, their ability to do so is significantly diminished when the reference is visual instead. Forcing VLMs to rely on image representations of an entity halves their ability to recall factual knowledge, suggesting that VLMs struggle to link their internal knowledge of an entity with its image representation. We show that such linking failures are correlated with the expression of distinct patterns in model internal states, and that probes on these internal states achieve over 92% accuracy at flagging cases where the VLM response is unreliable. These probes can be applied, without retraining, to identify when a VLM will fail to correctly answer a question that requires an understanding of multimodal input. When used to facilitate selective prediction on a visual question answering task, the probes increase coverage by 7.87% (absolute) while also reducing the risk of error by 0.9% (absolute). Addressing this systematic, detectable deficiency is an important avenue in language grounding, and we provide informed recommendations for future directions.
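Below is a minimal sketch of the probing-and-selective-prediction idea the abstract describes: fit a lightweight classifier on a VLM's internal states to flag unreliable answers, then abstain when the probe's confidence is low. The feature choice (a single layer's final-token hidden state), the logistic-regression probe, and the threshold are illustrative assumptions, not the authors' released implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def train_reliability_probe(hidden_states: np.ndarray, correct: np.ndarray):
    """Fit a linear probe predicting whether the VLM answered correctly.

    hidden_states: (n_examples, d_model) activations, e.g. the hidden state
        at the final input token of one chosen intermediate layer (assumed).
    correct: (n_examples,) binary labels, 1 if the VLM's answer was right.
    """
    X_tr, X_val, y_tr, y_val = train_test_split(
        hidden_states, correct, test_size=0.2, random_state=0)
    probe = LogisticRegression(max_iter=1000)
    probe.fit(X_tr, y_tr)
    print(f"probe validation accuracy: {probe.score(X_val, y_val):.3f}")
    return probe

def selective_predict(probe, hidden_state: np.ndarray, answer: str,
                      threshold: float = 0.5):
    """Return the VLM's answer only when the probe deems it reliable."""
    p_reliable = probe.predict_proba(hidden_state.reshape(1, -1))[0, 1]
    return answer if p_reliable >= threshold else None  # None = abstain
```

Raising `threshold` trades coverage for risk; the 7.87% coverage gain at 0.9% lower risk quoted above corresponds to one such operating point.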
Related papers
- Unbiased Visual Reasoning with Controlled Visual Inputs [28.155426761798022]
VISTA is a framework that decouples perception from reasoning via an explicit information bottleneck. A frozen VLM sensor is restricted to short, objective perception queries. A text-only LLM reasoner decomposes each question, plans queries, and aggregates visual facts in natural language.
arXiv Detail & Related papers (2025-12-19T18:52:06Z)
- Seeing but Not Believing: Probing the Disconnect Between Visual Attention and Answer Correctness in VLMs [72.8370367403852]
Vision-Language Models (VLMs) achieve strong results on multimodal tasks such as visual question answering, yet they can still fail even when the correct visual evidence is present. We show that shallow layers focus primarily on text, while deeper layers sparsely but reliably attend to localized evidence regions. We introduce an inference-time intervention that highlights deep-layer evidence regions through selective attention-based masking (a hedged sketch of this style of intervention appears after the list below).
arXiv Detail & Related papers (2025-10-20T17:31:09Z)
- Hidden in plain sight: VLMs overlook their visual representations [48.83628674170634]
We compare vision language models (VLMs) to their visual encoders to understand their ability to integrate across these modalities. We find that VLMs perform substantially worse than their visual encoders, dropping to near-chance performance.
arXiv Detail & Related papers (2025-06-09T17:59:54Z)
- Right this way: Can VLMs Guide Us to See More to Answer Questions? [11.693356269848517]
In question-answering scenarios, humans assess whether the available information is sufficient and seek additional information if necessary.
In contrast, Vision Language Models (VLMs) typically generate direct, one-shot responses without evaluating the sufficiency of the information.
This study demonstrates the potential to narrow the gap between information assessment and acquisition in VLMs, bringing their performance closer to humans.
arXiv Detail & Related papers (2024-11-01T06:43:54Z)
- MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection [107.15164718585666]
We investigate the root cause of VLMs' biased predictions in the open-vocabulary detection setting.
Our observations lead to a simple yet effective paradigm, called MarvelOVD, that generates significantly better training targets.
Our method outperforms other state-of-the-art approaches by significant margins.
arXiv Detail & Related papers (2024-07-31T09:23:57Z)
- BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models [20.697019266074747]
Vision language models (VLMs) perceive the world through a combination of a visual encoder and a large language model (LLM).
Recent studies show that VLMs are vulnerable to hallucination.
We introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID).
arXiv Detail & Related papers (2024-07-18T12:11:12Z)
- Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves? [61.899791071654654]
We investigate whether Vision-Language Models (VLMs) can improve their semantic grounding by "receiving" feedback. We find that, if prompted appropriately, VLMs can utilize feedback both in a single step and iteratively. We show that grounding accuracy consistently improves with automated feedback across all models and settings investigated.
arXiv Detail & Related papers (2024-04-09T17:59:04Z)
- LLMs' Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements [59.71218039095155]
The task of reading comprehension (RC) provides a primary means to assess language models' natural language understanding (NLU) capabilities. If the context aligns with the models' internal knowledge, it is hard to discern whether the models' answers stem from context comprehension or from internal information. To address this issue, we suggest using RC on imaginary data, based on fictitious facts and entities.
arXiv Detail & Related papers (2024-04-09T13:08:56Z)
- ViCor: Bridging Visual Understanding and Commonsense Reasoning with Large Language Models [27.5219975853389]
We find that pre-trained vision-and-language models (VLMs) and large language models (LLMs) are good at different kinds of visual commonsense reasoning problems.
For problems where the goal is to infer conclusions beyond image content, VLMs face difficulties, while LLMs, given sufficient visual evidence, can use commonsense to infer the answer well.
arXiv Detail & Related papers (2023-10-09T17:10:35Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe yields a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
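As referenced above for "Seeing but Not Believing", here is a self-contained sketch of an attention-based masking intervention in that general style: read out attention from a deep layer, keep only strongly-attended image patches, and down-weight the rest. The tensor shapes, the head-averaging, and the 90th-percentile threshold are all illustrative assumptions, not that paper's actual procedure.

```python
import torch

def evidence_mask_from_attention(attn: torch.Tensor, quantile: float = 0.9):
    """attn: (n_heads, n_patches) attention from the answer token to image
    patches at one deep layer. Returns a binary (n_patches,) evidence mask."""
    patch_scores = attn.mean(dim=0)                    # average over heads
    threshold = torch.quantile(patch_scores, quantile)
    return (patch_scores >= threshold).float()

def apply_mask(patch_embeds: torch.Tensor, mask: torch.Tensor,
               downweight: float = 0.1):
    """Emphasize evidence patches by scaling down all other patches.
    patch_embeds: (n_patches, d_model)."""
    scale = mask + downweight * (1.0 - mask)           # 1.0 kept, 0.1 damped
    return patch_embeds * scale.unsqueeze(-1)

# Toy usage with random tensors (16 heads, a 14x14 patch grid, d_model=768):
attn = torch.rand(16, 196)
patches = torch.randn(196, 768)
patches_intervened = apply_mask(patches, evidence_mask_from_attention(attn))
```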
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.