Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
- URL: http://arxiv.org/abs/2511.17722v1
- Date: Fri, 21 Nov 2025 19:18:41 GMT
- Title: Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
- Authors: Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, Donald E. Brown
- Abstract summary: Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to determine how counting performance varies as image and prompt properties change. We implement attention-based interventions to modulate focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions.
- Score: 0.4934817254755008
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent research suggests that Vision Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require them to focus on particular areas of the image in tasks such as counting. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically determine how counting performance varies as image and prompt properties change. Using open-source VLMs, we then analyze how attention allocation fluctuates with varying input parameters (e.g., number of objects in the image, object color, background color, object texture, background texture, and prompt specificity). We further implement attention-based interventions to modulate focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions. Our experiments reveal that while counting remains challenging for VLMs, especially under high visual or linguistic complexity, certain attention interventions can lead to modest gains in counting performance.
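The intervention described in the abstract, upweighting attention on visual tokens at selected layers, can be illustrated with a small self-contained sketch. The function name, the idea of adding a constant bias to visual-token attention logits before the softmax, and the bias value below are illustrative assumptions, not the paper's exact procedure.

```python
# Minimal sketch of an attention-based intervention: upweight the attention paid
# to visual tokens before the softmax. The bias value and the choice of which
# layers to modify are assumptions for illustration only.
import torch
import torch.nn.functional as F

def boost_visual_attention(attn_logits: torch.Tensor,
                           visual_mask: torch.Tensor,
                           bias: float = 1.0) -> torch.Tensor:
    """Re-normalize attention after adding a positive bias to visual-token columns.

    attn_logits: (batch, heads, query_len, key_len) pre-softmax scores.
    visual_mask: (key_len,) boolean, True where the key position is an image token.
    """
    boosted = attn_logits + bias * visual_mask.to(attn_logits.dtype)
    return F.softmax(boosted, dim=-1)

# Toy usage: 1 sequence, 2 heads, 6 tokens of which positions 1-3 are visual.
logits = torch.randn(1, 2, 6, 6)
visual_mask = torch.tensor([False, True, True, True, False, False])
attn = boost_visual_attention(logits, visual_mask, bias=2.0)
print(attn.sum(dim=-1))  # each attention row still sums to 1 after the intervention
```

In a real model this adjustment would be applied inside chosen decoder layers of an open-source VLM (for example via forward hooks), with the layer range and bias strength treated as tunable parameters.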
Related papers
- Examining Vision Language Models through Multi-dimensional Experiments with Vision and Text Features [0.4934817254755008]
Vision Language Models (VLMs) rely on inherent biases learned during training to respond to questions about visual properties of an image.
This research aims to learn how the behavior of vision language models changes as vision and text features vary, and to explore methods for characterizing such changes.
arXiv Detail & Related papers (2025-09-10T03:49:40Z)
- Focusing by Contrastive Attention: Enhancing VLMs' Visual Reasoning [79.34909830834464]
Vision-Language Models (VLMs) have demonstrated remarkable success across diverse visual tasks, yet their performance degrades in complex visual environments.
We show that visual complexity strongly correlates with attention entropy, negatively impacting reasoning performance.
We propose Contrastive Attention Refinement for Visual Enhancement (CARVE), a training-free method that extracts task-relevant visual signals through attention contrasting at the pixel level.
arXiv Detail & Related papers (2025-09-08T09:20:04Z)
- Same Task, Different Circuits: Disentangling Modality-Specific Mechanisms in VLMs [49.42020616826156]
Vision-Language models (VLMs) show impressive abilities to answer questions on visual inputs, yet demonstrate higher accuracy when performing an analogous task on text.
We investigate this accuracy gap by identifying and comparing the circuits used in each modality.
To overcome this, we patch the representations of visual data tokens from later layers back into earlier layers (a minimal sketch of this patching idea appears after this list).
arXiv Detail & Related papers (2025-06-10T17:59:21Z)
- BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models [2.526146573337397]
We propose a new evaluation methodology, inspired by ophthalmologic diagnostics.
We use procedural generation of synthetic images to obtain control over visual attributes (a generation sketch in this spirit appears after this list).
This diagnostic allows systematic stress testing and fine-grained failure analysis.
arXiv Detail & Related papers (2025-06-05T12:43:10Z)
- Vision language models are unreliable at trivial spatial cognition [0.2902243522110345]
Vision language models (VLMs) are designed to extract relevant visuospatial information from images.
We develop a benchmark dataset -- TableTest -- whose images depict 3D scenes of objects arranged on a table, and use it to evaluate state-of-the-art VLMs.
Results show that performance can be degraded by minor variations of prompts that use equivalent descriptions.
arXiv Detail & Related papers (2025-04-22T17:38:01Z)
- V-MAGE: A Game Evaluation Framework for Assessing Vision-Centric Capabilities in Multimodal Large Language Models [84.27290155010533]
We introduce Vision-centric Multiple Abilities Game Evaluation (V-MAGE), a novel game-based evaluation framework.
V-MAGE features five distinct video games comprising over 30 carefully constructed evaluation scenarios.
We show V-MAGE provides actionable insights for improving the visual and reasoning capabilities of MLLMs in dynamic, interactive settings.
arXiv Detail & Related papers (2025-04-08T15:43:01Z)
- Teaching VLMs to Localize Specific Objects from In-context Examples [56.797110842152]
We find that present-day Vision-Language Models (VLMs) lack a fundamental cognitive ability: learning to localize specific objects in a scene by taking into account the context.
This work is the first to explore and benchmark personalized few-shot localization for VLMs.
arXiv Detail & Related papers (2024-11-20T13:34:22Z)
- VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models [19.291697178628546]
Vision language models (VLMs) have shown promising reasoning capabilities across various benchmarks.
In this work, we propose an eye examination process to investigate how a VLM perceives images.
arXiv Detail & Related papers (2024-09-23T07:15:29Z)
- Losing Visual Needles in Image Haystacks: Vision Language Models are Easily Distracted in Short and Long Contexts [65.04791072532106]
We present LoCoVQA, a benchmark generator for evaluating long-context extractive reasoning in vision language models (VLMs).
LoCoVQA augments test examples for mathematical reasoning, VQA, and character recognition tasks with increasingly long visual contexts.
This test assesses how well VLMs can ignore irrelevant information when answering queries.
arXiv Detail & Related papers (2024-06-24T17:58:03Z)
- Task Formulation Matters When Learning Continually: A Case Study in Visual Question Answering [58.82325933356066]
Continual learning aims to train a model incrementally on a sequence of tasks without forgetting previous knowledge.
We present a detailed study of how different settings affect performance for Visual Question Answering.
arXiv Detail & Related papers (2022-09-30T19:12:58Z)
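As referenced in the "Same Task, Different Circuits" entry above, back-patching copies visual-token representations from a later layer into an earlier layer and reruns the remaining computation. Below is a minimal sketch of that idea on a toy encoder stack; the toy model, the layer indices, and the splicing mechanics are assumptions made purely for illustration, not the paper's implementation.

```python
# Sketch of back-patching: splice the visual-token hidden states produced by a
# LATER layer into the residual stream at an EARLIER layer on a second pass.
import torch
import torch.nn as nn

torch.manual_seed(0)
layers = nn.ModuleList(
    nn.TransformerEncoderLayer(d_model=32, nhead=4, dropout=0.0, batch_first=True)
    for _ in range(4)
)
layers.eval()
x = torch.randn(1, 10, 32)          # (batch, seq, dim) toy input
visual_pos = torch.arange(2, 6)     # pretend tokens 2-5 are image tokens

def run(x, patch=None):
    """patch = (early_layer_idx, cached_hidden) to overwrite visual positions."""
    h, cache = x, []
    for i, layer in enumerate(layers):
        h = layer(h)
        if patch is not None and i == patch[0]:
            h = h.clone()
            h[:, visual_pos] = patch[1][:, visual_pos]  # splice in later-layer states
        cache.append(h)
    return h, cache

with torch.no_grad():
    clean_out, cache = run(x)                        # pass 1: record every layer's output
    patched_out, _ = run(x, patch=(1, cache[3]))     # pass 2: layer-3 states into layer 1
print((patched_out - clean_out).abs().max())         # patching changes the final output
```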
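Finally, as referenced in the BYO-Eval entry, several of the benchmarks above (including the main paper's) rely on procedurally generated images whose attributes such as object count, object color, and background color are controlled so the ground-truth count is known by construction. The sketch below shows one way such a generator could look using Pillow; the shapes, sizes, and placement logic are assumptions, not any specific benchmark's recipe.

```python
# Minimal sketch of a controllable counting-image generator (requires Pillow).
import random
from PIL import Image, ImageDraw

def make_counting_image(n_objects: int,
                        object_color: str = "red",
                        background_color: str = "white",
                        size: int = 256,
                        radius: int = 14,
                        seed: int = 0) -> Image.Image:
    """Draw n_objects non-overlapping circles; the requested count is the label."""
    rng = random.Random(seed)
    img = Image.new("RGB", (size, size), background_color)
    draw = ImageDraw.Draw(img)
    centers = []
    while len(centers) < n_objects:
        x, y = rng.randint(radius, size - radius), rng.randint(radius, size - radius)
        # keep circles from overlapping so every object is individually countable
        if all((x - cx) ** 2 + (y - cy) ** 2 > (2 * radius) ** 2 for cx, cy in centers):
            centers.append((x, y))
            draw.ellipse((x - radius, y - radius, x + radius, y + radius),
                         fill=object_color)
    return img

img = make_counting_image(n_objects=7, object_color="blue", background_color="gray")
img.save("count_7_blue_on_gray.png")
```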