CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
- URL: http://arxiv.org/abs/2504.15485v1
- Date: Mon, 21 Apr 2025 23:38:43 GMT
- Title: CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
- Authors: Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
- Abstract summary: Counting Amodally for Patterns Through Unseen REgions (CAPTURe) is a testbed for evaluating vision-language models. We evaluate four strong vision-language models on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns.
- Score: 59.830657530592255
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object that blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns, and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information about occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion and from difficulty counting in images.
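To make the task setup concrete, below is a minimal, hypothetical sketch in the spirit of CAPTURe-synthetic (not the authors' actual generation or evaluation code): it draws a regular grid of dots, hides part of the grid behind a rectangular occluder, and records the full-pattern count a model would need to infer. The function name, pattern layout, occluder shape, and prompt wording are illustrative assumptions.

```python
# Minimal sketch of a CAPTURe-synthetic-style example (assumption: the real
# benchmark's patterns, occluders, and prompts may differ).
from PIL import Image, ImageDraw

def make_occluded_grid(rows=4, cols=5, cell=60, occluder_box=(150, 60, 290, 200)):
    """Draw a rows x cols grid of dots, then cover part of it with a gray rectangle.

    Returns the image and the ground-truth count of ALL dots, including the
    occluded ones -- the quantity a model must infer by continuing the pattern.
    """
    img = Image.new("RGB", (cols * cell + 40, rows * cell + 40), "white")
    draw = ImageDraw.Draw(img)
    for r in range(rows):
        for c in range(cols):
            x = 20 + c * cell + cell // 2
            y = 20 + r * cell + cell // 2
            draw.ellipse((x - 10, y - 10, x + 10, y + 10), fill="red")
    draw.rectangle(occluder_box, fill="gray")  # the occluder hides some dots
    return img, rows * cols

img, true_count = make_occluded_grid()
img.save("capture_synthetic_example.png")
print("ground-truth count:", true_count)
```

A VLM would then be prompted with a question such as "The gray box covers part of the pattern. How many red dots are there in total, including hidden ones?" and its numeric answer compared to the ground-truth count, for example via absolute error.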
Related papers
- Towards a Systematic Evaluation of Hallucinations in Large-Vision Language Models [57.58426038241812]
Large Vision-Language Models (LVLMs) have demonstrated remarkable performance in complex multimodal tasks.
These models still suffer from hallucinations when required to implicitly recognize or infer diverse visual entities from images.
We propose a novel visual question answering (VQA) benchmark that employs contextual reasoning prompts as hallucination attacks.
arXiv Detail & Related papers (2024-12-29T23:56:01Z)
- VisMin: Visual Minimal-Change Understanding [7.226130826257802]
We introduce a new, challenging benchmark termed Visual Minimal-Change Understanding (VisMin).
VisMin requires models to predict the correct image-caption match given two images and two captions.
We build an automatic framework using large language models and diffusion models, followed by a rigorous 4-step verification process by human annotators.
arXiv Detail & Related papers (2024-07-23T18:10:43Z)
- AutoHallusion: Automatic Generation of Hallucination Benchmarks for Vision-Language Models [91.78328878860003]
Large vision-language models (LVLMs) are prone to hallucinations.
Benchmarks often rely on hand-crafted corner cases whose failure patterns may not generalize well.
We develop AutoHallusion, the first automated benchmark generation approach.
arXiv Detail & Related papers (2024-06-16T11:44:43Z)
- OLIVE: Object Level In-Context Visual Embeddings [8.168219870640318]
We propose a novel method to prompt large language models with in-context visual object vectors.
This eliminates the necessity of fusing a lengthy array of image patch features and significantly speeds up training.
Our experiments reveal that our method achieves competitive referring object classification and captioning performance.
arXiv Detail & Related papers (2024-06-02T21:36:31Z)
- Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z)
- Negative Object Presence Evaluation (NOPE) to Measure Object Hallucination in Vision-Language Models [67.8024390595066]
NOPE (Negative Object Presence Evaluation) is a novel benchmark designed to assess object hallucination in vision-language (VL) models.
We extensively investigate the performance of 10 state-of-the-art VL models in discerning the non-existence of objects in visual questions.
arXiv Detail & Related papers (2023-10-09T01:52:27Z)
- LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)
- LLM2Loss: Leveraging Language Models for Explainable Model Diagnostics [5.33024001730262]
We propose an approach that can provide semantic insights into a model's patterns of failures and biases.
We show that an ensemble of such lightweight models can be used to generate insights on the performance of the black-box model.
arXiv Detail & Related papers (2023-05-04T23:54:37Z)
- Benchmarking Spatial Relationships in Text-to-Image Generation [102.62422723894232]
We investigate the ability of text-to-image models to generate correct spatial relationships among objects.
We present VISOR, an evaluation metric that captures how accurately the spatial relationship described in text is generated in the image.
Our experiments reveal a surprising finding that, although state-of-the-art T2I models exhibit high image quality, they are severely limited in their ability to generate multiple objects or the specified spatial relations between them.
arXiv Detail & Related papers (2022-12-20T06:03:51Z)
- Unsupervised Object Learning via Common Fate [61.14802390241075]
Learning generative object models from unlabelled videos is a long-standing problem that is required for causal scene modeling.
We decompose this problem into three easier subtasks, and provide candidate solutions for each of them.
We show that our approach allows learning generative models that generalize beyond the occlusions present in the input videos.
arXiv Detail & Related papers (2021-10-13T08:22:04Z)
- Dependent Multi-Task Learning with Causal Intervention for Image Captioning [10.6405791176668]
In this paper, we propose a dependent multi-task learning framework with the causal intervention (DMTCI)
Firstly, we involve an intermediate task, bag-of-categories generation, before the final task, image captioning.
Secondly, we apply Pearl's do-calculus on the model, cutting off the link between the visual features and possible confounders.
Finally, we use a multi-agent reinforcement learning strategy to enable end-to-end training and reduce the inter-task error accumulations.
arXiv Detail & Related papers (2021-05-18T14:57:33Z)