Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports
- URL: http://arxiv.org/abs/2507.05799v1
- Date: Tue, 08 Jul 2025 09:06:47 GMT
- Title: Bridging Perception and Language: A Systematic Benchmark for LVLMs' Understanding of Amodal Completion Reports
- Authors: Amane Watahiki, Tomoki Doi, Taiga Shinozaki, Satoshi Nishida, Takuya Niikawa, Katsunori Miyahara, Hitomi Yanaka
- Abstract summary: We study the inferential abilities of large vision-language models on texts related to amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.
- Score: 10.925743866700037
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: One of the main objectives in developing large vision-language models (LVLMs) is to engineer systems that can assist humans with multimodal tasks, including interpreting descriptions of perceptual experiences. A central phenomenon in this context is amodal completion, in which people perceive objects even when parts of those objects are hidden. Although numerous studies have assessed whether computer-vision algorithms can detect or reconstruct occluded regions, the inferential abilities of LVLMs on texts related to amodal completion remain unexplored. To address this gap, we constructed a benchmark grounded in Basic Formal Ontology to achieve a systematic classification of amodal completion. Our results indicate that while many LVLMs achieve human-comparable performance overall, their accuracy diverges for certain types of objects being completed. Notably, in certain categories, some LLaVA-NeXT variants and Claude 3.5 Sonnet exhibit lower accuracy on original images compared to blank stimuli lacking visual content. Intriguingly, this disparity emerges only under Japanese prompting, suggesting a deficiency in Japanese-specific linguistic competence among these models.
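The central contrast reported above (per-category accuracy on original occluded images versus blank stimuli, under English and Japanese prompts) can be captured in a simple scoring loop. The sketch below is a hypothetical harness, not the authors' published code: `query_model`, the item fields, and the dataset layout are assumed stand-ins for illustration.

```python
# Minimal sketch of the evaluation contrast: accuracy on original (occluded)
# images versus blank stimuli, broken down by object category and prompt
# language. query_model and the item schema are hypothetical stand-ins.
from collections import defaultdict

def query_model(stimulus, prompt):
    """Hypothetical LVLM call; returns the model's answer as a string."""
    raise NotImplementedError  # plug in the model under evaluation here

def evaluate(items, language):
    """items: dicts with 'image', 'blank', 'prompt_en'/'prompt_ja', 'gold', 'category'.
    language: 'en' or 'ja'."""
    correct = defaultdict(lambda: {"original": 0, "blank": 0})
    total = defaultdict(int)
    for item in items:
        prompt = item[f"prompt_{language}"]
        total[item["category"]] += 1
        for condition in ("original", "blank"):
            stimulus = item["image"] if condition == "original" else item["blank"]
            answer = query_model(stimulus, prompt)
            if answer.strip() == item["gold"]:
                correct[item["category"]][condition] += 1
    # Per-category accuracy; a gap with blank > original flags the anomaly
    # reported for some LLaVA-NeXT variants and Claude 3.5 Sonnet under
    # Japanese prompting.
    return {
        cat: {cond: correct[cat][cond] / total[cat] for cond in ("original", "blank")}
        for cat in total
    }
```

Running `evaluate` once per language and diffing the two result tables would surface the Japanese-only divergence the abstract describes.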
Related papers
- Beyond the Visible: Benchmarking Occlusion Perception in Multimodal Large Language Models [17.922450921582794]
Occlusion perception is a critical foundation for human-level spatial understanding. We introduce O-Bench, the first visual question answering (VQA) benchmark specifically designed for occlusion perception.
arXiv Detail & Related papers (2025-08-06T03:39:21Z)
- LLMs Are Not Yet Ready for Deepfake Image Detection [8.364956401923108]
Vision-language models (VLMs) have emerged as promising tools across various domains. This study focuses on three primary deepfake types: faceswap, reenactment, and synthetic generation. Our analysis indicates that while VLMs can produce coherent explanations and detect surface-level anomalies, they are not yet dependable as standalone detection systems.
arXiv Detail & Related papers (2025-06-12T08:27:24Z)
- VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These question types assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
- Benchmarking Large Vision-Language Models on Fine-Grained Image Tasks: A Comprehensive Evaluation [53.84282335629258]
We introduce a comprehensive fine-grained evaluation benchmark, i.e., FG-BMK, comprising 1.01 million questions and 0.33 million images. Our evaluation systematically examines LVLMs from both human-oriented and machine-oriented perspectives. We uncover key findings regarding the influence of training paradigms, modality alignment, perturbation susceptibility, and fine-grained category reasoning on task performance.
arXiv Detail & Related papers (2025-04-21T09:30:41Z)
- InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models [30.986157664865534]
We introduce, for the first time, a comprehensive, multi-level Chinese-based benchmark for evaluating the understanding of implicit meanings in images. This benchmark is systematically categorized into four subtasks: surface-level content understanding, symbolic meaning interpretation, background knowledge comprehension, and implicit meaning comprehension. Using this benchmark, we evaluate 15 open-source large vision language models (LVLMs) and GPT-4o, revealing that even the best-performing model lags behind human performance by nearly 14% in understanding implicit meaning.
arXiv Detail & Related papers (2025-02-19T13:42:37Z)
- Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models [57.95366341738857]
In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap, showing a discrepancy when given textual and visual inputs that correspond to the same concept. We propose a multiple attribute-centric evaluation benchmark, Finer, to evaluate LVLMs' fine-grained visual comprehension ability and provide significantly improved explainability.
arXiv Detail & Related papers (2024-02-26T05:43:51Z)
- CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models [58.95889895912716]
We introduce a new benchmark, named CODIS, designed to assess the ability of models to use context provided in free-form text to enhance visual comprehension.
Our findings indicate that MLLMs consistently fall short of human performance on this benchmark.
This underscores the pressing need to enhance the ability of MLLMs to comprehend visuals in a context-dependent manner.
arXiv Detail & Related papers (2024-02-21T08:21:12Z)
- Can Large Language Models Understand Context? [17.196362853457412]
This paper introduces a context understanding benchmark by adapting existing datasets to suit the evaluation of generative models.
Experimental results indicate that pre-trained dense models struggle with understanding more nuanced contextual features when compared to state-of-the-art fine-tuned models.
As LLM compression holds growing significance in both research and real-world applications, we assess the context understanding of quantized models under in-context-learning settings.
arXiv Detail & Related papers (2024-02-01T18:55:29Z)
- Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models [50.653838482083614]
This paper introduces a scalable test-bed to assess the capabilities of IT-LVLMs on fundamental computer vision tasks. MERLIM contains over 300K image-question pairs and has a strong focus on detecting cross-modal "hallucination" events in IT-LVLMs.
arXiv Detail & Related papers (2023-12-03T16:39:36Z)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds [59.71218039095155]
We evaluate language understanding capacities on simple inference tasks that most humans find trivial.
We target (i) grammatically-specified entailments, (ii) premises with evidential adverbs of uncertainty, and (iii) monotonicity entailments.
The models exhibit moderate to low performance on these evaluation sets.
arXiv Detail & Related papers (2023-05-24T06:41:09Z)