Measuring How (Not Just Whether) VLMs Build Common Ground
- URL: http://arxiv.org/abs/2509.03805v1
- Date: Thu, 04 Sep 2025 01:43:49 GMT
- Title: Measuring How (Not Just Whether) VLMs Build Common Ground
- Authors: Saki Imai, Mert İnan, Anthony Sicilia, Malihe Alikhani
- Abstract summary: We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall.
- Score: 29.960223851833785
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large vision language models (VLMs) increasingly claim reasoning skills, yet current benchmarks evaluate them in single-turn or question answering settings. However, grounding is an interactive process in which people gradually develop shared understanding through ongoing communication. We introduce a four-metric suite (grounding efficiency, content alignment, lexical adaptation, and human-likeness) to systematically evaluate VLM performance in interactive grounding contexts. We deploy the suite on 150 self-play sessions of interactive referential games between three proprietary VLMs and compare them with human dyads. All three models diverge from human patterns on at least three metrics, while GPT4o-mini is the closest overall. We find that (i) task success scores do not indicate successful grounding and (ii) high image-utterance alignment does not necessarily predict task success. Our metric suite and findings offer a framework for future research on VLM grounding.
Related papers
- Spatial Reasoning with Vision-Language Models in Ego-Centric Multi-View Scenes [14.268621981134293]
Understanding 3D spatial relationships remains a major limitation of current Vision-Language Models (VLMs). We introduce Ego3D-Bench, a new benchmark designed to evaluate the spatial reasoning abilities of VLMs using ego-centric, multi-view outdoor data. We propose Ego3D-VLM, a post-training framework that enhances 3D spatial reasoning of VLMs.
arXiv Detail & Related papers (2025-09-08T01:08:41Z)
- AHELM: A Holistic Evaluation of Audio-Language Models [78.20477815156484]
Multimodal audio-language models (ALMs) take interleaved audio and text as input and output text. AHELM is a benchmark that aggregates various datasets, including 2 new synthetic audio-text datasets called PARADE and CoRe-Bench. We also standardize the prompts, inference parameters, and evaluation metrics to ensure equitable comparisons across models.
arXiv Detail & Related papers (2025-08-29T07:40:39Z)
- Rethinking Human-Object Interaction Evaluation for both Vision-Language Models and HOI-Specific Methods [33.074167753966314]
We introduce a new benchmarking dataset that reformulates HOI detection as a multiple-answer multiple-choice task. Our results show that large VLMs already surpass state-of-the-art HOI-specific methods across most metrics.
arXiv Detail & Related papers (2025-08-26T07:30:53Z)
- A Survey on Video Temporal Grounding with Multimodal Large Language Model [107.24431595873808]
Recent advancement in video temporal grounding (VTG) has significantly enhanced fine-grained video understanding. With superior multimodal comprehension and reasoning abilities, VTG approaches based on MLLMs (VTG-MLLMs) are gradually surpassing traditional fine-tuned methods. Despite extensive surveys on general video-language understanding, comprehensive reviews specifically addressing VTG-MLLMs remain scarce.
arXiv Detail & Related papers (2025-08-07T08:52:11Z)
- SoMi-ToM: Evaluating Multi-Perspective Theory of Mind in Embodied Social Interactions [21.149270997910403]
The SoMi-ToM benchmark is designed to evaluate multi-perspective ToM in embodied multi-agent complex social interactions. We constructed a challenging dataset containing 35 third-person perspective videos, 363 first-person perspective images, and 1225 expert-annotated multiple-choice questions. Results show that LVLMs perform significantly worse than humans on SoMi-ToM.
arXiv Detail & Related papers (2025-06-29T00:54:13Z)
- If an LLM Were a Character, Would It Know Its Own Story? Evaluating Lifelong Learning in LLMs [55.8331366739144]
We introduce LIFESTATE-BENCH, a benchmark designed to assess lifelong learning in large language models (LLMs). Our fact checking evaluation probes models' self-awareness, episodic memory retrieval, and relationship tracking, across both parametric and non-parametric approaches.
arXiv Detail & Related papers (2025-03-30T16:50:57Z)
- Navigating Rifts in Human-LLM Grounding: Study and Benchmark [30.579037010055092]
We analyze logs from three human-assistant datasets: WildChat, MultiWOZ, and Bing Chat. Our findings reveal significant differences in human-human and human-LLM grounding. Early grounding failures predict later interaction breakdowns.
arXiv Detail & Related papers (2025-03-18T07:24:05Z)
- VHELM: A Holistic Evaluation of Vision Language Models [75.88987277686914]
We present the Holistic Evaluation of Vision Language Models (VHELM).
VHELM aggregates various datasets to cover one or more of the 9 aspects: visual perception, knowledge, reasoning, bias, fairness, multilinguality, robustness, toxicity, and safety.
Our framework is designed to be lightweight and automatic so that evaluation runs are cheap and fast.
arXiv Detail & Related papers (2024-10-09T17:46:34Z)
- ING-VP: MLLMs cannot Play Easy Vision-based Games Yet [40.851540679589256]
Multimodal large language models (MLLMs) continue to demonstrate increasingly competitive performance across a broad spectrum of tasks.
Existing multimodal benchmarks fall short in providing a focused evaluation of multi-step planning based on spatial relationships in images.
We present ING-VP, the first INteractive Game-based Vision Planning benchmark, specifically designed to evaluate the spatial imagination and multi-step reasoning abilities of MLLMs.
arXiv Detail & Related papers (2024-10-09T05:17:38Z)
- MM-R$^3$: On (In-)Consistency of Vision-Language Models (VLMs) [26.475993408532304]
We analyze performance of SoTA Vision Language Models on three tasks: Question Rephrasing, Image Restyling, and Context Reasoning. Our analysis reveals that consistency does not always align with accuracy, indicating that models with higher accuracy are not necessarily more consistent, and vice versa. We propose a simple yet effective mitigation strategy in the form of an adapter module trained to minimize inconsistency across prompts.
arXiv Detail & Related papers (2024-10-07T06:36:55Z)
- Prismatic VLMs: Investigating the Design Space of Visually-Conditioned Language Models [73.40350756742231]
Visually-conditioned language models (VLMs) have seen growing adoption in applications such as visual dialogue, scene understanding, and robotic task planning.
Despite the volume of new releases, key design decisions around image preprocessing, architecture, and optimization are under-explored.
arXiv Detail & Related papers (2024-02-12T18:21:14Z)
- Evaluating Human-Language Model Interaction [79.33022878034627]
We develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems.
We design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation.
We find that better non-interactive performance does not always translate to better human-LM interaction.
arXiv Detail & Related papers (2022-12-19T18:59:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.