Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images
- URL: http://arxiv.org/abs/2601.11633v1
- Date: Wed, 14 Jan 2026 07:25:15 GMT
- Title: Beyond Accuracy: Evaluating Grounded Visual Evidence in Thinking with Images
- Authors: Xuchen Li, Xuzhao Li, Renjie Pi, Shiyu Hu, Jian Zhao, Jiahui Gao, et al.
- Abstract summary: We propose ViEBench, a process-verifiable benchmark designed to evaluate faithful visual reasoning. Comprising 200 multi-scenario high-resolution images with expert-annotated visual evidence, ViEBench categorizes tasks by difficulty into perception and reasoning dimensions. Our experiments yield several interesting observations: (1) VLMs can sometimes produce correct final answers despite grounding on irrelevant regions, and (2) they may successfully locate the correct evidence but still fail to utilize it to reach accurate conclusions.
- Score: 34.324634481264034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite the remarkable progress of Vision-Language Models (VLMs) in adopting "Thinking-with-Images" capabilities, accurately evaluating the authenticity of their reasoning process remains a critical challenge. Existing benchmarks mainly rely on outcome-oriented accuracy, lacking the capability to assess whether models can accurately leverage fine-grained visual cues for multi-step reasoning. To address these limitations, we propose ViEBench, a process-verifiable benchmark designed to evaluate faithful visual reasoning. Comprising 200 multi-scenario high-resolution images with expert-annotated visual evidence, ViEBench uniquely categorizes tasks by difficulty into perception and reasoning dimensions, where reasoning tasks require utilizing localized visual details with prior knowledge. To establish comprehensive evaluation criteria, we introduce a dual-axis matrix that provides fine-grained metrics through four diagnostic quadrants, enabling transparent diagnosis of model behavior across varying task complexities. Our experiments yield several interesting observations: (1) VLMs can sometimes produce correct final answers despite grounding on irrelevant regions, and (2) they may successfully locate the correct evidence but still fail to utilize it to reach accurate conclusions. Our findings demonstrate that ViEBench can serve as a more explainable and practical benchmark for comprehensively evaluating the effectiveness of agentic VLMs. The code will be released at: https://github.com/Xuchen-Li/ViEBench.
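The dual-axis matrix crosses whether the model grounded the annotated evidence with whether its final answer is correct, yielding four diagnostic quadrants. A minimal sketch of that classification is below; the bounding-box representation of evidence, the IoU test with a 0.5 threshold, and the quadrant names are assumptions made here for illustration, since the abstract does not specify ViEBench's exact grounding criterion.

```python
# Hedged sketch (not the authors' released code) of binning one model
# response into a quadrant of the dual-axis (grounding x answer) matrix.
# Assumptions: evidence is an axis-aligned box, grounding is judged by IoU
# against the expert annotation, 0.5 is the threshold, and the quadrant
# labels are descriptive names chosen here, not ViEBench's terminology.
from dataclasses import dataclass


@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a.x2 - a.x1) * (a.y2 - a.y1)
    area_b = (b.x2 - b.x1) * (b.y2 - b.y1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def quadrant(pred_evidence: Box, gt_evidence: Box,
             answer_correct: bool, iou_thresh: float = 0.5) -> str:
    """Place one response on the dual-axis (grounding x answer) matrix."""
    grounded = iou(pred_evidence, gt_evidence) >= iou_thresh
    if grounded and answer_correct:
        return "faithful success"        # right evidence, right answer
    if grounded:
        return "grounded but misused"    # observation (2) in the abstract
    if answer_correct:
        return "ungrounded guess"        # observation (1) in the abstract
    return "full failure"                # wrong evidence, wrong answer


# Example: a correct answer grounded far from the annotated evidence
# falls into the "ungrounded guess" quadrant.
print(quadrant(Box(0, 0, 10, 10), Box(200, 200, 260, 260), answer_correct=True))
```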
Related papers
- LTD-Bench: Evaluating Large Language Models by Letting Them Draw [57.237152905238084]
LTD-Bench is a breakthrough benchmark for large language models (LLMs). It transforms LLM evaluation from abstract scores to directly observable visual outputs by requiring models to generate drawings through dot matrices or executable code. LTD-Bench's visual outputs enable powerful diagnostic analysis, offering a potential approach to investigate model similarity.
arXiv Detail & Related papers (2025-11-04T08:11:23Z)
- Demystifying deep search: a holistic evaluation with hint-free multi-hop questions and factorised metrics [89.1999907891494]
We present WebDetective, a benchmark of hint-free multi-hop questions paired with a controlled Wikipedia sandbox. Our evaluation of 25 state-of-the-art models reveals systematic weaknesses across all architectures. We develop an agentic workflow, EvidenceLoop, that explicitly targets the challenges our benchmark identifies.
arXiv Detail & Related papers (2025-10-01T07:59:03Z)
- Probing Vision-Language Understanding through the Visual Entailment Task: promises and pitfalls [0.10923877073891446]
This study investigates the extent to which the Visual Entailment task serves as a reliable probe of vision-language understanding in multimodal language models. We conduct experiments across zero-shot, few-shot, and fine-tuning settings, exploring how factors such as prompt design might affect VE performance. Fine-tuning yields strong results, achieving an accuracy of 83.3% on the e-SNLI-VE dataset and outperforming the state-of-the-art OFA-X model.
arXiv Detail & Related papers (2025-07-23T12:46:51Z)
- VLMs have Tunnel Vision: Evaluating Nonlocal Visual Reasoning in Leading VLMs [18.349695067647012]
Visual Language Models excel at complex visual tasks such as VQA and chart understanding, yet recent work suggests they struggle with simple tests. We present an evaluation that tests vision-language models' capacity for nonlocal visual reasoning. Our findings show that despite gains in raw visual acuity, current models lack core visual reasoning capabilities.
arXiv Detail & Related papers (2025-07-04T23:15:52Z)
- BYO-Eval: Build Your Own Dataset for Fine-Grained Visual Assessment of Multimodal Language Models [2.526146573337397]
We propose a new evaluation methodology, inspired by ophthalmologic diagnostics. We use procedural generation of synthetic images to obtain control over visual attributes. This diagnostic allows systematic stress testing and fine-grained failure analysis.
arXiv Detail & Related papers (2025-06-05T12:43:10Z)
- VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning [56.99825489208698]
We introduce VisionReasoner, a unified framework capable of reasoning and solving multiple visual perception tasks. VisionReasoner enhances its reasoning capabilities to analyze visual inputs and addresses diverse perception tasks within a unified model. We evaluate VisionReasoner on ten diverse tasks spanning three critical domains: detection, segmentation, and counting.
arXiv Detail & Related papers (2025-05-17T16:51:47Z)
- Envisioning Beyond the Pixels: Benchmarking Reasoning-Informed Visual Editing [84.16442052968615]
We introduce RISEBench, the first benchmark for evaluating Reasoning-Informed viSual Editing (RISE). RISEBench focuses on four key reasoning categories: Temporal, Causal, Spatial, and Logical Reasoning. We conduct experiments evaluating nine prominent visual editing models, comprising both open-source and proprietary models.
arXiv Detail & Related papers (2025-04-03T17:59:56Z)
- A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning [9.786907179872815]
The potential of vision and language remains underexplored in face forgery detection.
There is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task.
We propose a multi-staged approach that diverges from the traditional binary decision paradigm to address this gap.
arXiv Detail & Related papers (2024-10-01T08:16:40Z)
- Freeview Sketching: View-Aware Fine-Grained Sketch-Based Image Retrieval [85.73149096516543]
We address the choice of viewpoint during sketch creation in Fine-Grained Sketch-Based Image Retrieval (FG-SBIR).
A pilot study highlights the system's struggle when query-sketches differ in viewpoint from target instances.
To reconcile this, we advocate for a view-aware system, seamlessly accommodating both view-agnostic and view-specific tasks.
arXiv Detail & Related papers (2024-07-01T21:20:44Z)
- VALOR-EVAL: Holistic Coverage and Faithfulness Evaluation of Large Vision-Language Models [57.43276586087863]
Large Vision-Language Models (LVLMs) suffer from hallucination issues, wherein the models generate plausible-sounding but factually incorrect outputs.
Existing benchmarks are often limited in scope, focusing mainly on object hallucinations.
We introduce a multi-dimensional benchmark covering objects, attributes, and relations, with challenging images selected based on associative biases.
arXiv Detail & Related papers (2024-04-22T04:49:22Z)