Related papers: Evaluating Self-Correcting Vision Agents Through Quantitative and Qualitative Metrics

Evaluating Self-Correcting Vision Agents Through Quantitative and Qualitative Metrics

URL: http://arxiv.org/abs/2601.11637v1
Date: Wed, 14 Jan 2026 15:17:11 GMT
Title: Evaluating Self-Correcting Vision Agents Through Quantitative and Qualitative Metrics
Authors: Aradhya Dixit,
Abstract summary: Vision-Language Agents (VLAs) can decompose complex visual tasks into executable tool-based plans.<n>Recent benchmarks have begun to evaluate iterative self-correction, but its quantitative limits and dominant reasoning bottlenecks remain poorly characterized.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent progress in multimodal foundation models has enabled Vision-Language Agents (VLAs) to decompose complex visual tasks into executable tool-based plans. While recent benchmarks have begun to evaluate iterative self-correction, its quantitative limits and dominant reasoning bottlenecks remain poorly characterized. This work introduces a Diagnostic Micro-Benchmark. Our analysis decouples Task Success Rate (TSR = 62 percent) from Correction Success Rate (CSR = 25 to 33 percent), revealing that initial competence does not predict repair ability. We explicitly quantify the diminishing returns of correction, which saturates after three retries. Our Failure Taxonomy reveals a frequent factor is Semantic Drift (about 28 percent of failures), a loss of contextual state. By isolating this reasoning bottleneck, this benchmark defines a reproducible framework toward stateful, trustworthy multimodal agents.

Related papers

Observationally Informed Adaptive Causal Experimental Design [55.998153710215654]
We propose Active Residual Learning, a new paradigm that leverages the observational model as a foundational prior.<n>This approach shifts the experimental focus from learning target causal quantities from scratch to efficiently estimating the residuals required to correct observational bias.<n> Experiments on synthetic and semi-synthetic benchmarks demonstrate that R-Design significantly outperforms baselines.
arXiv Detail & Related papers (2026-03-04T06:52:37Z)
DatBench: Discriminative, Faithful, and Efficient VLM Evaluations [17.234602646114997]
Empirical evaluation serves as the primary compass guiding research progress in foundation models.<n>We propose three desiderata that evaluations should satisfy: faithfulness to the modality and application, discriminability between models of varying quality, and efficiency in compute.<n>We release DatBench-Full, a cleaned evaluation suite of 33 datasets spanning nine VLM capabilities, and DatBench, a discriminative subset that achieves 13x average speedup.
arXiv Detail & Related papers (2026-01-05T18:07:51Z)
Reflective Confidence: Correcting Reasoning Flaws via Online Self-Correction [14.164508061248775]
Large language models (LLMs) have achieved strong performance on complex reasoning tasks using techniques such as chain-of-thought and self-consistency.<n>We propose reflective confidence, a novel reasoning framework that transforms low-confidence signals from termination indicators into reflection triggers.<n> Experiments on mathematical reasoning benchmarks, including AIME 2025, demonstrate significant accuracy improvements over advanced early-stopping baselines at comparable computational cost.
arXiv Detail & Related papers (2025-12-21T05:35:07Z)
Continual Action Quality Assessment via Adaptive Manifold-Aligned Graph Regularization [53.82400605816587]
Action Quality Assessment (AQA) quantifies human actions in videos, supporting applications in sports scoring, rehabilitation, and skill evaluation.<n>A major challenge lies in the non-stationary nature of quality distributions in real-world scenarios.<n>We introduce Continual AQA (CAQA), which equips with Continual Learning capabilities to handle evolving distributions.
arXiv Detail & Related papers (2025-10-08T10:09:47Z)
An Empirical Study on Failures in Automated Issue Solving [12.571536148821144]
We analyze the performance and efficiency of three SOTA tools, spanning both pipeline-based and agentic architectures, in automated issue solving tasks of SWE-Bench-Verified.<n>To move from high-level performance metrics to underlying cause analysis, we conducted a systematic manual analysis of 150 failed instances.<n>The results reveal distinct failure fingerprints between the two architectural paradigms, with the majority of agentic failures stemming from flawed reasoning and cognitive deadlocks.
arXiv Detail & Related papers (2025-09-17T13:07:52Z)
Revisiting Reliability in the Reasoning-based Pose Estimation Benchmark [27.134554623769898]
The reasoning-based pose estimation (RPE) benchmark has emerged as a widely adopted evaluation standard for pose-aware large language models (MLLMs)<n>We identified critical and benchmark-quality issues that hinder fair and consistent quantitative evaluations.
arXiv Detail & Related papers (2025-07-17T17:33:11Z)
APHQ-ViT: Post-Training Quantization with Average Perturbation Hessian Based Reconstruction for Vision Transformers [71.2294205496784]
We propose textbfAPHQ-ViT, a novel PTQ approach based on importance estimation with Average Perturbation Hessian (APH)<n>We show that APHQ-ViT using linear quantizers outperforms existing PTQ methods by substantial margins in 3-bit and 4-bit across different vision tasks.
arXiv Detail & Related papers (2025-04-03T11:48:56Z)
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning [112.35483894933904]
We propose VISCO, the first benchmark to extensively analyze the fine-grained critique and correction capabilities of LVLMs.<n>VISCO features dense and fine-grained critique, requiring LVLMs to evaluate the correctness of each step in the chain-of-thought.<n>LookBack significantly improves critique and correction performance by up to 13.5%.
arXiv Detail & Related papers (2024-12-03T05:04:49Z)
Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL. We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to measure the evolving property of Q-network at training. For the first time, our theory can reliably decide whether the training will diverge at an early stage.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)
Beyond calibration: estimating the grouping loss of modern neural networks [68.8204255655161]
Proper scoring rule theory shows that given the calibration loss, the missing piece to characterize individual errors is the grouping loss. We show that modern neural network architectures in vision and NLP exhibit grouping loss, notably in distribution shifts settings.
arXiv Detail & Related papers (2022-10-28T07:04:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.