ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
- URL: http://arxiv.org/abs/2512.01424v3
- Date: Thu, 04 Dec 2025 07:45:10 GMT
- Title: ViRectify: A Challenging Benchmark for Video Reasoning Correction with Multimodal Large Language Models
- Authors: Xusen Hei, Jiali Chen, Jinyu Yang, Mengchen Zhao, Yi Cai,
- Abstract summary: We construct a dataset of over 30K instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains.<n>In ViRectify, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding.<n>In addition, we propose the trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction.
- Score: 23.37951284612929
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: As multimodal large language models (MLLMs) frequently exhibit errors in complex video reasoning scenarios, correcting these errors is critical for uncovering their weaknesses and improving performance. However, existing benchmarks lack systematic evaluation of MLLMs' ability to identify and correct these video reasoning errors. To bridge this gap, we propose ViRectify, a comprehensive benchmark to evaluate their fine-grained correction capability. Through an AI-assisted annotation pipeline with human verification, we construct a dataset of over 30K instances spanning dynamic perception, scientific reasoning, and embodied decision-making domains. In ViRectify, we challenge MLLMs to perform step-wise error identification and generate rationales with key video evidence grounding. In addition, we further propose the trajectory evidence-driven correction framework, comprising step-wise error trajectory and reward modeling on visual evidence-grounded correction. It encourages the model to explicitly concentrate on error propagation and key timestamps for correction. Extensive evaluation across 16 advanced MLLMs demonstrates that our ViRectify serves as a challenging testbed, where GPT-5 achieves only 31.94% correction accuracy. Our framework enables a Qwen2.5-VL-7B to consistently outperform the variants of 72B on ViRectify, showing the effectiveness of our approach. Further analysis uncovers systematic asymmetries in error correction across models, and our dataset is also a valuable data resource to perform reflection learning. We believe ViRectify provides a new direction for comprehensively evaluating the advanced MLLMs in video reasoning.
Related papers
- History-Guided Iterative Visual Reasoning with Self-Correction [29.952971071458574]
Self-consistency methods are the core technique for improving the reasoning reliability of large language models (MLLMs)<n>We propose the H-GIVR framework, inspired by the human reasoning behavior of repeated verification and dynamic error correction.<n>The framework can significantly improve cross-modal reasoning accuracy while maintaining low computational cost.
arXiv Detail & Related papers (2026-02-04T10:42:06Z) - ReCALL: Recalibrating Capability Degradation for MLLM-based Composed Image Retrieval [64.14282916266998]
Composed Image Retrieval aims to retrieve target images based on a hybrid query comprising a reference image and a modification text.<n>We propose ReCALL, a model-agnostic framework that follows a diagnose-generate-refine pipeline.<n>Experiments on CIRR and FashionIQ show that ReCALL consistently recalibrates degraded capabilities and achieves state-of-the-art performance.
arXiv Detail & Related papers (2026-02-02T04:52:54Z) - Vision Language Models Are Not (Yet) Spelling Correctors [0.742779257315787]
Spelling correction from visual input poses unique challenges for vision language models (VLMs)<n>We present ReViCo, the first benchmark that systematically evaluates VLMs on real-world visual spelling correction across Chinese and English.
arXiv Detail & Related papers (2025-09-22T07:10:42Z) - Towards Automated Error Discovery: A Study in Conversational AI [48.735443116662026]
We introduce Automated Error Discovery, a framework for detecting and defining errors in conversational AI.<n>We also propose SEEED (Soft Clustering Extended-Based Error Detection), as an encoder-based approach to its implementation.
arXiv Detail & Related papers (2025-09-13T14:53:22Z) - Robust Diagram Reasoning: A Framework for Enhancing LVLM Performance on Visually Perturbed Scientific Diagrams [0.81996963503528]
Large Language Models (LLMs) and their multimodal variants (LVLMs) hold immense promise for scientific and engineering applications.<n>Existing evaluation benchmarks largely overlook this challenge, leaving the robust reasoning capabilities of LVLMs underexplored.<n>We introduce the Robust Diagram Reasoning (RDR) framework, a novel approach designed to enhance and rigorously evaluate LVLMs' performance under such conditions.
arXiv Detail & Related papers (2025-08-23T09:50:58Z) - Unblocking Fine-Grained Evaluation of Detailed Captions: An Explaining AutoRater and Critic-and-Revise Pipeline [58.832237984587664]
We develop VNLI-Critique, a model for automated sentence-level factuality classification and critique generation.<n>We highlight three key applications: (1) VNLI-Critique demonstrates robust generalization, validated by state-of-the-art performance on the M-HalDetect benchmark; (2) The VNLI-Critique driven AutoRater for DOCCI-Critique provides reliable VLM rankings, showing excellent alignment with human factuality judgments; and (3) An innovative Critic-and-Revise pipeline, achieves substantial improvements in caption factuality.
arXiv Detail & Related papers (2025-06-09T10:57:26Z) - Vad-R1: Towards Video Anomaly Reasoning via Perception-to-Cognition Chain-of-Thought [58.321044666612174]
Vad-R1 is an end-to-end MLLM-based framework for Video Anomaly Reasoning.<n>We design a Perception-to-Cognition Chain-of-Thought (P2C-CoT) that simulates the human process of recognizing anomalies.<n>We also propose an improved reinforcement learning algorithm AVA-GRPO, which explicitly incentivizes the anomaly reasoning capability of MLLMs.
arXiv Detail & Related papers (2025-05-26T12:05:16Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications.<n>Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth.<n>Analysis of these prompt scores reveals VLM biases and AND''/OR' signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - Self-Evolving Critique Abilities in Large Language Models [59.861013614500024]
This paper explores enhancing critique abilities of Large Language Models (LLMs)<n>We introduce SCRIT, a framework that trains LLMs with self-generated data to evolve their critique abilities.<n>Our analysis reveals that SCRIT's performance scales positively with data and model size.
arXiv Detail & Related papers (2025-01-10T05:51:52Z) - Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage [50.84150600032693]
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations.<n>We propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions.<n>Our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V.
arXiv Detail & Related papers (2024-12-20T01:37:22Z) - MQM-APE: Toward High-Quality Error Annotation Predictors with Automatic Post-Editing in LLM Translation Evaluators [53.91199933655421]
Large Language Models (LLMs) have shown significant potential as judges for Machine Translation (MT) quality assessment.<n>We introduce a universal and training-free framework, $textbfMQM-APE, based on the idea of filtering out non-impactful errors.<n>Experiments show that our approach consistently improves both the reliability and quality of error spans against GEMBA-MQM.
arXiv Detail & Related papers (2024-09-22T06:43:40Z) - RETAIN: Interactive Tool for Regression Testing Guided LLM Migration [8.378294455013284]
RETAIN (REgression Testing guided LLM migrAtIoN) is a tool designed explicitly for regression testing in LLM Migrations.
Our automatic evaluation and empirical user studies demonstrate that RETAIN, when compared to manual evaluation, enabled participants to identify twice as many errors, facilitated experimentation with 75% more prompts, and achieves 12% higher metric scores in a given time frame.
arXiv Detail & Related papers (2024-09-05T22:22:57Z) - Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors [11.07539342949602]
We propose an end-to-end framework for detecting factual errors in text summarization.
Our framework uses a diverse set of LLM prompts to identify factual inconsistencies.
We calibrate the ensembled models to produce empirically accurate probabilities that a text is factually consistent or free of hallucination.
arXiv Detail & Related papers (2024-06-18T18:59:37Z) - Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance [51.30560006045442]
Image-gRounded guIdaNcE (MARINE) is a framework that is both training-free and API-free.<n>MARINE effectively and efficiently reduces object hallucinations during inference by introducing image-grounded guidance to LVLMs.<n>Our framework's flexibility further allows for the integration of multiple vision models, enabling more reliable and robust object-level guidance.
arXiv Detail & Related papers (2024-02-13T18:59:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.