Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding
- URL: http://arxiv.org/abs/2603.04977v1
- Date: Thu, 05 Mar 2026 09:16:07 GMT
- Title: Think, Then Verify: A Hypothesis-Verification Multi-Agent Framework for Long Video Understanding
- Authors: Zheng Wang, Haoran Chen, Haoxuan Qin, Zhipeng Wei, Tianwen Qian, Cong Bai
- Abstract summary: VideoHV-Agent is a framework that reformulates video question answering as a structured hypothesis-verification process. We show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost.
- Score: 25.82963105515627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long video understanding is challenging due to dense visual redundancy, long-range temporal dependencies, and the tendency of chain-of-thought and retrieval-based agents to accumulate semantic drift and correlation-driven errors. We argue that long-video reasoning should begin not with reactive retrieval, but with deliberate task formulation: the model must first articulate what must be true in the video for each candidate answer to hold. This thinking-before-finding principle motivates VideoHV-Agent, a framework that reformulates video question answering as a structured hypothesis-verification process. Based on video summaries, a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue specifying what evidence must be checked, a Verifier grounds and tests the clue using localized, fine-grained video content, and an Answer agent integrates validated evidence to produce the final answer. Experiments on three long-video understanding benchmarks show that VideoHV-Agent achieves state-of-the-art accuracy while providing enhanced interpretability, improved logical soundness, and lower computational cost. We make our code publicly available at: https://github.com/Haorane/VideoHV-Agent.
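The abstract describes a four-stage agent pipeline: a Thinker rewrites answer candidates into testable hypotheses, a Judge derives a discriminative clue, a Verifier grounds and tests the clue against localized video content, and an Answer agent integrates the results. The Python sketch below is a minimal, hypothetical rendering of that loop, not the released VideoHV-Agent code: the `llm` callable, the `retrieve_clip` helper, and all class and function names are illustrative assumptions.

```python
# Minimal sketch of the hypothesis-verification loop described in the
# abstract. All names and the `llm`/`retrieve_clip` callables are
# hypothetical stand-ins, not the actual VideoHV-Agent API.
from dataclasses import dataclass
from typing import Callable, List

LLM = Callable[[str], str]  # any text-in/text-out model endpoint


@dataclass
class Hypothesis:
    candidate: str     # original answer option
    statement: str     # testable claim about the video
    clue: str = ""     # discriminative evidence to check
    verdict: str = ""  # "supported" / "refuted" after verification


def thinker(llm: LLM, summary: str, question: str, candidates: List[str]) -> List[Hypothesis]:
    """Rewrite each answer candidate into a testable hypothesis."""
    return [
        Hypothesis(c, llm(f"Summary: {summary}\nQuestion: {question}\n"
                          f"State what must be true in the video if the answer is '{c}'."))
        for c in candidates
    ]


def judge(llm: LLM, h: Hypothesis) -> Hypothesis:
    """Derive the discriminative clue: what evidence decides the hypothesis."""
    h.clue = llm(f"Hypothesis: {h.statement}\n"
                 "What single piece of visual evidence would confirm or refute this?")
    return h


def verifier(llm: LLM, h: Hypothesis, retrieve_clip: Callable[[str], str]) -> Hypothesis:
    """Ground the clue in localized, fine-grained video content and test it."""
    clip_description = retrieve_clip(h.clue)  # content of the clip(s) the clue points to
    h.verdict = llm(f"Evidence: {clip_description}\nDoes this support '{h.statement}'? "
                    "Reply 'supported' or 'refuted'.")
    return h


def answer_agent(llm: LLM, question: str, hypotheses: List[Hypothesis]) -> str:
    """Integrate validated evidence into the final answer."""
    report = "\n".join(f"{h.candidate}: {h.verdict}" for h in hypotheses)
    return llm(f"Question: {question}\nVerification results:\n{report}\nFinal answer:")


def video_hv_answer(llm: LLM, summary: str, question: str,
                    candidates: List[str], retrieve_clip: Callable[[str], str]) -> str:
    """End-to-end pipeline: Thinker -> Judge -> Verifier -> Answer."""
    hyps = [judge(llm, h) for h in thinker(llm, summary, question, candidates)]
    hyps = [verifier(llm, h, retrieve_clip) for h in hyps]
    return answer_agent(llm, question, hyps)
```

Under this reading, each answer option is first turned into a falsifiable statement, so the Verifier only needs to inspect the localized clips the clue points to rather than re-reasoning over the full video, which is consistent with the lower computational cost the abstract claims.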
Related papers
- Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning [32.71093573332936]
VideoDR is the first video deep research benchmark for studying video agents in open-web settings. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence.
arXiv Detail & Related papers (2026-01-11T15:07:37Z)
- VideoAuto-R1: Video Auto Reasoning via Thinking Once, Answering Twice [88.93674345138054]
Chain-of-thought (CoT) reasoning has emerged as a powerful tool for multimodal large language models on video understanding tasks. We propose VideoAuto-R1, a video understanding framework that adopts a reason-when-necessary strategy.
arXiv Detail & Related papers (2026-01-08T18:00:59Z)
- Video-BrowseComp: Benchmarking Agentic Video Research on Open Web [64.53060049124961]
Video-BrowseComp is a benchmark comprising 210 questions tailored for open-web agentic video reasoning. It enforces a mandatory dependency on temporal visual evidence, ensuring answers cannot be derived solely through text search. As the first open-web video research benchmark, Video-BrowseComp advances the field beyond passive perception toward proactive video reasoning.
arXiv Detail & Related papers (2025-12-28T19:08:27Z)
- Video-R2: Reinforcing Consistent and Grounded Reasoning in Multimodal Language Models [56.851611990473174]
Reasoning over dynamic visual content remains a central challenge for large language models. We propose a reinforcement learning approach that enhances both temporal precision and reasoning consistency. The resulting model, Video-R2, achieves consistently higher TAC, VAS, and accuracy across multiple benchmarks.
arXiv Detail & Related papers (2025-11-28T18:59:58Z)
- Video-CoM: Interactive Video Reasoning via Chain of Manipulations [78.64256470920166]
We introduce Interactive Video Reasoning, enabling models to "think with videos". Our model, Video-CoM, reasons through a Chain of Manipulations (CoM), performing iterative visual actions to gather and refine evidence. Video-CoM achieves strong results across nine video reasoning benchmarks, improving average performance by 3.6 percent over recent state-of-the-art models.
arXiv Detail & Related papers (2025-11-28T18:59:57Z)
- VideoExplorer: Think With Videos For Agentic Long-Video Understanding [117.68219930263153]
Long-video understanding is a challenging problem in computer vision. We propose VideoExplorer, a framework grounded in the principle of "thinking with video". Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding.
arXiv Detail & Related papers (2025-06-12T15:39:10Z)
- MINERVA: Evaluating Complex Video Reasoning [72.12644008002566]
We provide a new video reasoning dataset called MINERVA for modern multimodal models. Our dataset is multimodal, diverse in terms of video domain and length, and consists of complex multi-step questions. We perform fine-grained error analysis to identify common failure modes across various models, and create a taxonomy of reasoning errors.
arXiv Detail & Related papers (2025-05-01T17:41:49Z)
- Fill-in-the-blank as a Challenging Video Understanding Evaluation Framework [19.031957183047048]
We introduce a novel dataset consisting of 28,000 videos and fill-in-the-blank tests.
We show that both a multimodal model and a strong language model fall well short of human performance.
arXiv Detail & Related papers (2021-04-09T04:00:10Z)