CAViAR: Critic-Augmented Video Agentic Reasoning
- URL: http://arxiv.org/abs/2509.07680v1
- Date: Tue, 09 Sep 2025 17:59:39 GMT
- Title: CAViAR: Critic-Augmented Video Agentic Reasoning
- Authors: Sachit Menon, Ahmet Iscen, Arsha Nagrani, Tobias Weyand, Carl Vondrick, Cordelia Schmid
- Abstract summary: We ask: can perception capabilities be leveraged to perform more complex video reasoning? We develop a large language model agent given access to video modules as subagents or tools. We show that the combination of our agent and critic achieves strong performance on video reasoning benchmarks.
- Score: 90.48729440775223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video understanding has seen significant progress in recent years, with models' performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieves strong performance on the previously mentioned datasets.
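To make the recipe concrete, below is a minimal sketch of an agent-and-critic loop of this kind. It is an illustration under assumed interfaces, not CAViAR's actual implementation: `llm.plan`, `critic.score`, the `tools` registry, and the best-of-n selection are all hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str     # which video module was called (e.g., a captioner or retriever)
    args: dict    # arguments the LLM chose for this call
    result: str   # the module's output, fed back to the LLM

@dataclass
class Trajectory:
    query: str
    steps: list = field(default_factory=list)
    answer: str = ""

def run_agent(llm, tools: dict, query: str, max_steps: int = 8) -> Trajectory:
    """The LLM decides each next call from the results so far,
    rather than following a fixed program."""
    traj = Trajectory(query=query)
    for _ in range(max_steps):
        action = llm.plan(query, traj.steps)  # hypothetical: returns a tool call or a final answer
        if action.is_final:
            traj.answer = action.answer
            break
        result = tools[action.tool](**action.args)  # invoke a video module as a subagent
        traj.steps.append(Step(action.tool, action.args, result))
    return traj

def answer_with_critic(llm, critic, tools, query: str, n: int = 4) -> str:
    """Sample several trajectories; the critic distinguishes successful
    from unsuccessful sequences, so keep the highest-scoring one."""
    candidates = [run_agent(llm, tools, query) for _ in range(n)]
    return max(candidates, key=critic.score).answer
```

The key property is that module outputs feed back into planning, and the critic arbitrates among whole sequences rather than individual steps.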
Related papers
- LongVideoAgent: Multi-Agent Reasoning with Long Videos [69.28914905197426]
We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations.
The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation.
On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines.
arXiv Detail & Related papers (2025-12-23T18:59:49Z)
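A compact sketch of the coordination pattern this abstract describes, under assumed interfaces (`master.decide`, `grounding_agent.localize`, and `vision_agent.describe` are hypothetical names; the RL training is out of scope here):

```python
def answer_long_video(master, grounding_agent, vision_agent, video, question, step_limit=6):
    """Hypothetical master loop: coordinate two subagents under a fixed step budget."""
    observations = []
    for _ in range(step_limit):
        move = master.decide(question, observations)  # plan the next move from evidence so far
        if move.kind == "ground":
            # Grounding agent localizes question-relevant segments.
            observations.append(("segments", grounding_agent.localize(video, move.request)))
        elif move.kind == "look":
            # Vision agent extracts targeted textual observations from a segment.
            observations.append(("text", vision_agent.describe(video, move.segment, move.request)))
        else:  # move.kind == "answer": the master is ready to commit
            return move.answer
    # Step budget exhausted: answer from whatever evidence was gathered.
    return master.finalize(question, observations)
```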
- Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding [56.45689495743107]
Vgent is a graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding.
We evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks.
arXiv Detail & Related papers (2025-10-15T19:14:58Z)
- Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding [63.82450803014141]
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity.
We propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips.
Our DVD agent achieves SOTA performance, surpassing prior works by a large margin on the challenging LVBench dataset.
arXiv Detail & Related papers (2025-05-23T16:37:36Z)
- ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation [49.1574468325115]
This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA).
It combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment.
This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks.
arXiv Detail & Related papers (2025-05-21T18:32:43Z)
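The grounding-validation idea lends itself to a small sketch. The YOLO-World call below follows the `ultralytics` package's published interface; the validation logic around it is an illustrative stand-in, not ViQAgent's actual pipeline.

```python
# Sketch: check that the objects an LLM's answer mentions are actually
# detected in sampled frames (illustrative, not ViQAgent's code).
from ultralytics import YOLOWorld  # open-vocabulary detector

def grounding_validate(answer_objects, frames, conf_threshold=0.3):
    """Return the subset of mentioned objects that YOLO-World can ground
    in at least one sampled frame."""
    model = YOLOWorld("yolov8s-world.pt")
    model.set_classes(answer_objects)  # open-vocabulary prompt
    grounded = set()
    for frame in frames:  # frames: list of image paths or arrays
        for result in model.predict(frame, conf=conf_threshold, verbose=False):
            for box in result.boxes:
                grounded.add(answer_objects[int(box.cls)])
    return grounded

# Usage: if the chain-of-thought mentions a bus and a dog,
# grounding_validate(["bus", "dog"], sampled_frames) reports which
# claims are visually supported, so the agent can revise the rest.
```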
- VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding [48.26536049440913]
Video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities.
Their performance drops on out-of-distribution (OOD) tasks that are underrepresented in training data.
Traditional methods like fine-tuning on OOD datasets are impractical due to high computational costs.
We propose VideoICL, a novel video in-context learning framework for OOD tasks.
arXiv Detail & Related papers (2024-12-03T05:54:43Z)
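A minimal sketch of the confidence-gated loop the abstract describes. The relevance ranking and the `lmm.generate` interface (returning an answer plus a confidence estimate) are generic stand-ins, not VideoICL's exact method.

```python
def relevance(example, query: str) -> int:
    """Placeholder similarity: word overlap with the example's question.
    VideoICL's actual example selection may differ."""
    return len(set(example.question.split()) & set(query.split()))

def video_icl(lmm, pool, query: str, k: int = 4, max_rounds: int = 3, tau: float = 0.8):
    """Answer with k in-context examples; if confidence is below tau,
    retry with the next-most-relevant examples rather than using all at once."""
    ranked = sorted(pool, key=lambda ex: relevance(ex, query), reverse=True)
    best_answer, best_conf = None, -1.0
    for r in range(max_rounds):
        examples = ranked[r * k:(r + 1) * k]                 # a fresh slice each round
        answer, confidence = lmm.generate(query, examples)   # assumed API
        if confidence > best_conf:
            best_answer, best_conf = answer, confidence
        if confidence >= tau:                                # confident enough: stop early
            break
    return best_answer
```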
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We present ViLLa: Video reasoning segmentation with Large Language Model.
Our ViLLa manages to tackle these challenges through multiple core innovations.
To enable efficient processing of long videos, ViLLa incorporates a key segment sampler that adaptively partitions long videos into shorter but semantically dense segments for less redundancy.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
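One generic way to realize an adaptive key-segment sampler of the kind mentioned above (an illustrative stand-in; ViLLa's actual sampler may differ): cut the video wherever adjacent frame embeddings diverge, so each segment stays short but semantically dense.

```python
import numpy as np

def key_segments(frame_embeddings: np.ndarray, sim_threshold: float = 0.85):
    """Partition a video into segments by cutting where adjacent frame
    embeddings stop being similar (cosine similarity below threshold)."""
    # Normalize rows so dot products are cosine similarities.
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    segments, start = [], 0
    for i in range(1, len(emb)):
        if float(emb[i] @ emb[i - 1]) < sim_threshold:  # content change detected
            segments.append((start, i))                 # close the current segment
            start = i
    segments.append((start, len(emb)))
    return segments  # list of (start_frame, end_frame) index pairs
```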
- Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR).
arXiv Detail & Related papers (2024-06-16T12:58:31Z)