CAViAR: Critic-Augmented Video Agentic Reasoning
- URL: http://arxiv.org/abs/2509.07680v1
- Date: Tue, 09 Sep 2025 17:59:39 GMT
- Title: CAViAR: Critic-Augmented Video Agentic Reasoning
- Authors: Sachit Menon, Ahmet Iscen, Arsha Nagrani, Tobias Weyand, Carl Vondrick, Cordelia Schmid
- Abstract summary: We ask: can perception capabilities be leveraged to perform more complex video reasoning? We develop a large language model agent given access to video modules as subagents or tools. We show that the combination of our agent and critic achieves strong performance on video reasoning benchmarks.
- Score: 90.48729440775223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video understanding has seen significant progress in recent years, with models' performance on perception from short clips continuing to rise. Yet, multiple recent benchmarks, such as LVBench, Neptune, and ActivityNet-RTL, show performance wanes for tasks requiring complex reasoning on videos as queries grow more complex and videos grow longer. In this work, we ask: can existing perception capabilities be leveraged to successfully perform more complex video reasoning? In particular, we develop a large language model agent given access to video modules as subagents or tools. Rather than following a fixed procedure to solve queries as in previous work such as Visual Programming, ViperGPT, and MoReVQA, the agent uses the results of each call to a module to determine subsequent steps. Inspired by work in the textual reasoning domain, we introduce a critic to distinguish between instances of successful and unsuccessful sequences from the agent. We show that the combination of our agent and critic achieves strong performance on the previously mentioned datasets.
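To make the recipe concrete, below is a minimal sketch of an agent-and-critic loop of this kind. It is an illustration under assumed interfaces, not CAViAR's actual implementation: `llm.plan`, `critic.score`, the `tools` registry, and the best-of-n selection are all hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    tool: str     # which video module was called (e.g., a captioner or retriever)
    args: dict    # arguments the LLM chose for this call
    result: str   # the module's output, fed back to the LLM

@dataclass
class Trajectory:
    query: str
    steps: list = field(default_factory=list)
    answer: str = ""

def run_agent(llm, tools: dict, query: str, max_steps: int = 8) -> Trajectory:
    """The LLM decides each next call from the results so far,
    rather than following a fixed program."""
    traj = Trajectory(query=query)
    for _ in range(max_steps):
        action = llm.plan(query, traj.steps)  # hypothetical: returns a tool call or a final answer
        if action.is_final:
            traj.answer = action.answer
            break
        result = tools[action.tool](**action.args)  # invoke a video module as a subagent
        traj.steps.append(Step(action.tool, action.args, result))
    return traj

def answer_with_critic(llm, critic, tools, query: str, n: int = 4) -> str:
    """Sample several trajectories; the critic distinguishes successful
    from unsuccessful sequences, so keep the highest-scoring one."""
    candidates = [run_agent(llm, tools, query) for _ in range(n)]
    return max(candidates, key=critic.score).answer
```

The key property is that module outputs feed back into planning, and the critic arbitrates among whole sequences rather than individual steps.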
Related papers
- LongVideoAgent: Multi-Agent Reasoning with Long Videos [69.28914905197426]
We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations.
The master agent plans with a step limit, and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation.
On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines.
arXiv Detail & Related papers (2025-12-23T18:59:49Z)
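A compact sketch of the coordination pattern this abstract describes, under assumed interfaces (`master.decide`, `grounding_agent.localize`, and `vision_agent.describe` are hypothetical names; the RL training is out of scope here):

```python
def answer_long_video(master, grounding_agent, vision_agent, video, question, step_limit=6):
    """Hypothetical master loop: coordinate two subagents under a fixed step budget."""
    observations = []
    for _ in range(step_limit):
        move = master.decide(question, observations)  # plan the next move from evidence so far
        if move.kind == "ground":
            # Grounding agent localizes question-relevant segments.
            observations.append(("segments", grounding_agent.localize(video, move.request)))
        elif move.kind == "look":
            # Vision agent extracts targeted textual observations from a segment.
            observations.append(("text", vision_agent.describe(video, move.segment, move.request)))
        else:  # move.kind == "answer": the master is ready to commit
            return move.answer
    # Step budget exhausted: answer from whatever evidence was gathered.
    return master.finalize(question, observations)
```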
- Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding [56.45689495743107]
Vgent is a graph-based retrieval-reasoning-augmented generation framework to enhance LVLMs for long video understanding.
We evaluate our framework with various open-source LVLMs on three long-video understanding benchmarks.
arXiv Detail & Related papers (2025-10-15T19:14:58Z)
- Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding [63.82450803014141]
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity.
We propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips.
Our DVD agent achieves SOTA performance, surpassing prior works by a large margin on the challenging LVBench dataset.
arXiv Detail & Related papers (2025-05-23T16:37:36Z)
- ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation [49.1574468325115]
This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA).
It combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment.
This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on NExT-QA, iVQA, and ActivityNet-QA benchmarks.
arXiv Detail & Related papers (2025-05-21T18:32:43Z)
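The grounding-validation idea lends itself to a small sketch. The YOLO-World call below follows the `ultralytics` package's published interface; the validation logic around it is an illustrative stand-in, not ViQAgent's actual pipeline.

```python
# Sketch: check that the objects an LLM's answer mentions are actually
# detected in sampled frames (illustrative, not ViQAgent's code).
from ultralytics import YOLOWorld  # open-vocabulary detector

def grounding_validate(answer_objects, frames, conf_threshold=0.3):
    """Return the subset of mentioned objects that YOLO-World can ground
    in at least one sampled frame."""
    model = YOLOWorld("yolov8s-world.pt")
    model.set_classes(answer_objects)  # open-vocabulary prompt
    grounded = set()
    for frame in frames:  # frames: list of image paths or arrays
        for result in model.predict(frame, conf=conf_threshold, verbose=False):
            for box in result.boxes:
                grounded.add(answer_objects[int(box.cls)])
    return grounded

# Usage: if the chain-of-thought mentions a bus and a dog,
# grounding_validate(["bus", "dog"], sampled_frames) reports which
# claims are visually supported, so the agent can revise the rest.
```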
- VideoICL: Confidence-based Iterative In-context Learning for Out-of-Distribution Video Understanding [48.26536049440913]
Video large multimodal models (LMMs) have significantly improved their video understanding and reasoning capabilities.
Their performance drops on out-of-distribution (OOD) tasks that are underrepresented in training data.
Traditional methods like fine-tuning on OOD datasets are impractical due to high computational costs.
We propose VideoICL, a novel video in-context learning framework for OOD tasks.
arXiv Detail & Related papers (2024-12-03T05:54:43Z)
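A minimal sketch of the confidence-gated loop the abstract describes. The relevance ranking and the `lmm.generate` interface (returning an answer plus a confidence estimate) are generic stand-ins, not VideoICL's exact method.

```python
def relevance(example, query: str) -> int:
    """Placeholder similarity: word overlap with the example's question.
    VideoICL's actual example selection may differ."""
    return len(set(example.question.split()) & set(query.split()))

def video_icl(lmm, pool, query: str, k: int = 4, max_rounds: int = 3, tau: float = 0.8):
    """Answer with k in-context examples; if confidence is below tau,
    retry with the next-most-relevant examples rather than using all at once."""
    ranked = sorted(pool, key=lambda ex: relevance(ex, query), reverse=True)
    best_answer, best_conf = None, -1.0
    for r in range(max_rounds):
        examples = ranked[r * k:(r + 1) * k]                 # a fresh slice each round
        answer, confidence = lmm.generate(query, examples)   # assumed API
        if confidence > best_conf:
            best_answer, best_conf = answer, confidence
        if confidence >= tau:                                # confident enough: stop early
            break
    return best_answer
```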
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We present ViLLa: Video reasoning segmentation with Large Language Model.
Our ViLLa manages to tackle these challenges through multiple core innovations.
To enable efficient processing of long videos, ViLLa incorporates a key segment sampler that adaptively partitions long videos into shorter but semantically dense segments for less redundancy.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
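One generic way to realize an adaptive key-segment sampler of the kind mentioned above (an illustrative stand-in; ViLLa's actual sampler may differ): cut the video wherever adjacent frame embeddings diverge, so each segment stays short but semantically dense.

```python
import numpy as np

def key_segments(frame_embeddings: np.ndarray, sim_threshold: float = 0.85):
    """Partition a video into segments by cutting where adjacent frame
    embeddings stop being similar (cosine similarity below threshold)."""
    # Normalize rows so dot products are cosine similarities.
    emb = frame_embeddings / np.linalg.norm(frame_embeddings, axis=1, keepdims=True)
    segments, start = [], 0
    for i in range(1, len(emb)):
        if float(emb[i] @ emb[i - 1]) < sim_threshold:  # content change detected
            segments.append((start, i))                 # close the current segment
            start = i
    segments.append((start, len(emb)))
    return segments  # list of (start_frame, end_frame) index pairs
```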
- Investigating Video Reasoning Capability of Large Language Models with Tropes in Movies [69.28082193942991]
This paper introduces a novel dataset, Tropes in Movies (TiM), designed as a testbed for exploring two critical yet previously overlooked video reasoning skills.
Utilizing tropes from movie storytelling, TiM evaluates the reasoning capabilities of state-of-the-art LLM-based approaches.
To address these deficiencies, we propose Face-Enhanced Viper of Role Interactions (FEVoRI) and Context Query Reduction (ConQueR).
arXiv Detail & Related papers (2024-06-16T12:58:31Z)