LongVideoAgent: Multi-Agent Reasoning with Long Videos
- URL: http://arxiv.org/abs/2512.20618v1
- Date: Tue, 23 Dec 2025 18:59:49 GMT
- Title: LongVideoAgent: Multi-Agent Reasoning with Long Videos
- Authors: Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen
- Abstract summary: We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines.
- Score: 69.28914905197426
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans with a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, which are episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
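To make the coordination pattern concrete, here is a minimal sketch of a master-agent loop under a step budget. All names (`call_llm`, `ground_segments`, `describe_clip`) and the plan format are hypothetical stand-ins; the paper's actual prompts, tool interfaces, and RL-trained policy are not specified in the abstract.

```python
# Minimal sketch of the master/grounding/vision coordination loop described
# in the abstract. All names below are hypothetical stand-ins, not the
# paper's actual implementation.

MAX_STEPS = 6  # the master agent plans under a fixed step budget

def call_llm(prompt: str) -> str:
    """Stub for the master LLM; replace with a real model call."""
    raise NotImplementedError

def ground_segments(question: str, subtitles: list[str]) -> list[tuple[float, float]]:
    """Stub grounding agent: returns (start, end) spans relevant to the question."""
    raise NotImplementedError

def describe_clip(span: tuple[float, float], query: str) -> str:
    """Stub vision agent: returns a textual observation for one clip."""
    raise NotImplementedError

def answer(question: str, subtitles: list[str]) -> str:
    observations: list[str] = []
    for step in range(MAX_STEPS):
        # The master agent decides the next action from the trajectory so far.
        plan = call_llm(
            f"Question: {question}\nObservations: {observations}\n"
            "Reply GROUND, LOOK <start>-<end>, or ANSWER <text>."
        )
        if plan.startswith("GROUND"):
            spans = ground_segments(question, subtitles)
            observations.append(f"candidate spans: {spans}")
        elif plan.startswith("LOOK"):
            start, end = map(float, plan.split()[1].split("-"))
            observations.append(describe_clip((start, end), question))
        else:  # ANSWER: terminate within the step limit
            return plan.removeprefix("ANSWER").strip()
    return call_llm(f"Best final answer to: {question}\nGiven: {observations}")
```

A loop like this also shows what the RL objective has to work with: the recorded trajectory of tool calls is what makes answers interpretable, and finishing within the step limit is the kind of efficiency the training reportedly encourages.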
Related papers
- LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding [106.23494088118571]
LongVideo-R1 is a multimodal large language model (MLLM) agent for efficient video context navigation. It infers the most informative video clip for subsequent processing. The LongVideo-R1 agent is fine-tuned from the Qwen-3-8B model through a two-stage paradigm.
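The summary does not say how the clip inference works; one plausible reading is a coarse-to-fine narrowing loop, sketched below under that assumption. `score_clip` is a hypothetical stand-in for an MLLM relevance call.

```python
# Hypothetical coarse-to-fine navigation in the spirit of "inferring the most
# informative clip": repeatedly split the current window and keep the half
# the model scores as more relevant. score_clip is a stand-in for an MLLM call.

def score_clip(question: str, start: float, end: float) -> float:
    """Stub: have the MLLM rate how likely [start, end] answers the question."""
    raise NotImplementedError

def navigate(question: str, duration: float, min_len: float = 60.0) -> tuple[float, float]:
    start, end = 0.0, duration
    while end - start > min_len:
        mid = (start + end) / 2
        # Keep whichever half the model judges more informative.
        if score_clip(question, start, mid) >= score_clip(question, mid, end):
            end = mid
        else:
            start = mid
    return start, end  # final clip handed to the answering model
```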
arXiv Detail & Related papers (2026-02-24T13:49:47Z)
- LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling [87.98096428508181]
LongVT is an end-to-end agentic framework that enables "Thinking with Long Videos" via interleaved Multimodal Chain-of-Tool-Thought. We exploit LMMs' inherent temporal grounding ability as a native video cropping tool to zoom in on a specific video clip and resample finer-grained video frames. Our training dataset consists of 247.9K samples for tool-integrated cold-start supervised fine-tuning, 1.6K samples for agentic reinforcement learning, and 15.4K samples for agentic reinforcement fine-tuning.
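A video cropping tool of this kind can be sketched with OpenCV: seek into the grounded span and resample it at a finer frame rate. The function name, signature, and default rate are illustrative assumptions, not LongVT's actual tool interface.

```python
# Sketch of a "video cropping" tool: given a grounded (start, end) span,
# resample that span at a finer frame rate so the model can re-inspect it.
import cv2

def crop_and_resample(path: str, start_s: float, end_s: float, fps: float = 2.0):
    """Return frames sampled every 1/fps seconds inside [start_s, end_s]."""
    cap = cv2.VideoCapture(path)
    frames = []
    t = start_s
    while t < end_s:
        cap.set(cv2.CAP_PROP_POS_MSEC, t * 1000.0)  # seek to timestamp
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
        t += 1.0 / fps
    cap.release()
    return frames
```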
arXiv Detail & Related papers (2025-11-25T19:22:48Z)
- CAViAR: Critic-Augmented Video Agentic Reasoning [90.48729440775223]
We ask: can perception capabilities be leveraged to perform more complex video reasoning? We develop a large language model agent given access to video modules as subagents or tools. We show that the combination of our agent and critic achieves strong performance across datasets.
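The summary leaves the agent-critic interaction unspecified; a common pattern it is consistent with is best-of-n reranking, sketched below with stub callables that are assumptions rather than CAViAR's actual components.

```python
# Sketch of critic-augmented selection: sample several reasoning trajectories
# from the agent, have a critic score each, keep the best. Both callables
# are hypothetical stubs.
from typing import Callable

def best_of_n(
    run_agent: Callable[[str], str],      # produces one reasoning trajectory
    critic: Callable[[str, str], float],  # scores (question, trajectory)
    question: str,
    n: int = 4,
) -> str:
    candidates = [run_agent(question) for _ in range(n)]
    # The critic reranks candidate trajectories; the highest score wins.
    return max(candidates, key=lambda traj: critic(question, traj))
```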
arXiv Detail & Related papers (2025-09-09T17:59:39Z)
- Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding [60.88843818016968]
Long-form video understanding presents significant challenges due to temporal-spatial complexity and the difficulty of question answering. We propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%.
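An agentic search strategy over segmented clips can be sketched as text queries against an index of per-clip captions. The embedding stub and index layout below are assumptions; DVD's actual toolset is not described in this summary.

```python
# Sketch of agentic search over segmented clips: embed per-clip captions once,
# then let the agent issue text queries against that index. embed() is a stub.
import numpy as np

def embed(texts: list[str]) -> np.ndarray:
    """Stub: return L2-normalized embeddings, shape (len(texts), d)."""
    raise NotImplementedError

class ClipIndex:
    def __init__(self, captions: list[str]):
        self.captions = captions
        self.vecs = embed(captions)  # one vector per video segment

    def search(self, query: str, k: int = 3) -> list[str]:
        q = embed([query])[0]
        scores = self.vecs @ q  # cosine similarity on normalized vectors
        top = np.argsort(-scores)[:k]
        return [self.captions[i] for i in top]
```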
arXiv Detail & Related papers (2025-05-23T16:37:36Z)
- ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation [49.1574468325115]
This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA). It combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks.
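Grounding validation with YOLO-World might look like the sketch below, using the Ultralytics wrapper: detect exactly the objects the chain of thought mentions and flag any that are missing. The checkpoint name, confidence threshold, and validation logic are illustrative, not ViQAgent's actual code.

```python
# Sketch of open-vocabulary grounding validation: check whether objects the
# LLM's chain of thought mentions are actually visible in a sampled frame.
from ultralytics import YOLOWorld

def validate_objects(frame_path: str, mentioned: list[str], conf: float = 0.3) -> dict:
    model = YOLOWorld("yolov8s-world.pt")   # open-vocabulary detector
    model.set_classes(mentioned)            # ground exactly the mentioned objects
    result = model.predict(frame_path, conf=conf, verbose=False)[0]
    detected = {result.names[int(c)] for c in result.boxes.cls}
    # Objects the reasoning relied on but the detector cannot find are flagged.
    return {obj: (obj in detected) for obj in mentioned}
```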
arXiv Detail & Related papers (2025-05-21T18:32:43Z)
- VideoMultiAgents: A Multi-Agent Framework for Video Question Answering [11.514596823413736]
Video Question Answering (VQA) inherently relies on multimodal reasoning. We introduce VideoMultiAgents, a framework that integrates specialized agents for vision, scene graph analysis, and text processing. Our approach is also supplemented with question-guided caption generation, which produces captions that highlight objects, actions, and temporal transitions.
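Question-guided captioning can be as simple as conditioning the caption prompt on the question; the wording below is an illustrative guess, not the framework's actual prompt.

```python
# Sketch of question-guided captioning: the caption prompt is conditioned on
# the question so the captioner highlights relevant objects, actions, and
# temporal transitions. The prompt wording is an illustrative assumption.
def question_guided_caption_prompt(question: str) -> str:
    return (
        "Describe this video clip for a question-answering system.\n"
        f"The question to be answered later is: {question}\n"
        "Focus on: (1) objects the question mentions, (2) actions and who "
        "performs them, (3) temporal transitions (what happens before/after)."
    )
```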
arXiv Detail & Related papers (2025-04-25T22:08:09Z)
- LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents [18.580893774836845]
Existing MLLMs encounter significant challenges in modeling the temporal context within long videos. We introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks.
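Multi-round dynamic collaboration can be sketched as agents exchanging answers across rounds and stopping early on consensus. The protocol below (simple unanimity, then majority) is an assumption; LVAgent's actual team selection and exchange scheme is richer.

```python
# Sketch of multi-round collaboration: each agent answers, sees the other
# agents' previous answers, and may revise in the next round; stop early on
# consensus. Agents are stubs supplied by the caller.
from collections import Counter
from typing import Callable

Agent = Callable[[str, list[str]], str]  # (question, peers' last answers) -> answer

def collaborate(agents: list[Agent], question: str, rounds: int = 3) -> str:
    answers: list[str] = []
    for _ in range(rounds):
        answers = [agent(question, answers) for agent in agents]
        top, votes = Counter(answers).most_common(1)[0]
        if votes == len(agents):  # unanimous: stop early
            return top
    return Counter(answers).most_common(1)[0][0]  # majority after final round
```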
arXiv Detail & Related papers (2025-03-13T09:35:09Z)