LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
- URL: http://arxiv.org/abs/2503.10200v4
- Date: Thu, 17 Jul 2025 03:55:18 GMT
- Title: LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents
- Authors: Boyu Chen, Zhengrong Yue, Siran Chen, Zikang Wang, Yang Liu, Peng Li, Yali Wang,
- Abstract summary: Existing MLLMs encounter significant challenges in modeling the temporal context within long videos. We introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks.
- Score: 18.580893774836845
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Existing MLLMs encounter significant challenges in modeling the temporal context within long videos. Currently, mainstream agent-based methods use external tools to assist a single MLLM in answering long video questions. Despite such tool-based support, a solitary MLLM still offers only a partial understanding of long videos, resulting in limited performance. To better address long video tasks, we introduce LVAgent, the first framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our method consists of four key steps: 1) Selection: we pre-select appropriate agents from the model library to form optimal agent teams based on different tasks. 2) Perception: we design an effective retrieval scheme for long videos that improves coverage of critical temporal segments while maintaining computational efficiency. 3) Action: agents answer long video questions and exchange their reasoning. 4) Reflection: we evaluate each agent's performance in each round of discussion and optimize the agent team for dynamic collaboration. The agents iteratively refine their answers through this multi-round dynamic collaboration. LVAgent is the first agent-system method that outperforms all closed-source models (such as GPT-4o) and open-source models (such as InternVL-2.5 and Qwen2-VL) on long video understanding tasks. LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks. Notably, LVAgent improves accuracy by 13.3% on LongVideoBench. Code is available at https://github.com/64327069/LVAgent.
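The four steps above can be read as an iterative loop over an agent team. The following is a minimal Python sketch of such a loop, assuming a simplified agent interface, an agreement-based reflection score, and a consensus stopping rule; none of these details are taken from the released code.

```python
# Minimal, self-contained sketch of a Selection / Perception / Action / Reflection
# loop in the spirit of the abstract. The Agent interface, the agreement-based
# reflection score, and the stopping rule are illustrative assumptions, not the
# authors' released implementation.
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Answer:
    agent: str
    choice: str      # e.g. "A"/"B"/"C"/"D" for multiple-choice long-video QA
    rationale: str   # reasoning shared with the rest of the team

# An "agent" is reduced to a callable: (question, segments, peer_rationales) -> Answer.
AgentFn = Callable[[str, List[str], List[str]], Answer]

def collaborate(question: str,
                segments: List[str],      # Perception: pre-retrieved key segments
                team: List[AgentFn],      # Selection: task-appropriate agent team
                max_rounds: int = 3) -> str:
    rationales: List[str] = []
    choice = ""
    for _ in range(max_rounds):
        # Action: every agent answers and shares its reasoning with the others.
        answers = [agent(question, segments, rationales) for agent in team]
        rationales = [a.rationale for a in answers]
        votes = Counter(a.choice for a in answers)
        choice, support = votes.most_common(1)[0]
        if support == len(answers):
            break  # full consensus; no further rounds needed
        # Reflection: drop one agent that disagrees with the majority
        # (a crude stand-in for the paper's per-round performance evaluation).
        if len(team) > 2:
            worst = min(range(len(team)), key=lambda i: answers[i].choice == choice)
            team = [a for i, a in enumerate(team) if i != worst]
    return choice
```

A real deployment would back each `AgentFn` with an MLLM and replace the pre-retrieved `segments` with the paper's long-video retrieval scheme.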
Related papers
- VideoDeepResearch: Long Video Understanding With Agentic Tool Using [62.266123635522405]
Long video understanding (LVU) presents a significant challenge for current multi-modal large language models (MLLMs). We introduce VideoDeepResearch, a novel agentic framework for long video understanding. For each LVU task, the system formulates a problem-solving strategy through reasoning, while selectively accessing and utilizing essential video content via tool use.
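As a rough illustration of that strategy-then-tools pattern, the sketch below shows an agent that plans first, calls tools to fetch only the evidence the plan asks for, and then answers; the planner and tool interfaces are hypothetical and not drawn from the VideoDeepResearch code.

```python
# Hypothetical sketch of a "reason, then selectively use tools" loop for long-video
# understanding. Planner and tools are plain callables supplied by the caller;
# nothing below reflects the actual VideoDeepResearch implementation.
from typing import Callable, Dict, List, Tuple

Plan = List[Tuple[str, dict]]            # ordered (tool_name, arguments) steps
Planner = Callable[[str], Plan]          # question -> plan
Tool = Callable[..., str]                # returns textual evidence

def answer_with_tools(question: str,
                      planner: Planner,
                      tools: Dict[str, Tool],
                      answerer: Callable[[str, List[str]], str]) -> str:
    """Formulate a strategy first, then gather only the evidence the plan asks for."""
    evidence: List[str] = []
    for tool_name, kwargs in planner(question):
        if tool_name not in tools:
            continue  # skip steps that reference unavailable tools
        evidence.append(tools[tool_name](**kwargs))
    return answerer(question, evidence)
```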
arXiv Detail & Related papers (2025-06-12T15:39:10Z) - ViQAgent: Zero-Shot Video Question Answering via Agent with Open-Vocabulary Grounding Validation [49.1574468325115]
This work presents an LLM-brained agent for zero-shot Video Question Answering (VideoQA). It combines a Chain-of-Thought framework with grounding reasoning alongside YOLO-World to enhance object tracking and alignment. This approach establishes a new state-of-the-art in VideoQA and Video Understanding, showing enhanced performance on the NExT-QA, iVQA, and ActivityNet-QA benchmarks.
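A grounding-validation step of this kind can be approximated by checking that the objects an answer relies on are actually detected in the frames it cites; the sketch below uses a generic detector callable in place of YOLO-World and is only an assumption about how such a check might look.

```python
# Illustrative grounding check: accept a candidate answer only if enough of the
# objects its reasoning mentions are detected in the cited frames. The detector is
# a generic callable standing in for an open-vocabulary model; this is not
# ViQAgent's actual validation logic.
from typing import Callable, List

Detector = Callable[[object, List[str]], List[str]]  # (frame, vocabulary) -> detected labels

def grounding_validated(answer_objects: List[str],
                        cited_frames: List[object],
                        detect: Detector,
                        min_ratio: float = 0.5) -> bool:
    """Return True if enough of the answer's objects are found in the cited frames."""
    if not answer_objects:
        return True  # nothing to ground
    found = set()
    for frame in cited_frames:
        found.update(detect(frame, answer_objects))
    return len(found & set(answer_objects)) / len(answer_objects) >= min_ratio
```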
arXiv Detail & Related papers (2025-05-21T18:32:43Z) - VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT [31.413204839972984]
We propose a specialized chain-of-thought (CoT) process tailored for long video analysis.
Our uncertainty-aware CoT effectively mitigates noise from external tools, leading to more reliable outputs.
We implement our approach in a system called VideoAgent2, which also includes additional modules such as general context acquisition and specialized tool design.
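One simple way to make a CoT step uncertainty-aware is to attach a confidence to every tool observation and let low-confidence evidence contribute less to the final decision; the weighting scheme below is an illustrative assumption, not the VideoAgent2 algorithm.

```python
# Illustrative uncertainty-aware aggregation of noisy tool outputs: each observation
# carries a confidence in [0, 1], and candidate answers are scored by the summed
# confidence of the evidence supporting them. This is an assumption about the general
# idea, not VideoAgent2's actual CoT procedure.
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Observation:
    supports: str      # the candidate answer this piece of evidence points to
    confidence: float  # tool-reported (or estimated) reliability in [0, 1]

def aggregate(observations: List[Observation], floor: float = 0.2) -> str:
    """Pick the answer with the most confidence-weighted support, ignoring very noisy evidence."""
    scores: Dict[str, float] = defaultdict(float)
    for obs in observations:
        if obs.confidence < floor:
            continue  # treat very uncertain tool outputs as noise
        scores[obs.supports] += obs.confidence
    return max(scores, key=scores.get) if scores else "unknown"
```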
arXiv Detail & Related papers (2025-04-06T13:03:34Z) - MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos [62.01402470874109]
We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks. It incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval. It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios. We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark.
arXiv Detail & Related papers (2025-02-18T05:50:23Z) - InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling. We develop a new version, InternVideo2.5, with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos. Experimental results demonstrate that this unique LRC design greatly improves the results of video MLLMs on mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z) - AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents [52.13695464678006]
This study enhances an LLM-based web agent by simply refining its observation and action space. AgentOccam surpasses the previous state-of-the-art and concurrent work by 9.8 (+29.4%) and 5.9 (+15.8%) absolute points, respectively.
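The "refine the observation space" idea amounts to stripping an agent's raw observation down to the elements it can act on before prompting the LLM; the pruning rules below are a hypothetical illustration, not AgentOccam's actual pipeline.

```python
# Illustrative observation pruning for a web agent: keep only interactive elements
# and short visible text, and cap the total length handed to the LLM. The rules are
# hypothetical and are not AgentOccam's actual observation-space design.
from typing import Dict, List

INTERACTIVE_TAGS = {"a", "button", "input", "select", "textarea"}

def refine_observation(elements: List[Dict[str, str]], max_items: int = 50) -> str:
    """elements: flat list of {'tag': ..., 'text': ...} parsed from the page."""
    kept: List[str] = []
    for el in elements:
        tag, text = el.get("tag", ""), el.get("text", "").strip()
        if tag in INTERACTIVE_TAGS or (text and len(text) < 200):
            kept.append(f"[{tag}] {text}")
        if len(kept) >= max_items:
            break
    return "\n".join(kept)
```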
arXiv Detail & Related papers (2024-10-17T17:50:38Z) - Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [118.08008540513596]
Video-MME is the first-ever full-spectrum, Multi-Modal Evaluation benchmark of MLLMs in Video analysis.
We extensively evaluate various state-of-the-art MLLMs, including GPT-4 series and Gemini 1.5 Pro, as well as open-source image models.
Our experiments reveal that Gemini 1.5 Pro is the best-performing commercial model, significantly outperforming the open-source models.
arXiv Detail & Related papers (2024-05-31T17:59:47Z) - Enhancing the General Agent Capabilities of Low-Parameter LLMs through Tuning and Multi-Branch Reasoning [56.82041895921434]
Open-source pre-trained Large Language Models (LLMs) exhibit strong language understanding and generation capabilities.
When used as agents for dealing with complex problems in the real world, their performance is far inferior to large commercial models such as ChatGPT and GPT-4.
arXiv Detail & Related papers (2024-03-29T03:48:12Z) - VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding [28.316828641898375]
VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video.
2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task.
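A structured memory of that kind can be pictured as two indexed stores, one for temporal event descriptions and one for object tracks, which the agent's tools query by time or by object; the schema below is a minimal assumption for illustration, not the paper's implementation.

```python
# Minimal sketch of a structured video memory with two stores: time-stamped event
# descriptions and object-centric tracking states. The schema and query methods are
# illustrative assumptions, not VideoAgent's released data structures.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class Event:
    start: float        # seconds
    end: float
    description: str    # generic temporal event description

@dataclass
class VideoMemory:
    events: List[Event] = field(default_factory=list)
    # object id -> list of (timestamp, bounding box) tracking states
    tracks: Dict[str, List[Tuple[float, Tuple[int, int, int, int]]]] = field(default_factory=dict)

    def events_between(self, start: float, end: float) -> List[Event]:
        """Temporal query, as a video-segment-localization tool might use."""
        return [e for e in self.events if e.end >= start and e.start <= end]

    def where_is(self, obj_id: str, t: float, tol: float = 1.0) -> List[Tuple[int, int, int, int]]:
        """Object-memory query: boxes observed for obj_id near time t."""
        return [box for ts, box in self.tracks.get(obj_id, []) if abs(ts - t) <= tol]
```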
arXiv Detail & Related papers (2024-03-18T05:07:59Z) - Large Language Model based Multi-Agents: A Survey of Progress and Challenges [44.92286030322281]
Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks.
Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation.
arXiv Detail & Related papers (2024-01-21T23:36:14Z) - MVBench: A Comprehensive Multi-modal Video Understanding Benchmark [63.14000659130736]
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
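Converting existing annotations into multiple-choice QA can be done by templating a question per task, keeping the clip's label as the correct option, and sampling distractors from other clips' labels; the converter below is a simplified assumption about that process, not MVBench's generation code.

```python
# Simplified sketch of turning public video annotations into multiple-choice QA.
# This is an illustrative assumption, not MVBench's actual conversion pipeline.
import random
from typing import Dict, List

def to_multiple_choice(annotation: Dict[str, str],
                       label_pool: List[str],
                       question_template: str = "What action is taking place in the video?",
                       num_options: int = 4,
                       seed: int = 0) -> Dict[str, object]:
    rng = random.Random(seed)
    answer = annotation["label"]
    # Distractors come from other clips' labels; the correct label is shuffled in.
    distractors = rng.sample([l for l in label_pool if l != answer], num_options - 1)
    options = distractors + [answer]
    rng.shuffle(options)
    return {
        "video": annotation["video"],
        "question": question_template,
        "options": options,
        "answer_index": options.index(answer),
    }
```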
arXiv Detail & Related papers (2023-11-28T17:59:04Z) - Recommender AI Agent: Integrating Large Language Models for Interactive Recommendations [53.76682562935373]
We introduce an efficient framework called InteRecAgent, which employs LLMs as the brain and recommender models as tools.
InteRecAgent achieves satisfying performance as a conversational recommender system, outperforming general-purpose LLMs.
arXiv Detail & Related papers (2023-08-31T07:36:44Z)
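The "LLM as brain, recommenders as tools" pattern can be sketched as a dispatcher in which the language model chooses a tool and its arguments and the chosen recommender returns candidate items; the interfaces below are assumptions for illustration, not the InteRecAgent API.

```python
# Hypothetical sketch of the "LLM as brain, recommender models as tools" pattern:
# the LLM routes the conversation to a recommender tool, the tool returns candidate
# items, and the LLM phrases the reply. These interfaces are illustrative
# assumptions, not the InteRecAgent implementation.
from typing import Callable, Dict, List, Tuple

RecommenderTool = Callable[[dict], List[str]]                # query args -> ranked item ids
Brain = Callable[[List[str], List[str]], Tuple[str, dict]]   # (dialogue, tool names) -> (tool, args)

def converse_once(dialogue: List[str],
                  brain: Brain,
                  tools: Dict[str, RecommenderTool],
                  respond: Callable[[List[str], List[str]], str]) -> str:
    """One conversational turn: the brain routes to a recommender, then verbalizes results."""
    tool_name, args = brain(dialogue, list(tools))
    items = tools.get(tool_name, lambda _a: [])(args)  # fall back to no items if routing fails
    return respond(dialogue, items)
```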