Related papers: VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT

URL: http://arxiv.org/abs/2504.04471v1
Date: Sun, 06 Apr 2025 13:03:34 GMT
Title: VideoAgent2: Enhancing the LLM-Based Agent System for Long-Form Video Understanding by Uncertainty-Aware CoT
Authors: Zhuo Zhi, Qiangqiang Wu, Minghe Shen, Wenbo Li, Yinchuan Li, Kun Shao, Kaiwen Zhou,
Abstract summary: We propose a specialized chain-of-thought (CoT) process tailored for long video analysis. Our uncertainty-aware CoT effectively mitigates noise from external tools, leading to more reliable outputs. We implement our approach in a system called VideoAgent2, which also includes additional modules such as general context acquisition and specialized tool design.
Score: 31.413204839972984
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Long video understanding has emerged as an increasingly important yet challenging task in computer vision. Agent-based approaches are gaining popularity for processing long videos, as they can handle extended sequences and integrate various tools to capture fine-grained information. However, existing methods still face several challenges: (1) they often rely solely on the reasoning ability of large language models (LLMs) without dedicated mechanisms to enhance reasoning in long video scenarios; and (2) they remain vulnerable to errors or noise from external tools. To address these issues, we propose a specialized chain-of-thought (CoT) process tailored for long video analysis. Our proposed CoT with plan-adjust mode enables the LLM to incrementally plan and adapt its information-gathering strategy. We further incorporate heuristic uncertainty estimation of both the LLM and external tools to guide the CoT process. This allows the LLM to assess the reliability of newly collected information, refine its collection strategy, and make more robust decisions when synthesizing final answers. Empirical experiments show that our uncertainty-aware CoT effectively mitigates noise from external tools, leading to more reliable outputs. We implement our approach in a system called VideoAgent2, which also includes additional modules such as general context acquisition and specialized tool design. Evaluation on three dedicated long video benchmarks (and their subsets) demonstrates that VideoAgent2 outperforms the previous state-of-the-art agent-based method, VideoAgent, by an average of 13.1% and achieves leading performance among all zero-shot approaches.
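The abstract describes a plan-adjust loop in which uncertainty estimates decide whether newly gathered information is trusted. The sketch below is only an illustration of that general idea, not the authors' implementation: the `Evidence` type, the `run_tool` callback, the confidence threshold, and the "(refined)" re-query are all hypothetical names and choices introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Evidence:
    text: str
    confidence: float  # hypothetical tool-side confidence in [0, 1]


def plan_adjust_loop(
    plan: List[str],
    run_tool: Callable[[str], Evidence],
    threshold: float = 0.5,
    max_steps: int = 5,
) -> List[Evidence]:
    """Run plan steps one at a time, keeping only confident evidence.

    When a tool result falls below the threshold, the step is put back
    at the front of the queue in a refined form (the "adjust" part).
    """
    kept: List[Evidence] = []
    queue = list(plan)
    steps = 0
    while queue and steps < max_steps:
        step = queue.pop(0)
        evidence = run_tool(step)
        steps += 1
        if evidence.confidence >= threshold:
            kept.append(evidence)
        else:
            # Assumption: refinement is just a reworded query here.
            queue.insert(0, step + " (refined)")
    return kept


def fake_tool(step: str) -> Evidence:
    # Stand-in tool: the first attempt at "find cat" is noisy.
    noisy = step == "find cat"
    return Evidence(text=step, confidence=0.2 if noisy else 0.9)
```

With `fake_tool`, calling `plan_adjust_loop(["find cat", "count people"], fake_tool)` discards the low-confidence first result, retries the refined query, and then continues with the rest of the plan. The `max_steps` cap is a design choice in this sketch to keep repeated low-confidence retries from looping forever.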

Related papers

Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models. Our framework incorporates two complementary strategies: internal TTC and external TTC. We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
LVAgent: Long Video Understanding by Multi-Round Dynamical Collaboration of MLLM Agents [18.580893774836845]
We introduce LVAgent, a framework enabling multi-round dynamic collaboration of MLLM agents in long video understanding. Our methodology consists of four key steps: Selection, Perception, Action and Reflection. Our LVAgent achieves an accuracy of 80% on four mainstream long video understanding tasks.
arXiv Detail & Related papers (2025-03-13T09:35:09Z)
Seeing and Reasoning with Confidence: Supercharging Multimodal LLMs with an Uncertainty-Aware Agentic Framework [23.42251949130555]
Multimodal large language models (MLLMs) show promise in tasks like visual question answering (VQA). Recent works adapt agentic frameworks or chain-of-thought (CoT) reasoning to improve performance. We propose Seeing and Reasoning with Confidence (SRICE), a training-free multimodal reasoning framework.
arXiv Detail & Related papers (2025-03-11T11:18:53Z)
Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo captures high-level cognitive signals in the representation space, guiding when to invoke tools. Our experiments show that MeCo accurately detects LLMs' internal cognitive signals and significantly improves tool-use decision-making.
arXiv Detail & Related papers (2025-02-18T15:45:01Z)
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content. We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection [61.54044967253421]
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence. Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o. We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer [14.503628667535425]
Processing extensive videos presents significant challenges due to the vast data and processing demands. We develop OmAgent, which efficiently stores and retrieves relevant video frames for specific queries. It features a Divide-and-Conquer Loop capable of autonomous reasoning. We have endowed it with greater autonomy and a robust tool-calling system, enabling it to accomplish even more intricate tasks.
arXiv Detail & Related papers (2024-06-24T13:05:39Z)
How Good is my Video LMM? Complex Video Reasoning and Robustness Evaluation Suite for Video-LMMs [98.37571997794072]
We present the Complex Video Reasoning and Robustness Evaluation Suite (CVRR-ES). CVRR-ES comprehensively assesses the performance of Video-LMMs across 11 diverse real-world video dimensions. Our findings provide valuable insights for building the next generation of human-centric AI systems.
arXiv Detail & Related papers (2024-05-06T17:59:45Z)
VURF: A General-purpose Reasoning and Self-refinement Framework for Video Understanding [65.12464615430036]
This paper introduces a Video Understanding and Reasoning Framework (VURF) based on the reasoning power of Large Language Models (LLMs). Ours is a novel approach to extend the utility of LLMs in the context of video tasks, leveraging their capacity to generalize from minimal input and output demonstrations within a contextual framework.
arXiv Detail & Related papers (2024-03-21T18:00:00Z)
VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding [28.316828641898375]
VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video. 2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task.
arXiv Detail & Related papers (2024-03-18T05:07:59Z)
VideoAgent: Long-form Video Understanding with Large Language Model as Agent [26.903040507914053]
We introduce a novel agent-based system, VideoAgent, that employs a large language model as a central agent to identify and compile crucial information to answer a question. We demonstrate superior effectiveness and efficiency of our method over the current state-of-the-art methods.
arXiv Detail & Related papers (2024-03-15T17:57:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.