EEA: Exploration-Exploitation Agent for Long Video Understanding
- URL: http://arxiv.org/abs/2512.03500v1
- Date: Wed, 03 Dec 2025 06:48:36 GMT
- Title: EEA: Exploration-Exploitation Agent for Long Video Understanding
- Authors: Te Yang, Xiangyu Zhu, Bo Wang, Quan Chen, Peng Jiang, Zhen Lei
- Abstract summary: Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches either suffer from severe computational overhead due to dense preprocessing or fail to effectively balance exploration and exploitation. We introduce EEA, a novel video agent framework that achieves exploration-exploitation balance through semantic guidance.
- Score: 24.45791994592314
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-form video understanding requires efficient navigation of extensive visual data to pinpoint sparse yet critical information. Current approaches to long-form video understanding either suffer from severe computational overhead due to dense preprocessing, or fail to effectively balance exploration and exploitation, resulting in incomplete information coverage and inefficiency. In this work, we introduce EEA, a novel video agent framework that achieves exploration-exploitation balance through semantic guidance within a hierarchical tree search process. EEA autonomously discovers and dynamically updates task-relevant semantic queries, and collects video frames closely matched to these queries as semantic anchors. During the tree search, instead of uniform expansion, EEA preferentially explores semantically relevant frames while ensuring sufficient coverage of unknown segments. Moreover, EEA adaptively combines intrinsic rewards from vision-language models (VLMs) with semantic priors by explicitly modeling uncertainty, achieving stable and precise evaluation of video segments. Experiments across various long-video benchmarks validate the superior performance and computational efficiency of our proposed method.
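To make the three mechanisms in the abstract concrete, here is a minimal Python sketch of one way such a loop could work. Everything in it is an illustrative assumption rather than the paper's implementation: the function names (`semantic_score`, `segment_value`, `expand_segment`), the 1/(1+u) uncertainty weighting, and the exploit/explore split ratio are all hypothetical.

```python
import math
import random

# Illustrative sketch only: function names, the 1/(1+u) weighting, and the
# exploit/explore split are assumptions, not EEA's actual implementation.

def semantic_score(frame_emb, query_embs):
    """Relevance of a frame to the current semantic queries (max cosine sim)."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb + 1e-8)
    return max(cos(frame_emb, q) for q in query_embs)

def segment_value(vlm_reward, semantic_prior, uncertainty):
    """Fuse the VLM's intrinsic reward with the semantic prior: the more
    uncertain the VLM estimate, the more weight shifts to the prior."""
    w = 1.0 / (1.0 + uncertainty)  # assumed form of the uncertainty weighting
    return w * vlm_reward + (1.0 - w) * semantic_prior

def expand_segment(frames, query_embs, n_children=4, explore_frac=0.25):
    """Pick child frames for the tree search: mostly near semantic anchors
    (exploitation), with a reserved fraction sampled uniformly from the
    remaining frames to keep coverage of unknown segments (exploration)."""
    ranked = sorted(frames, key=lambda f: -semantic_score(f["emb"], query_embs))
    n_exploit = max(1, round(n_children * (1.0 - explore_frac)))
    exploit = ranked[:n_exploit]
    rest = ranked[n_exploit:]
    explore = random.sample(rest, min(n_children - n_exploit, len(rest)))
    return exploit + explore
```

A full agent would presumably wrap `expand_segment` in a best-first loop keyed on `segment_value`, updating the query set as new evidence arrives; the abstract does not specify these details, so the sketch stops here.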
Related papers
- Hierarchical Long Video Understanding with Audiovisual Entity Cohesion and Agentic Search [61.88597038104749]
We present HAVEN, a unified framework for long-video understanding that enables coherent and comprehensive reasoning. We preserve semantic consistency by integrating entity-level representations across visual and auditory streams. We employ an agentic search mechanism to enable dynamic retrieval and reasoning across these layers.
arXiv Detail & Related papers (2026-01-20T08:23:29Z) - SeViCES: Unifying Semantic-Visual Evidence Consensus for Long Video Understanding [36.30263540665245]
We propose a framework for effective and reliable long video understanding. SeViCES is training-free and model-agnostic, and introduces two key components. Experiments on long video understanding benchmarks show that SeViCES consistently outperforms state-of-the-art methods in both accuracy and robustness.
arXiv Detail & Related papers (2025-10-23T14:55:28Z) - Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs [54.502280390499756]
We propose TimeWarp, which creates a targeted synthetic temporal dataset for fine-tuning, encouraging the model to focus on the given input video. We demonstrate that when our method is applied to existing models, it significantly improves performance on temporal understanding benchmarks.
arXiv Detail & Related papers (2025-10-04T21:48:40Z) - AHA - Predicting What Matters Next: Online Highlight Detection Without Looking Ahead [4.55107996328448]
Aha is an autoregressive highlight detection framework that predicts the relevance of each video frame against a task described in natural language. Aha achieves state-of-the-art (SOTA) performance on highlight detection benchmarks. We explore Aha's potential for real-world robotics applications given a task-oriented natural language input and a continuous, robot-centric video.
arXiv Detail & Related papers (2025-09-19T21:03:00Z) - AdaVideoRAG: Omni-Contextual Adaptive Retrieval-Augmented Efficient Long Video Understanding [73.60257070465377]
AdaVideoRAG is a novel framework that adapts retrieval based on query complexity using a lightweight intent classifier. Our framework employs an Omni-Knowledge Indexing module to build hierarchical databases from text (captions, ASR, OCR), visual features, and semantic graphs. Experiments demonstrate improved efficiency and accuracy for long-video understanding, with seamless integration into existing MLLMs.
arXiv Detail & Related papers (2025-06-16T15:18:15Z) - VideoExplorer: Think With Videos For Agentic Long-Video Understanding [117.68219930263153]
Long-video understanding is a challenging problem in computer vision. We propose VideoExplorer, a framework grounded in the principle of "thinking with video". Rather than reasoning over a static context, VideoExplorer iteratively formulates sub-questions, locates relevant moments, and performs task-oriented, temporally scalable video understanding.
arXiv Detail & Related papers (2025-06-12T15:39:10Z) - Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding [60.88843818016968]
Long-form video understanding presents significant challenges due to its temporal-spatial complexity and the difficulty of question answering. We propose the Deep Video Discovery (DVD) agent to leverage an agentic search strategy over segmented video clips. Our DVD agent achieves state-of-the-art performance on the challenging LVBench dataset, reaching an accuracy of 74.2%.
arXiv Detail & Related papers (2025-05-23T16:37:36Z) - Query-centric Audio-Visual Cognition Network for Moment Retrieval, Segmentation and Step-Captioning [56.873534081386]
A new task, HIREST, is presented, comprising video retrieval, moment retrieval, moment segmentation, and step-captioning. We propose a query-centric audio-visual cognition network to construct a reliable multi-modal representation for the three tasks. This allows the network to capture user-preferred content and attain a query-centric audio-visual representation across all three tasks.
arXiv Detail & Related papers (2024-12-18T06:43:06Z)