VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
- URL: http://arxiv.org/abs/2503.13444v2
- Date: Tue, 01 Apr 2025 03:49:08 GMT
- Title: VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
- Authors: Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou
- Abstract summary: VideoMind is a novel video-language agent designed for temporal-grounded video understanding. We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow. We propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in reasoning capabilities within Large Language Models, multi-modal reasoning - especially for videos - remains underexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporal-grounded video understanding. VideoMind incorporates two key innovations: (i) We identify essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating different roles, a grounder for temporal localization, a verifier to assess temporal interval accuracy, and an answerer for question-answering. (ii) To efficiently integrate these diverse roles, we propose a novel Chain-of-LoRA strategy, enabling seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks, including 3 on grounded video question-answering (Grounded VideoQA), 6 on video temporal grounding (VTG), and 5 on general video question-answering (VideoQA), verify that our agent achieves state-of-the-art performance on diverse video understanding tasks, underscoring its effectiveness in advancing video agents and long-form temporal reasoning.
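The Chain-of-LoRA idea described above can be illustrated with a toy sketch: one frozen base weight matrix is shared by all roles, and switching roles swaps in a small low-rank delta (A @ B) instead of loading four separate models. All names and dimensions below are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch of Chain-of-LoRA role-switching (illustrative, not the
# authors' code): one shared frozen base layer, one lightweight low-rank
# adaptor per role, and a "role switch" that just changes which adaptor
# is added to the base weights.
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                        # hidden size and LoRA rank (toy values)
W_base = rng.normal(size=(d, d))   # frozen base weights, shared by all roles

# One low-rank adaptor (A, B) per role in the agentic workflow.
roles = ["planner", "grounder", "verifier", "answerer"]
adaptors = {role: (rng.normal(size=(d, r)), rng.normal(size=(r, d)))
            for role in roles}

def forward(x, role):
    """Apply the base layer plus the active role's low-rank update."""
    A, B = adaptors[role]
    return x @ (W_base + A @ B)    # role switch = swap which (A, B) is added

x = rng.normal(size=(1, d))
# Chain the roles: every step reuses the same base model, only the
# adaptor changes, so memory overhead stays close to a single model.
for role in roles:
    x = np.tanh(forward(x, role))
print(x.shape)  # (1, 8)
```

Because each adaptor stores only d*r + r*d parameters rather than d*d, the per-role cost here is 32 values versus 64 for a full weight copy; the gap widens rapidly at realistic model sizes.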
Related papers
- A Novel Trustworthy Video Summarization Algorithm Through a Mixture of LoRA Experts
Video-LLaMA is an effective tool for generating video summaries, but it cannot effectively unify and optimize the modeling of temporal and spatial features. We propose MiLoRA-ViSum to more efficiently capture the complex temporal dynamics and spatial relationships inherent in video data. MiLoRA-ViSum achieves the best summarization performance compared to state-of-the-art models, while maintaining significantly lower computational costs.
arXiv Detail & Related papers (2025-03-08T05:20:52Z)
- MomentSeeker: A Comprehensive Benchmark and A Strong Baseline For Moment Retrieval Within Long Videos
We present MomentSeeker, a benchmark to evaluate retrieval models' performance in handling general long-video moment retrieval tasks. It incorporates long videos of over 500 seconds on average, making it the first benchmark specialized for long-video moment retrieval. It covers a wide range of task categories (including Moment Search, Caption Alignment, Image-conditioned Moment Search, and Video-conditioned Moment Search) and diverse application scenarios. We further fine-tune an MLLM-based LVMR retriever on synthetic data, which demonstrates strong performance on our benchmark.
arXiv Detail & Related papers (2025-02-18T05:50:23Z)
- VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding.
However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details.
We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
- Perceive, Query & Reason: Enhancing Video QA with Question-Guided Temporal Queries
Video Question Answering (Video QA) is a challenging video understanding task that requires models to comprehend entire videos.
Recent advancements in Multimodal Large Language Models (MLLMs) have transformed video QA by leveraging their exceptional commonsense reasoning capabilities.
We propose T-Former, a novel temporal modeling method that creates a question-guided temporal bridge between frame-wise visual perception and the reasoning capabilities of LLMs.
arXiv Detail & Related papers (2024-12-26T17:53:14Z)
- STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training
Video Large Language Models (Video-LLMs) have recently shown strong performance on basic video understanding tasks. However, Video-LLMs struggle with compositional reasoning that requires multi-step explicit spatio-temporal inference across object relations, interactions, and events. We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich fine-tuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z)
- VideoEspresso: A Large-Scale Chain-of-Thought Dataset for Fine-Grained Video Reasoning via Core Frame Selection
We introduce VideoEspresso, a novel dataset that features VideoQA pairs preserving essential spatial details and temporal coherence.
Our construction pipeline employs a semantic-aware method to reduce redundancy, followed by generating QA pairs using GPT-4o.
We propose a Hybrid LVLMs Collaboration framework, featuring a Frame Selector and a two-stage instruction fine-tuned reasoning LVLM.
arXiv Detail & Related papers (2024-11-22T08:33:36Z)
- Motion-Grounded Video Reasoning: Understanding and Perceiving Motion at Pixel Level
Motion-Grounded Video Reasoning is a new motion understanding task that requires visual answers (video segmentation masks) according to the input question.
This task extends existing grounding work on explicit action/motion grounding to a more general format by enabling implicit reasoning via questions.
We introduce a novel baseline model named Motion-Grounded Video Reasoning Assistant (MORA)
arXiv Detail & Related papers (2024-11-15T03:45:09Z)
- Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering
HeurVidQA is a framework that leverages domain-specific entity-actions to refine pre-trained video-language foundation models.
Our approach treats these models as implicit knowledge engines, employing domain-specific entity-action prompters to direct the model's focus toward precise cues that enhance reasoning.
arXiv Detail & Related papers (2024-10-12T06:22:23Z)
- Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos
This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos.
We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence.
We then propose a novel architecture, termed as Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models.
arXiv Detail & Related papers (2024-08-26T17:58:47Z)
- VideoAgent: A Memory-augmented Multimodal Agent for Video Understanding
VideoAgent: 1) constructs a structured memory to store both the generic temporal event descriptions and object-centric tracking states of the video.
2) given an input task query, it employs tools including video segment localization and object memory querying along with other visual foundation models to interactively solve the task.
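The two memory components described above can be sketched as a small data structure with tool-style queries over it. This is an illustrative assumption about the shapes involved (event tuples, per-object tracking states), not the VideoAgent implementation.

```python
# Illustrative sketch (not the authors' code) of a structured video memory:
# temporal event descriptions plus object-centric tracking states, with two
# tool-style queries - segment localization and object memory querying.
from dataclasses import dataclass, field

@dataclass
class VideoMemory:
    events: list = field(default_factory=list)    # (start_s, end_s, caption)
    objects: dict = field(default_factory=dict)   # obj_id -> list of (t, bbox)

    def add_event(self, start, end, caption):
        self.events.append((start, end, caption))

    def track(self, obj_id, t, bbox):
        self.objects.setdefault(obj_id, []).append((t, bbox))

    # Tool 1: video segment localization by keyword over event captions.
    def localize(self, keyword):
        return [(s, e) for s, e, c in self.events if keyword in c]

    # Tool 2: object memory querying - nearest known state to time t.
    def query_object(self, obj_id, t):
        states = self.objects.get(obj_id, [])
        return min(states, key=lambda st: abs(st[0] - t), default=None)

mem = VideoMemory()
mem.add_event(0, 12, "a person opens the fridge")
mem.add_event(12, 30, "the person pours milk into a cup")
mem.track("cup", 15.0, (40, 60, 80, 100))

print(mem.localize("milk"))         # [(12, 30)]
print(mem.query_object("cup", 14))  # (15.0, (40, 60, 80, 100))
```

In this framing, an agent answering "when is milk poured?" would call `localize` rather than re-reading frames, which is what makes long videos tractable.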
arXiv Detail & Related papers (2024-03-18T05:07:59Z)
- MVBench: A Comprehensive Multi-modal Video Understanding Benchmark
We introduce a comprehensive Multi-modal Video understanding Benchmark, namely MVBench.
We first introduce a novel static-to-dynamic method to define these temporal-related tasks.
Then, guided by the task definition, we automatically convert public video annotations into multiple-choice QA to evaluate each task.
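The annotation-to-QA conversion described above can be sketched in a few lines. The annotation format and distractor strategy here are assumptions for illustration; MVBench's actual pipeline is task-specific.

```python
# Toy sketch of automatically converting an existing video annotation into a
# multiple-choice QA item (illustrative assumption, not the MVBench pipeline):
# the ground-truth label becomes the correct option, shuffled with distractors.
import random

def to_multiple_choice(annotation, distractors, seed=0):
    """Build one MCQ from a ground-truth answer plus distractor labels."""
    options = [annotation["answer"]] + list(distractors)
    random.Random(seed).shuffle(options)   # deterministic shuffle for replay
    letters = "ABCD"
    return {
        "question": annotation["question"],
        "options": {letters[i]: opt for i, opt in enumerate(options)},
        "correct": letters[options.index(annotation["answer"])],
    }

ann = {"question": "What does the person do after opening the box?",
       "answer": "takes out a book"}
qa = to_multiple_choice(ann, ["closes the box", "walks away", "sits down"])
print(qa["correct"], qa["options"][qa["correct"]])
```

Seeding the shuffle keeps the benchmark reproducible while still preventing the correct answer from always landing in the same slot.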
arXiv Detail & Related papers (2023-11-28T17:59:04Z)
- Video-based Person Re-identification with Long Short-Term Representation Learning
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapped cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the list (including all information) and is not responsible for any consequences.