DrVideo: Document Retrieval Based Long Video Understanding
- URL: http://arxiv.org/abs/2406.12846v1
- Date: Tue, 18 Jun 2024 17:59:03 GMT
- Title: DrVideo: Document Retrieval Based Long Video Understanding
- Authors: Ziyu Ma, Chenhui Gou, Hengcan Shi, Bin Sun, Shutao Li, Hamid Rezatofighi, Jianfei Cai
- Abstract summary: DrVideo is a document-retrieval-based system designed for long video understanding.
It transforms a long video into a text-based long document to retrieve key frames and augment the information of these frames.
It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and provide final predictions.
- Score: 44.34473173458403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing methods for long video understanding primarily focus on videos lasting only tens of seconds, with limited exploration of techniques for handling longer videos. The increased number of frames in longer videos presents two main challenges: difficulty in locating key information and performing long-range reasoning. Thus, we propose DrVideo, a document-retrieval-based system designed for long video understanding. Our key idea is to convert the long-video understanding problem into a long-document understanding task so as to effectively leverage the power of large language models. Specifically, DrVideo transforms a long video into a text-based long document to initially retrieve key frames and augment the information of these frames, using this as the system's starting point. It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and provide final predictions in a chain-of-thought manner once sufficient question-related information is gathered. Extensive experiments on long video benchmarks confirm the effectiveness of our method. DrVideo outperforms existing state-of-the-art methods with +3.8 accuracy on the EgoSchema benchmark (3 minutes), +17.9 in MovieChat-1K break mode, +38.0 in MovieChat-1K global mode (10 minutes), and +30.2 on the LLama-Vid QA dataset (over 60 minutes).
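Reading only the abstract, the pipeline can be summarized as four stages: caption sampled frames into a text-based long document, retrieve question-relevant key frames, augment those frames with denser descriptions, and run an agent-style loop that keeps searching for missing information until the evidence suffices, then answers with chain-of-thought reasoning. The sketch below illustrates that reading under stated assumptions; `FrameDoc`, `retrieve_topk`, `ask_llm`, and `augment` are illustrative placeholders, not the authors' actual components, and the word-overlap retriever is a toy stand-in for a real retrieval module.

```python
"""Minimal sketch of a DrVideo-style retrieve-augment-iterate loop.

All names here are illustrative placeholders inferred from the abstract,
not the paper's actual API: `ask_llm` and `augment` must be supplied by
the caller (e.g. an LLM client and a dense captioner / frame-level VQA).
"""
from dataclasses import dataclass
from typing import Callable


@dataclass
class FrameDoc:
    """One entry of the text-based 'long document' built from the video."""
    index: int          # frame position in the sampled video
    caption: str        # coarse caption produced for every sampled frame
    detail: str = ""    # denser, question-relevant description (filled on demand)


def retrieve_topk(query: str, docs: list[FrameDoc], k: int = 5) -> list[FrameDoc]:
    """Toy lexical retriever: rank frame captions by word overlap with the query."""
    q = set(query.lower().split())
    return sorted(docs, key=lambda d: -len(q & set(d.caption.lower().split())))[:k]


def answer_question(question: str,
                    docs: list[FrameDoc],
                    ask_llm: Callable[[str], str],
                    augment: Callable[[FrameDoc], str],
                    max_rounds: int = 3) -> str:
    """Retrieve key frames, augment them, and iterate until the LLM judges the
    gathered evidence sufficient, then answer in a chain-of-thought style."""
    keyframes = retrieve_topk(question, docs)

    for rnd in range(max_rounds + 1):
        for f in keyframes:
            if not f.detail:                      # augment newly added key frames only
                f.detail = augment(f)
        context = "\n".join(f"[frame {f.index}] {f.detail}" for f in keyframes)
        verdict = ask_llm(
            f"Question: {question}\nEvidence:\n{context}\n"
            "Is this enough to answer? Reply YES, or describe what is missing.")
        if verdict.strip().upper().startswith("YES") or rnd == max_rounds:
            break
        # Use the LLM's description of missing information as a new retrieval query.
        keyframes += [f for f in retrieve_topk(verdict, docs) if f not in keyframes]

    context = "\n".join(f"[frame {f.index}] {f.detail}" for f in keyframes)
    return ask_llm(f"Question: {question}\nEvidence:\n{context}\n"
                   "Reason step by step, then state the final answer.")
```

In this reading, the only video-specific components are the captioner and the augmentation model; everything downstream operates on text, which is what lets a standard LLM carry the long-range reasoning.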
Related papers
- Unleashing Hour-Scale Video Training for Long Video-Language Understanding [61.717205915329664]
We present VideoMarathon, a large-scale hour-long video instruction-following dataset.
This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video.
We propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling.
arXiv Detail & Related papers (2025-06-05T17:59:04Z) - Deep Video Discovery: Agentic Search with Tool Use for Long-form Video Understanding [63.82450803014141]
Long-form video understanding presents significant challenges due to extensive temporal-spatial complexity.
We propose the Deep Video Discovery agent to leverage an agentic search strategy over segmented video clips.
Our DVD agent achieves SOTA performance, surpassing prior works by a large margin on the challenging LVBench dataset.
arXiv Detail & Related papers (2025-05-23T16:37:36Z) - MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval [61.414236415351446]
We propose MomentSeeker, a novel benchmark for long-video moment retrieval (LVMR).
MomentSeeker is built from long and diverse videos, averaging over 1200 seconds in duration.
It covers a variety of real-world scenarios at three levels (global, event, and object), spanning common tasks such as action recognition, object localization, and causal reasoning.
arXiv Detail & Related papers (2025-02-18T05:50:23Z) - HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding [52.696422425058245]
We build HLV-1K, a large-scale benchmark of hour-long videos designed to evaluate long video understanding models.
HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question answering (MCQA) pairs.
We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks.
arXiv Detail & Related papers (2025-01-03T05:32:37Z) - VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding.
However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details.
We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z) - VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
Long-context video modeling is critical for multimodal large language models (MLLMs).
This paper aims to address this issue from aspects of model architecture, training data, training strategy and evaluation benchmark.
We build a powerful video MLLM named VideoChat-Flash, which shows a leading performance on both mainstream long and short video benchmarks.
arXiv Detail & Related papers (2024-12-31T18:01:23Z) - SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content.
We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context.
Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z) - Goldfish: Vision-Language Understanding of Arbitrarily Long Videos [51.547065479762715]
We present a methodology tailored for comprehending videos of arbitrary lengths.
We also introduce the TVQA-long benchmark, designed to evaluate models' capabilities in understanding long videos with questions in both vision and text content.
Our results indicate that our models achieve significant improvements in both long- and short-video understanding.
arXiv Detail & Related papers (2024-07-17T15:59:32Z) - Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
arXiv Detail & Related papers (2024-06-13T17:50:05Z) - LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z) - VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos [67.78336281317347]
Long-form video understanding has been a challenging task due to the high redundancy in video data.
We propose VideoTree, a training-free framework which builds a query-adaptive and hierarchical video representation.
Our experiments show that our framework improves both reasoning accuracy and efficiency compared to existing methods.
arXiv Detail & Related papers (2024-05-29T15:49:09Z) - Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z) - LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z) - LVCHAT: Facilitating Long Video Comprehension [25.395689904747965]
We propose Long Video Chat (LVChat) to enable multimodal large language models (MLLMs) to read videos.
LVChat significantly outperforms existing methods by up to 27% in accuracy on long-video QA datasets and long-video captioning benchmarks.
arXiv Detail & Related papers (2024-02-19T11:59:14Z) - Query-aware Long Video Localization and Relation Discrimination for Deep Video Understanding [15.697251303126874]
The Deep Video Understanding (DVU) Challenge aims to push the boundaries of multimodal extraction, fusion, and analytics.
This paper introduces a query-aware method for long video localization and relation discrimination, leveraging an image-language pretrained model.
Our approach achieved first and fourth positions for two groups of movie-level queries.
arXiv Detail & Related papers (2023-10-19T13:26:02Z)