Related papers: HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding

URL: http://arxiv.org/abs/2501.01645v1
Date: Fri, 03 Jan 2025 05:32:37 GMT
Title: HLV-1K: A Large-scale Hour-Long Video Benchmark for Time-Specific Long Video Understanding
Authors: Heqing Zou, Tianze Luo, Guiyang Xie, Victor, Zhang, Fengmao Lv, Guangcong Wang, Junyang Chen, Zhuochen Wang, Hansheng Zhang, Huaijian Zhang,
Abstract summary: We build a large-scale hour-long long video benchmark, HLV-1K, designed to evaluate long video understanding models.<n>HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question asnwering (MCQA)<n>We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks.
Score: 52.696422425058245
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal large language models have become a popular topic in deep visual understanding due to many promising real-world applications. However, hour-long video understanding, spanning over one hour and containing tens of thousands of visual frames, remains under-explored because of 1) challenging long-term video analyses, 2) inefficient large-model approaches, and 3) lack of large-scale benchmark datasets. Among them, in this paper, we focus on building a large-scale hour-long long video benchmark, HLV-1K, designed to evaluate long video understanding models. HLV-1K comprises 1009 hour-long videos with 14,847 high-quality question answering (QA) and multi-choice question asnwering (MCQA) pairs with time-aware query and diverse annotations, covering frame-level, within-event-level, cross-event-level, and long-term reasoning tasks. We evaluate our benchmark using existing state-of-the-art methods and demonstrate its value for testing deep long video understanding capabilities at different levels and for various tasks. This includes promoting future long video understanding tasks at a granular level, such as deep understanding of long live videos, meeting recordings, and movies.

Related papers

Unleashing Hour-Scale Video Training for Long Video-Language Understanding [61.717205915329664]
We present VideoMarathon, a large-scale hour-long video instruction-following dataset.<n>This dataset includes around 9,700 hours of long videos sourced from diverse domains, ranging from 3 to 60 minutes per video.<n>We propose Hour-LLaVA, a powerful and efficient Video-LMM for hour-scale video-language modeling.
arXiv Detail & Related papers (2025-06-05T17:59:04Z)
SAMA: Towards Multi-Turn Referential Grounded Video Chat with Large Language Models [80.3895950009792]
Achieving fine-grained-temporal understanding in videos remains a major challenge for current Video Large Multimodels (Video LMMs)<n>We contribute in three core aspects: dataset, model, and benchmark.<n>First, we introduce SAMA-239K, a large-scale dataset comprising 15K videos specifically to enable joint learning of video understanding, grounding, and multi-turn video chat.<n>Second, we propose the SAMA model, which incorporates a versatile-temporal context aggregator and a Segment Model to jointly enhance fine-grained video comprehension and precise grounding capabilities.
arXiv Detail & Related papers (2025-05-24T18:13:16Z)
MomentSeeker: A Task-Oriented Benchmark For Long-Video Moment Retrieval [61.414236415351446]
We propose MomentSeeker, a novel benchmark for long-video moment retrieval (LMVR)<n>MomentSeeker is created based on long and diverse videos, averaging over 1200 seconds in duration.<n>It covers a variety of real-world scenarios in three levels: global-level, event-level, object-level, covering common tasks like action recognition, object localization, and causal reasoning.
arXiv Detail & Related papers (2025-02-18T05:50:23Z)
VideoRefer Suite: Advancing Spatial-Temporal Object Understanding with Video LLM [81.15525024145697]
Video Large Language Models (Video LLMs) have recently exhibited remarkable capabilities in general video understanding. However, they mainly focus on holistic comprehension and struggle with capturing fine-grained spatial and temporal details. We introduce the VideoRefer Suite to empower Video LLM for finer-level spatial-temporal video understanding.
arXiv Detail & Related papers (2024-12-31T18:56:46Z)
VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
Long-context video modeling is critical for multimodal large language models (MLLMs) This paper aims to address this issue from aspects of model architecture, training data, training strategy and evaluation benchmark. We build a powerful video MLLM named VideoChat-Flash, which shows a leading performance on both mainstream long and short video benchmarks.
arXiv Detail & Related papers (2024-12-31T18:01:23Z)
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis [52.050036778325094]
We introduce SALOVA: Segment-Augmented Video Assistant, a novel video-LLM framework designed to enhance the comprehension of lengthy video content. We present a high-quality collection of 87.8K long videos, each densely captioned at the segment level to enable models to capture scene continuity and maintain rich context. Our framework mitigates the limitations of current video-LMMs by allowing for precise identification and retrieval of relevant video segments in response to queries.
arXiv Detail & Related papers (2024-11-25T08:04:47Z)
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding [41.9477837230283]
LongVideoBench is a question-answering benchmark that features video-language interleaved inputs up to an hour long. Our benchmark includes 3,763 varying-length web-collected videos with their subtitles across diverse themes. We formulate a novel video question-answering task termed referring reasoning.
arXiv Detail & Related papers (2024-07-22T16:00:55Z)
DrVideo: Document Retrieval Based Long Video Understanding [44.34473173458403]
DrVideo is a document-retrieval-based system designed for long video understanding. It first transforms a long video into a coarse text-based long document to retrieve key frames and then updates the documents with the augmented key frame information. It then employs an agent-based iterative loop to continuously search for missing information and augment the document until sufficient question-related information is gathered.
arXiv Detail & Related papers (2024-06-18T17:59:03Z)
LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding. Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos. Our approach outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding. We encode video representations that incorporate both local and global information. Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
MoVQA: A Benchmark of Versatile Question-Answering for Long-Form Movie Understanding [69.04413943858584]
We introduce MoVQA, a long-form movie question-answering dataset. We also benchmark to assess the diverse cognitive capabilities of multimodal systems.
arXiv Detail & Related papers (2023-12-08T03:33:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.