LongVLM: Efficient Long Video Understanding via Large Language Models
- URL: http://arxiv.org/abs/2404.03384v3
- Date: Sat, 20 Jul 2024 07:17:00 GMT
- Title: LongVLM: Efficient Long Video Understanding via Large Language Models
- Authors: Yuetian Weng, Mingfei Han, Haoyu He, Xiaojun Chang, Bohan Zhuang,
- Abstract summary: LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
- Score: 55.813206751150716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Empowered by Large Language Models (LLMs), recent advancements in Video-based LLMs (VideoLLMs) have driven progress in various video understanding tasks. These models encode video representations through pooling or query aggregation over a vast number of visual tokens, making computational and memory costs affordable. Despite successfully providing an overall comprehension of video content, existing VideoLLMs still face challenges in achieving detailed understanding due to overlooking local information in long-term videos. To tackle this challenge, we introduce LongVLM, a simple yet powerful VideoLLM for long video understanding, building upon the observation that long videos often consist of sequential key events, complex actions, and camera movements. Our approach proposes to decompose long videos into multiple short-term segments and encode local features for each segment via a hierarchical token merging module. These features are concatenated in temporal order to maintain the storyline across sequential short-term segments. Additionally, we propose to integrate global semantics into each local feature to enhance context understanding. In this way, we encode video representations that incorporate both local and global information, enabling the LLM to generate comprehensive responses for long-term videos. Experimental results on the VideoChatGPT benchmark and zero-shot video question-answering datasets demonstrate the superior capabilities of our model over the previous state-of-the-art methods. Qualitative examples show that our model produces more precise responses for long video understanding. Code is available at https://github.com/ziplab/LongVLM.
Related papers
- Enhancing Long Video Understanding via Hierarchical Event-Based Memory [9.800516656566774]
We propose a Hierarchical Event-based Memory-enhanced LLM (HEM-LLM) for better understanding of long videos.
Firstly, we design a novel adaptive sequence segmentation scheme to divide multiple events within long videos.
Secondly, while modeling current event, we compress and inject the information of the previous event to enhance the long-term inter-event dependencies in videos.
arXiv Detail & Related papers (2024-09-10T07:53:10Z) - SynopGround: A Large-Scale Dataset for Multi-Paragraph Video Grounding from TV Dramas and Synopses [58.488812405557]
Video grounding aims to localize specific natural language queries in an untrimmed video.
We present a large-scale video grounding dataset named SynopGround.
We introduce a more complex setting of video grounding dubbed Multi-Paragraph Video Grounding (MPVG)
arXiv Detail & Related papers (2024-08-03T05:35:13Z) - ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z) - LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z) - Encoding and Controlling Global Semantics for Long-form Video Question Answering [40.129800076300434]
We introduce a state space layer (SSL) into multi-modal Transformer to efficiently integrate global semantics of the video.
Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations.
To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks Ego-QA and MAD-QA featuring videos of considerably long length.
arXiv Detail & Related papers (2024-05-30T06:10:10Z) - MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding [66.56100008577134]
This study focuses on designing an efficient and effective model for long-term video understanding.
We propose to process videos in an online manner and store past video information in a memory bank.
Our model can achieve state-of-the-art performances across multiple datasets.
arXiv Detail & Related papers (2024-04-08T17:59:24Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the modeling of its dynamics video.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.