LVCHAT: Facilitating Long Video Comprehension
- URL: http://arxiv.org/abs/2402.12079v1
- Date: Mon, 19 Feb 2024 11:59:14 GMT
- Title: LVCHAT: Facilitating Long Video Comprehension
- Authors: Yu Wang, Zeyuan Zhang, Julian McAuley, Zexue He
- Abstract summary: We propose Long Video Chat (LVChat) to enable multimodal large language models (LLMs) to read videos.
LVChat significantly outperforms existing methods by up to 27% in accuracy on long-video QA datasets and long-video captioning benchmarks.
- Score: 25.395689904747965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Enabling large language models (LLMs) to read videos is vital for multimodal
LLMs. Existing works show promise on short videos whereas long video (longer
than, e.g., 1 minute) comprehension remains challenging. The major problem lies
in the over-compression of videos, i.e., the encoded video representations are
not enough to represent the whole video. To address this issue, we propose Long
Video Chat (LVChat), where Frame-Scalable Encoding (FSE) is introduced to
dynamically adjust the number of embeddings in alignment with the duration of
the video to ensure long videos are not overly compressed into a few
embeddings. To deal with long videos whose length is beyond videos seen during
training, we propose Interleaved Frame Encoding (IFE), repeating positional
embedding and interleaving multiple groups of videos to enable long video
input, avoiding performance degradation due to overly long videos. Experimental
results show that LVChat significantly outperforms existing methods by up to
27% in accuracy on long-video QA datasets and long-video captioning
benchmarks. Our code is published at https://github.com/wangyu-ustc/LVChat.
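As a rough illustration of the two encoding schemes described in the abstract, the following Python sketch splits a long video into fixed-size frame groups (Frame-Scalable Encoding) and, for videos longer than anything seen in training, into interleaved streams that reuse the same positional embeddings (Interleaved Frame Encoding). The encoder signature, group sizes, and stream construction here are assumptions made for illustration; the released repository above is the authoritative implementation.

```python
import math
import torch

def frame_scalable_encode(frames, encoder, frames_per_group=16):
    """Frame-Scalable Encoding, sketched: split a clip into fixed-size frame
    groups and encode each group separately, so the total number of video
    embeddings grows with the video's duration instead of being compressed
    into one fixed-size set.

    `frames` is a (T, C, H, W) tensor; `encoder` maps a frame group to a
    (tokens_per_group, D) embedding tensor (its signature is an assumption
    of this sketch, not LVChat's actual interface)."""
    num_groups = max(1, math.ceil(frames.size(0) / frames_per_group))
    return [encoder(g) for g in torch.chunk(frames, num_groups, dim=0)]

def interleaved_frame_encode(frames, encoder, max_trained_frames=64,
                             frames_per_group=16):
    """Interleaved Frame Encoding, sketched: when a video exceeds the length
    seen during training, split it into k interleaved streams
    (frames 0, k, 2k, ... / 1, k+1, ...), encode each stream as a
    normal-length video (reusing the same positional embeddings inside the
    encoder), and concatenate the resulting token groups."""
    k = max(1, math.ceil(frames.size(0) / max_trained_frames))
    streams = [frames[i::k] for i in range(k)]
    tokens = []
    for stream in streams:
        tokens.extend(frame_scalable_encode(stream, encoder, frames_per_group))
    return torch.cat(tokens, dim=0)  # video embeddings passed to the LLM
```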
Related papers
- LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding [65.46303012350207]
LongVU is an adaptive compression mechanism that reduces the number of video tokens while preserving visual details of long videos.
We leverage DINOv2 features to remove redundant frames that exhibit high similarity.
We perform spatial token reduction across frames based on their temporal dependencies.
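The frame-pruning step can be pictured with a minimal sketch: assuming per-frame features (e.g., DINOv2 embeddings) are already computed, drop any frame that is too similar to the last kept one. The threshold and function name are illustrative assumptions, not LongVU's actual code.

```python
import torch
import torch.nn.functional as F

def drop_redundant_frames(frame_feats: torch.Tensor, sim_threshold: float = 0.9):
    """Keep a frame only if its feature vector (e.g. a DINOv2 embedding) is
    sufficiently dissimilar from the last kept frame. `frame_feats` has shape
    (T, D); the threshold value is an illustrative assumption."""
    keep = [0]  # always keep the first frame
    for t in range(1, frame_feats.size(0)):
        sim = F.cosine_similarity(frame_feats[t], frame_feats[keep[-1]], dim=0)
        if sim < sim_threshold:
            keep.append(t)
    return keep  # indices of retained, non-redundant frames
```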
arXiv Detail & Related papers (2024-10-22T21:21:37Z)
- LVD-2M: A Long-take Video Dataset with Temporally Dense Captions [68.88624389174026]
We introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions.
Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality.
We curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions.
arXiv Detail & Related papers (2024-10-14T17:59:56Z)
- DrVideo: Document Retrieval Based Long Video Understanding [44.34473173458403]
DrVideo is a document-retrieval-based system designed for long video understanding.
It transforms a long video into a text-based long document to retrieve key frames and augment the information of these frames.
It then employs an agent-based iterative loop to continuously search for missing information, augment relevant data, and provide final predictions.
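A minimal sketch of such a retrieve-and-augment loop is given below; the `retrieve` and `llm_answer` callables and the "NEED_MORE:" convention are placeholders standing in for DrVideo's actual retriever and agent.

```python
from typing import Callable, List

def retrieval_qa_loop(question: str,
                      frame_docs: List[str],
                      retrieve: Callable[[str, List[str], int], List[int]],
                      llm_answer: Callable[[str, List[str]], str],
                      max_rounds: int = 3) -> str:
    """Iteratively retrieve frame-level text, query the LLM, and refine the
    search until the LLM stops asking for more context. All interfaces here
    are assumed for illustration, not DrVideo's actual API."""
    context: List[str] = []
    query = question
    for _ in range(max_rounds):
        hits = retrieve(query, frame_docs, 5)             # indices of top-5 frame documents
        context.extend(frame_docs[i] for i in hits)
        reply = llm_answer(question, context)
        if not reply.startswith("NEED_MORE:"):
            return reply                                  # enough information gathered
        query = reply.removeprefix("NEED_MORE:").strip()  # refine the retrieval query
    return llm_answer(question, context)
```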
arXiv Detail & Related papers (2024-06-18T17:59:03Z)
- LVBench: An Extreme Long Video Understanding Benchmark [38.839913137854104]
We introduce LVBench, a benchmark specifically designed for long video understanding.
Our dataset comprises publicly sourced videos and encompasses a diverse set of tasks aimed at long video comprehension and information extraction.
arXiv Detail & Related papers (2024-06-12T09:36:52Z)
- Encoding and Controlling Global Semantics for Long-form Video Question Answering [40.129800076300434]
We introduce a state space layer (SSL) into multi-modal Transformer to efficiently integrate global semantics of the video.
Our SSL includes a gating unit to enable controllability over the flow of global semantics into visual representations.
To rigorously evaluate long-form videoQA capacity, we construct two new benchmarks, Ego-QA and MAD-QA, featuring considerably longer videos.
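A much-simplified sketch of the gating idea (not the paper's actual state space layer): a pooled global summary of the video is injected into per-token visual features through a learned sigmoid gate.

```python
import torch
import torch.nn as nn

class GatedGlobalFusion(nn.Module):
    """Illustrative module: mix a global video summary into each visual token
    through a learned gate, so the model can modulate how much global
    semantics flows into local representations. Module and parameter names
    are assumptions of this sketch."""
    def __init__(self, dim: int):
        super().__init__()
        self.global_proj = nn.Linear(dim, dim)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        # visual_tokens: (batch, tokens, dim)
        global_state = self.global_proj(visual_tokens.mean(dim=1, keepdim=True))
        gate = self.gate(torch.cat(
            [visual_tokens, global_state.expand_as(visual_tokens)], dim=-1))
        return visual_tokens + gate * global_state
```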
arXiv Detail & Related papers (2024-05-30T06:10:10Z)
- Streaming Long Video Understanding with Large Language Models [83.11094441893435]
VideoStreaming is an advanced vision-language large model (VLLM) for video understanding.
It understands videos of arbitrary length using a constant number of video streaming tokens that are encoded and selected through propagation.
Our model achieves superior performance and higher efficiency on long video benchmarks.
arXiv Detail & Related papers (2024-05-25T02:22:09Z)
- Koala: Key frame-conditioned long video-LLM [70.52369588364992]
We propose a lightweight and self-supervised long video-LLM (Koala) to adapt pretrained vLLMs for generalizing to longer videos.
Our approach outperforms state-of-the-art large models by 3-6% in absolute accuracy across all tasks.
Surprisingly, we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition.
arXiv Detail & Related papers (2024-04-05T18:33:04Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound [103.28102473127748]
We introduce an audiovisual method for long-range text-to-video retrieval.
Our approach aims to retrieve minute-long videos that capture complex human actions.
Our method is 2.92x faster and 2.34x more memory-efficient than long-range video-only approaches.
arXiv Detail & Related papers (2022-04-06T14:43:42Z)