Retrieval-based Video Language Model for Efficient Long Video Question
Answering
- URL: http://arxiv.org/abs/2312.04931v1
- Date: Fri, 8 Dec 2023 09:48:36 GMT
- Title: Retrieval-based Video Language Model for Efficient Long Video Question
Answering
- Authors: Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu
- Abstract summary: We introduce a retrieval-based video language model (R-VLM) for efficient and interpretable long video QA.
Specifically, given a question (query) and a long video, our model identifies and selects the most relevant $K$ video chunks.
Our experimental results validate the effectiveness of our framework for comprehending long videos.
- Score: 39.474247695753725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable natural language understanding, reasoning, and generation
capabilities of large language models (LLMs) have made them attractive for
application to video question answering (Video QA) tasks, utilizing video
tokens as contextual input. However, employing LLMs for long video
understanding presents significant challenges and remains under-explored. The
extensive number of video tokens leads to considerable computational costs for
LLMs, while using aggregated tokens results in a loss of visual details. Moreover,
the presence of abundant question-irrelevant tokens introduces noise to the
video QA process. To address these issues, we introduce a simple yet effective
retrieval-based video language model (R-VLM) for efficient and interpretable
long video QA. Specifically, given a question (query) and a long video, our
model identifies and selects the most relevant $K$ video chunks and uses their
associated visual tokens to serve as context for the LLM inference. This
effectively reduces the number of video tokens, eliminates noise interference,
and enhances system performance. Our experimental results validate the
effectiveness of our framework for comprehending long videos. Furthermore,
based on the retrieved chunks, our model is interpretable, providing
justification for where the answers come from.
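The chunk selection step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity scoring, the function names (`cosine`, `select_top_k_chunks`, `build_context`), the toy embeddings, and the choice to return the selected chunks in temporal order are all assumptions made for illustration. It only shows the general idea of ranking chunks against the question, keeping the top $K$, and passing their visual tokens on as context.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def select_top_k_chunks(question_emb, chunk_embs, k):
    """Return the indices of the k chunks most similar to the question.

    Assumption: similarity is cosine similarity between one question embedding
    and one embedding per chunk. Indices are returned in temporal order.
    """
    scores = [cosine(question_emb, c) for c in chunk_embs]
    ranked = sorted(range(len(chunk_embs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])

def build_context(chunk_tokens, selected):
    """Concatenate the visual tokens of the selected chunks into one context list."""
    return [tok for i in selected for tok in chunk_tokens[i]]

# Toy data (hypothetical): 4 chunks, 2-dimensional embeddings, 2 tokens per chunk.
question_emb = [1.0, 0.0]
chunk_embs = [[0.0, 1.0], [0.9, 0.1], [-1.0, 0.0], [0.7, 0.7]]
chunk_tokens = [["c0_t0", "c0_t1"], ["c1_t0", "c1_t1"],
                ["c2_t0", "c2_t1"], ["c3_t0", "c3_t1"]]

selected = select_top_k_chunks(question_emb, chunk_embs, k=2)
context = build_context(chunk_tokens, selected)
print(selected)  # indices of the retrieved chunks
print(context)   # visual tokens that would be passed to the LLM
```

In this sketch the context holds only the tokens of $K$ chunks instead of the whole video, and the returned indices point back to the parts of the video the answer was drawn from.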
Related papers
- Free Video-LLM: Prompt-guided Visual Perception for Efficient Training-free Video LLMs [56.040198387038025]
We present a novel prompt-guided visual perception framework (abbreviated as Free Video-LLM) for efficient inference of training-free video LLMs.
Our method effectively reduces the number of visual tokens while maintaining high performance across multiple video question-answering benchmarks.
(2024-10-14T12:35:12Z)
- Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding [26.72068455284472]
Video-XL is an extra-long vision language model designed for efficient hour-scale video understanding.
Our model achieves promising results on popular long video understanding benchmarks.
(2024-09-22T15:13:31Z)
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
(2024-07-18T17:59:17Z)
- Needle In A Video Haystack: A Scalable Synthetic Evaluator for Video MLLMs [20.168429351519055]
Video understanding is a crucial next step for multimodal large language models (MLLMs).
We propose VideoNIAH (Video Needle In A Haystack), a benchmark construction framework through synthetic video generation.
We conduct a comprehensive evaluation of both proprietary and open-source models, uncovering significant differences in their video understanding capabilities.
(2024-06-13T17:50:05Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
(2024-04-04T11:33:29Z)
- LLMs Meet Long Video: Advancing Long Video Question Answering with An Interactive Visual Adapter in LLMs [22.696090318037925]
Long video understanding is a significant and ongoing challenge at the intersection of multimedia and artificial intelligence.
We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
(2024-02-21T05:56:52Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
(2023-12-29T01:56:17Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence.
(2023-05-22T17:51:22Z)