Retrieval-based Video Language Model for Efficient Long Video Question
Answering
- URL: http://arxiv.org/abs/2312.04931v1
- Date: Fri, 8 Dec 2023 09:48:36 GMT
- Title: Retrieval-based Video Language Model for Efficient Long Video Question
Answering
- Authors: Jiaqi Xu, Cuiling Lan, Wenxuan Xie, Xuejin Chen, Yan Lu
- Abstract summary: We introduce a retrieval-based video language model (R-VLM) for efficient and interpretable long video QA.
Specifically, given a question (query) and a long video, our model identifies and selects the most relevant $K$ video chunks.
Our experimental results validate the effectiveness of our framework for comprehending long videos.
- Score: 39.474247695753725
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The remarkable natural language understanding, reasoning, and generation
capabilities of large language models (LLMs) have made them attractive for
application to video question answering (Video QA) tasks, utilizing video
tokens as contextual input. However, employing LLMs for long video
understanding presents significant challenges and remains under-explored. The
extensive number of video tokens leads to considerable computational costs for
LLMs while using aggregated tokens results in loss of vision details. Moreover,
the presence of abundant question-irrelevant tokens introduces noise to the
video QA process. To address these issues, we introduce a simple yet effective
retrieval-based video language model (R-VLM) for efficient and interpretable
long video QA. Specifically, given a question (query) and a long video, our
model identifies and selects the most relevant $K$ video chunks and uses their
associated visual tokens to serve as context for the LLM inference. This
effectively reduces the number of video tokens, eliminates noise interference,
and enhances system performance. Our experimental results validate the
effectiveness of our framework for comprehending long videos. Furthermore,
based on the retrieved chunks, our model is interpretable, providing
justification for where the answers come from.
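The retrieval step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding model, chunk granularity, similarity measure (cosine here), and the name `top_k_chunks` are all assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def top_k_chunks(question_emb, chunk_embs, k):
    """Return indices of the k video chunks most similar to the
    question embedding, best match first."""
    sims = [cosine(question_emb, c) for c in chunk_embs]
    return sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]

# Toy embeddings (hypothetical 3-d features), one row per video chunk.
chunks = [
    [1.0, 0.0, 0.0],  # chunk 0: visually close to the question
    [0.0, 1.0, 0.0],  # chunk 1: unrelated
    [0.9, 0.1, 0.4],  # chunk 2: partially related
    [0.0, 0.0, 1.0],  # chunk 3: unrelated
]
question = [1.0, 0.0, 0.1]
print(top_k_chunks(question, chunks, k=2))  # → [0, 2]
```

Only the visual tokens of the selected chunks would then be passed to the LLM as context, which is what reduces the token count and filters question-irrelevant noise.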
Related papers
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z)
- VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos [67.78336281317347]
VideoTree is a query-adaptive and hierarchical framework for long-video understanding with Large Language Models.
VideoTree adaptively selects frames for captioning by iteratively clustering frames based on their visual features.
It organizes visual clusters into a query-adaptive and hierarchical tree structure.
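One way to read "iteratively clustering frames based on their visual features" is a k-means-style grouping; the sketch below is an assumption-laden illustration (farthest-first initialization, squared Euclidean distance, the hypothetical name `kmeans_frames`), not VideoTree's actual procedure.

```python
def kmeans_frames(features, k, iters=5):
    """Cluster frame feature vectors into k groups with plain k-means.

    features: list of equal-length float lists (one per frame).
    Returns a cluster id per frame.
    """
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))

    # Farthest-first initialization: start from frame 0, then repeatedly
    # add the frame farthest from all chosen centers (deterministic).
    centers = [list(features[0])]
    while len(centers) < k:
        far = max(features, key=lambda f: min(d2(f, c) for c in centers))
        centers.append(list(far))

    labels = [0] * len(features)
    for _ in range(iters):
        # Assign each frame to its nearest center.
        for i, f in enumerate(features):
            labels[i] = min(range(k), key=lambda c: d2(f, centers[c]))
        # Recompute each center as the mean of its assigned frames.
        for c in range(k):
            members = [f for i, f in enumerate(features) if labels[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy frame features: two obvious visual groups.
frames = [[0.0, 0.1], [0.1, 0.0], [5.0, 5.1], [5.1, 4.9]]
labels = kmeans_frames(frames, k=2)
print(labels)  # → [0, 0, 1, 1]
```

A representative frame from each cluster could then be captioned, with clusters arranged into the query-adaptive tree the summary describes.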
arXiv Detail & Related papers (2024-05-29T15:49:09Z)
- LongVLM: Efficient Long Video Understanding via Large Language Models [55.813206751150716]
LongVLM is a simple yet powerful VideoLLM for long video understanding.
We encode video representations that incorporate both local and global information.
Our model produces more precise responses for long video understanding.
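LongVLM's encoder is not detailed in this summary; one plausible reading of combining local and global information is pooling frame tokens per segment (local) and over the whole video (global). Everything below, including the name `fuse_local_global`, is a hypothetical sketch.

```python
def fuse_local_global(frame_tokens, segment_len):
    """Illustrative local+global fusion: one mean-pooled token per
    fixed-length segment (local detail), plus one mean-pooled token
    over all frames (global context) appended at the end."""
    dim = len(frame_tokens[0])
    # Global token: mean over every frame token.
    global_tok = [sum(t[d] for t in frame_tokens) / len(frame_tokens)
                  for d in range(dim)]
    # Local tokens: mean over each contiguous segment of frames.
    local_toks = []
    for s in range(0, len(frame_tokens), segment_len):
        seg = frame_tokens[s:s + segment_len]
        local_toks.append([sum(t[d] for t in seg) / len(seg)
                           for d in range(dim)])
    return local_toks + [global_tok]

# Toy 2-d frame tokens for a four-frame clip.
tokens = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
print(fuse_local_global(tokens, segment_len=2))
# → [[2.0, 0.0], [0.0, 3.0], [1.0, 1.5]]
```

The fused sequence is much shorter than the raw frame tokens while keeping both segment-level detail and a whole-video summary.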
arXiv Detail & Related papers (2024-04-04T11:33:29Z)
- LLMs Meet Long Video: Advancing Long Video Comprehension with An Interactive Visual Adapter in LLMs [24.79384819644494]
Long video understanding is a significant and ongoing challenge in the intersection of multimedia and artificial intelligence.
We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
arXiv Detail & Related papers (2024-02-21T05:56:52Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
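Hungarian matching finds the minimum-cost one-to-one pairing between two sets, here sub-instructions and candidate video events. For the handful of items typical of a single query, even brute force illustrates the idea; the cost values below are invented, and a real implementation would use a polynomial-time Hungarian solver such as scipy's `linear_sum_assignment`.

```python
from itertools import permutations

def best_assignment(cost):
    """Minimum-cost one-to-one assignment between rows (sub-instructions)
    and columns (video events) of a square cost matrix.

    Brute force over permutations; only suitable for small n."""
    n = len(cost)
    best = min(permutations(range(n)),
               key=lambda p: sum(cost[i][p[i]] for i in range(n)))
    return list(best)

# Illustrative costs: cost[i][j] = dissimilarity between sub-instruction i
# and candidate video event j (lower is a better match).
cost = [
    [0.1, 0.9, 0.8],
    [0.7, 0.2, 0.9],
    [0.8, 0.6, 0.3],
]
print(best_assignment(cost))  # → [0, 1, 2]
```

In practice the matrix need not be square (there are usually more candidate events than sub-instructions); library implementations handle rectangular costs directly.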
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which converts inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.