LLMs Meet Long Video: Advancing Long Video Comprehension with An
Interactive Visual Adapter in LLMs
- URL: http://arxiv.org/abs/2402.13546v1
- Date: Wed, 21 Feb 2024 05:56:52 GMT
- Title: LLMs Meet Long Video: Advancing Long Video Comprehension with An
Interactive Visual Adapter in LLMs
- Authors: Yunxin Li, Xinyu Chen, Baotian Hu, Min Zhang
- Abstract summary: Long video understanding is a significant and ongoing challenge at the intersection of multimedia and artificial intelligence.
We present an Interactive Visual Adapter (IVA) within large language models (LLMs) to enhance interaction with fine-grained visual elements.
- Score: 24.79384819644494
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Long video understanding is a significant and ongoing challenge at the
intersection of multimedia and artificial intelligence. Employing large
language models (LLMs) to comprehend video has become an emerging and
promising method. However, this approach incurs high computational costs due to
the extensive array of video tokens, experiences reduced visual clarity as a
consequence of token aggregation, and confronts challenges arising from
irrelevant visual tokens while answering video-related questions. To alleviate
these issues, we present an Interactive Visual Adapter (IVA) within LLMs,
designed to enhance interaction with fine-grained visual elements.
Specifically, we first transform long videos into temporal video tokens by
leveraging a visual encoder alongside a pretrained causal transformer, then
feed them into LLMs with the video instructions. Subsequently, we integrate
IVA, which contains a lightweight temporal frame selector and a spatial feature
interactor, within the internal blocks of LLMs to capture instruction-aware and
fine-grained visual signals. Consequently, the proposed video-LLM facilitates a
comprehensive understanding of long video content through appropriate long
video modeling and precise visual interactions. We conducted extensive
experiments on nine video understanding benchmarks and experimental results
show that our interactive visual adapter significantly improves the performance
of video LLMs on long video QA tasks. Ablation studies further verify the
effectiveness of IVA in long and short video understanding.
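
The abstract describes IVA as a lightweight temporal frame selector followed by a spatial feature interactor, but it does not give the exact formulation. The following is a minimal sketch of that two-step idea under stated assumptions: frames are scored by a plain dot product against an instruction vector, and the interaction is a single-head cross-attention over patch features. The names `select_frames`, `interact`, and the toy data are hypothetical and are not taken from the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_frames(query, frame_feats, k):
    """Assumed selector: keep the k frames with the highest dot-product
    score against the instruction vector, returned in temporal order."""
    scores = [dot(query, f) for f in frame_feats]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)


def interact(query, patch_feats):
    """Assumed interactor: single-head cross-attention in which the
    instruction vector attends over one frame's patch features."""
    d = len(query)
    weights = softmax([dot(query, p) / math.sqrt(d) for p in patch_feats])
    return [sum(w * p[j] for w, p in zip(weights, patch_feats)) for j in range(d)]


# Toy data: 4 frames with 2-dimensional global features, plus patch
# features for the frames that end up being selected.
query = [1.0, 0.0]
frame_feats = [[0.1, 0.9], [0.8, 0.2], [0.0, 1.0], [0.9, 0.1]]
patches = {
    1: [[1.0, 0.0], [0.0, 1.0]],
    3: [[0.5, 0.5], [1.0, 0.0]],
}

selected = select_frames(query, frame_feats, k=2)
fused = [interact(query, patches[i]) for i in selected]
```

In this sketch only the selected frames reach the attention step, which is one way to read the abstract's goal of avoiding irrelevant visual tokens while keeping fine-grained detail.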
Related papers
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
- Retrieval-based Video Language Model for Efficient Long Video Question Answering [39.474247695753725]
We introduce a retrieval-based video language model (R-VLM) for efficient and interpretable long video QA.
Specifically, given a question (query) and a long video, our model identifies and selects the most relevant $K$ video chunks.
Our experimental results validate the effectiveness of our framework for comprehending long videos.
arXiv Detail & Related papers (2023-12-08T09:48:36Z)
- VaQuitA: Enhancing Alignment in LLM-Assisted Video Understanding [63.075626670943116]
We introduce a cutting-edge framework, VaQuitA, designed to refine the synergy between video and textual information.
At the data level, instead of sampling frames uniformly, we implement a sampling method guided by CLIP-score rankings.
At the feature level, we integrate a trainable Video Perceiver alongside a Visual-Query Transformer.
arXiv Detail & Related papers (2023-12-04T19:48:02Z)
- VTimeLLM: Empower LLM to Grasp Video Moments [43.51980030572101]
Large language models (LLMs) have shown remarkable text understanding capabilities.
Video LLMs can only provide a coarse description of the entire video.
We propose VTimeLLM, a novel Video LLM for fine-grained video moment understanding.
arXiv Detail & Related papers (2023-11-30T10:49:56Z)
- VidCoM: Fast Video Comprehension through Large Language Models with Multimodal Tools [44.78291853329394]
VidCoM is a fast adaptive framework that leverages Large Language Models (LLMs) to reason about videos using lightweight visual tools.
An InsOVER algorithm locates the corresponding video events based on an efficient Hungarian matching between decompositions of linguistic instructions and video events.
arXiv Detail & Related papers (2023-10-16T17:05:56Z)
- Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding [61.80870130860662]
Video-LLaMA is a framework that empowers Large Language Models (LLMs) with the capability of understanding both visual and auditory content in the video.
Video-LLaMA bootstraps cross-modal training from the frozen pre-trained visual and audio encoders and the frozen LLMs.
We found Video-LLaMA shows the ability to perceive and comprehend video content and generate meaningful responses.
arXiv Detail & Related papers (2023-06-05T13:17:27Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality Encoder and Semantic Translator, which convert inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)
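
The retrieval-based entry in the list above selects the $K$ video chunks most relevant to a question. The sketch below illustrates that selection step only, assuming that the question and each chunk are already embedded as vectors and that relevance is measured by cosine similarity; the listing does not state the similarity measure, and `top_k_chunks` and the toy embeddings are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)


def top_k_chunks(question_emb, chunk_embs, k):
    """Return the indices of the k chunks most similar to the question,
    best first; ties are broken by the earlier chunk index."""
    scored = [(cosine(question_emb, c), i) for i, c in enumerate(chunk_embs)]
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [i for _, i in scored[:k]]


# Toy embeddings for four video chunks and one question.
chunk_embs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [-1.0, 0.0]]
best = top_k_chunks([1.0, 0.1], chunk_embs, k=2)
```

Only the returned chunks would then be passed to the language model, which is what keeps the token count bounded for long videos.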