LinVT: Empower Your Image-level Large Language Model to Understand Videos
- URL: http://arxiv.org/abs/2412.05185v2
- Date: Wed, 11 Dec 2024 14:43:02 GMT
- Title: LinVT: Empower Your Image-level Large Language Model to Understand Videos
- Authors: Lishuai Gao, Yujie Zhong, Yingsen Zeng, Haoxian Tan, Dengjie Li, Zheng Zhao
- Abstract summary: Large Language Models (LLMs) have been widely used in various tasks.
We propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs.
We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL.
- Score: 10.159566131070418
- Abstract: Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module to transform arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation to preserve the original visual-language alignment and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
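The two design principles in the abstract (a linear transformation of visual tokens, and condensation of redundant video content) can be illustrated with a toy sketch. This is not the authors' implementation: the function names `softmax` and `condense_tokens`, the hand-picked scores, and the tiny 2-dimensional tokens are all assumptions made for illustration. In a trained module the weights would presumably be learned rather than fixed by hand.

```python
import math


def softmax(scores):
    """Turn raw scores into non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def condense_tokens(tokens, score_rows):
    """Condense len(tokens) visual tokens into len(score_rows) tokens.

    Each output token is a weighted average of the input tokens. For fixed
    weights the mapping is linear, so outputs stay in the span of the
    original image-level visual features (hypothetical illustration only).
    """
    dim = len(tokens[0])
    condensed = []
    for scores in score_rows:
        weights = softmax(scores)
        condensed.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return condensed


# Toy "video": four frame tokens of dimension 2; frames 1-2 and 3-4 are redundant.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]

# Hand-picked scores (assumption): row 0 averages all frames,
# row 1 focuses on the first two frames.
score_rows = [[0.0, 0.0, 0.0, 0.0], [5.0, 5.0, -5.0, -5.0]]

condensed = condense_tokens(tokens, score_rows)
```

Here four input tokens are reduced to two output tokens, and each output is a convex combination of the inputs, which is one simple way to read "linear transformation" plus "information condensation" together.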
Related papers
- InternVideo2.5: Empowering Video MLLMs with Long and Rich Context Modeling [56.130911402831906]
This paper aims to improve the performance of video multimodal large language models (MLLMs) via long and rich context (LRC) modeling.
We develop a new version of InternVideo2.5 with a focus on enhancing the original MLLMs' ability to perceive fine-grained details in videos.
Experimental results demonstrate that this unique design of LRC greatly improves the results of video MLLMs on mainstream understanding benchmarks.
arXiv Detail & Related papers (2025-01-21T18:59:00Z)
- TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models [52.590072198551944]
Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multi-modal contents.
For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data.
In this work, we explore the limitations of the existing compression strategies for building a training-free video LLM.
arXiv Detail & Related papers (2024-11-17T13:08:29Z)
- From Image to Video, what do we need in multimodal LLMs? [19.85928004619801]
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in understanding multimodal information.
We propose RED-VILLM, a Resource-Efficient Development pipeline for Video LLMs from Image LLMs.
Our approach highlights the potential for a more cost-effective and scalable advancement in multimodal models.
arXiv Detail & Related papers (2024-04-18T02:43:37Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Video Understanding with Large Language Models: A Survey [97.29126722004949]
Given the remarkable capabilities of large language models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advancements in video understanding.
The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their ability for open-ended multi-granularity reasoning.
This survey presents a comprehensive study of the tasks, datasets, benchmarks, and evaluation methodologies for Vid-LLMs.
arXiv Detail & Related papers (2023-12-29T01:56:17Z)
- Video-LLaVA: Learning United Visual Representation by Alignment Before Projection [27.04277811443469]
Video-LLaVA learns from a mixed dataset of images and videos, mutually enhancing each other.
Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits.
arXiv Detail & Related papers (2023-11-16T10:59:44Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- LiVLR: A Lightweight Visual-Linguistic Reasoning Framework for Video Question Answering [50.11756459499762]
We propose a Lightweight Visual-Linguistic Reasoning framework named LiVLR.
LiVLR first utilizes graph-based Visual and Linguistic Encoders to obtain multi-grained visual and linguistic representations.
The proposed LiVLR is lightweight and shows its performance advantage on two VideoQA benchmarks.
arXiv Detail & Related papers (2021-11-29T14:18:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.