LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
- URL: http://arxiv.org/abs/2311.17043v1
- Date: Tue, 28 Nov 2023 18:53:43 GMT
- Title: LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models
- Authors: Yanwei Li, Chengyao Wang, Jiaya Jia
- Abstract summary: We present a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding, called LLaMA-VID.
LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely context token and content token.
This dual-token strategy significantly reduces the overload of long videos while preserving critical information.
- Score: 66.40252169137447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we present a novel method to tackle the token generation
challenge in Vision Language Models (VLMs) for video and image understanding,
called LLaMA-VID. Current VLMs, while proficient in tasks like image captioning
and visual question answering, face computational burdens when processing long
videos due to the excessive number of visual tokens. LLaMA-VID addresses this issue by
representing each frame with two distinct tokens, namely context token and
content token. The context token encodes the overall image context based on
user input, whereas the content token encapsulates visual cues in each frame.
This dual-token strategy significantly reduces the overload of long videos
while preserving critical information. Generally, LLaMA-VID empowers existing
frameworks to support hour-long videos and pushes their upper limit with an
extra context token. It is proven to surpass previous methods on most video-
and image-based benchmarks. Code is available at
https://github.com/dvlab-research/LLaMA-VID
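The dual-token scheme described in the abstract can be pictured with a short sketch. This is not the authors' implementation: the pooling choices, the function name, and the tensor shapes below are assumptions, meant only to show how one query-conditioned context token and one query-agnostic content token could be produced per frame.
```python
# Minimal sketch of a per-frame dual-token reduction (assumed, not LLaMA-VID's code).
import torch
import torch.nn.functional as F

def frame_to_two_tokens(visual_feats: torch.Tensor,
                        query_emb: torch.Tensor) -> torch.Tensor:
    """
    visual_feats: (num_patches, dim)      patch features of one frame
    query_emb:    (num_query_tokens, dim) embedding of the user instruction
    returns:      (2, dim)                [context_token, content_token]
    """
    # Context token: patch features weighted by their relevance to the user
    # query (a simple cross-attention-style pooling).
    attn = F.softmax(query_emb @ visual_feats.T / visual_feats.shape[-1] ** 0.5, dim=-1)
    context_token = (attn @ visual_feats).mean(dim=0, keepdim=True)   # (1, dim)

    # Content token: query-agnostic summary of the frame (mean pooling here;
    # the paper's actual downsampling may differ).
    content_token = visual_feats.mean(dim=0, keepdim=True)            # (1, dim)

    return torch.cat([context_token, content_token], dim=0)

# With two tokens per frame, an hour-long video sampled at 1 fps costs
# 3600 * 2 = 7200 visual tokens instead of hundreds of tokens per frame.
frames = [torch.randn(256, 1024) for _ in range(4)]  # toy patch features
query = torch.randn(32, 1024)                        # toy instruction embedding
video_tokens = torch.cat([frame_to_two_tokens(f, query) for f in frames])
print(video_tokens.shape)  # torch.Size([8, 1024])
```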
Related papers
- Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training (a toy sketch of this context-length extension appears after this list).
arXiv Detail & Related papers (2024-06-24T17:58:06Z)
- Auto-Encoding Morph-Tokens for Multimodal LLM [151.2618346912529]
We propose encoding images into morph-tokens to serve a dual purpose: for comprehension, they act as visual prompts instructing the MLLM to generate texts; for generation, they serve as complete visual tokens for image reconstruction.
Experiments show that morph-tokens can achieve a new SOTA for multimodal comprehension and generation simultaneously.
arXiv Detail & Related papers (2024-05-03T08:43:06Z)
- Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens [70.80127538938093]
Vista-LLaMA is a novel framework that maintains a consistent distance between all visual tokens and any language tokens.
We present a sequential visual projector that projects the current video frame into tokens of language space with the assistance of the previous frame.
arXiv Detail & Related papers (2023-12-12T09:47:59Z)
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [122.63617171522316]
Large Language Models (LLMs) are the dominant models for generative tasks in language.
We introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images.
arXiv Detail & Related papers (2023-10-09T14:10:29Z)
- Leveraging per Image-Token Consistency for Vision-Language Pre-training [52.825150269820696]
Cross-modal masked language modeling (CMLM) is insufficient for vision-language pre-training.
We propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training).
The proposed EPIC method is easily combined with pre-training methods.
arXiv Detail & Related papers (2022-11-20T12:10:53Z)
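The first related paper above (Long Context Transfer from Language to Vision) extends the context window of the language backbone rather than compressing the visual side. As a generic, hedged illustration of context-length extension (not that paper's training recipe), the sketch below rescales rotary position embedding (RoPE) indices so that positions beyond the original training length fall back into the trained range; the function name and all numbers are assumptions.
```python
# Toy illustration of linear position interpolation for RoPE (assumed example).
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Return RoPE rotation angles of shape (len(positions), dim // 2).

    scale > 1 compresses position indices, so a model trained on short
    contexts sees long sequences through angles it has already learned.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float() / scale, inv_freq)

orig_len, new_len, dim = 4096, 16384, 128
# Without scaling, position 16383 produces angles far outside the trained range;
# with scale = new_len / orig_len it maps back into the familiar [0, 4096) band.
plain = rope_angles(torch.tensor([new_len - 1]), dim)
interp = rope_angles(torch.tensor([new_len - 1]), dim, scale=new_len / orig_len)
print(plain[0, 0].item(), interp[0, 0].item())  # 16383.0 vs 4095.75
```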
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.