Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens
- URL: http://arxiv.org/abs/2312.08870v1
- Date: Tue, 12 Dec 2023 09:47:59 GMT
- Title: Vista-LLaMA: Reliable Video Narrator via Equal Distance to Visual Tokens
- Authors: Fan Ma, Xiaojie Jin, Heng Wang, Yuchen Xian, Jiashi Feng, Yi Yang
- Abstract summary: Vista-LLaMA is a novel framework that maintains a consistent distance between all visual tokens and any language tokens.
We present a sequential visual projector that projects the current video frame into tokens of language space with the assistance of the previous frame.
- Score: 70.80127538938093
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large video-language models have displayed promising
outcomes in video comprehension. Current approaches straightforwardly convert
video into language tokens and employ large language models for multi-modal
tasks. However, this method often leads to the generation of irrelevant
content, commonly known as "hallucination", as the length of the text increases
and the impact of the video diminishes. To address this problem, we propose
Vista-LLaMA, a novel framework that maintains a consistent distance between
all visual tokens and any language tokens, irrespective of the generated text
length. Vista-LLaMA omits relative position encoding when determining attention
weights between visual and text tokens, while retaining it between text tokens.
This amplifies the effect of visual tokens on text generation, especially when
the relative distance between visual and text tokens is long. The proposed
attention mechanism significantly reduces the chance of producing text that is
irrelevant to the video content. Furthermore, we
present a sequential visual projector that projects the current video frame
into tokens of language space with the assistance of the previous frame. This
approach not only captures the temporal relationship within the video, but also
allows fewer visual tokens to encompass the entire video. Our approach
significantly outperforms various previous methods (e.g., Video-ChatGPT,
MovieChat) on four challenging open-ended video question answering benchmarks.
We reach an accuracy of 60.7 on the zero-shot NExT-QA and 60.5 on the zero-shot
MSRVTT-QA, setting a new state-of-the-art performance. This project is
available at https://jinxxian.github.io/Vista-LLaMA.
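The attention mechanism described above can be illustrated with a short sketch. Below is a minimal, single-head PyTorch example of the "equal distance to visual tokens" idea: rotary position encoding (RoPE) is applied only to text-to-text attention pairs and skipped whenever a visual token is involved, so every text token keeps the same effective distance to the video no matter how long the generated answer grows. The simplified RoPE helper, the handling of visual-visual pairs, the omission of causal masking, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch


def rope(x: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
    """Apply a simplified rotary position embedding to x of shape (seq, dim)."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (10000 ** (torch.arange(half, dtype=x.dtype) / half))
    angles = positions[:, None].to(x.dtype) * freqs[None, :]        # (seq, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


def edvt_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                   is_visual: torch.Tensor) -> torch.Tensor:
    """Single-head attention that uses RoPE only between text tokens.

    q, k, v: (seq, dim); is_visual: (seq,) bool marking visual tokens.
    """
    seq, dim = q.shape
    positions = torch.arange(seq)
    # Scores with positional encoding, used for text-to-text pairs.
    scores_rope = rope(q, positions) @ rope(k, positions).T
    # Scores without positional encoding, used whenever a visual token is
    # involved (treating visual-visual pairs this way is our assumption).
    scores_plain = q @ k.T
    involves_visual = is_visual[:, None] | is_visual[None, :]        # (seq, seq)
    scores = torch.where(involves_visual, scores_plain, scores_rope) / dim ** 0.5
    return torch.softmax(scores, dim=-1) @ v


# Toy usage: 8 visual tokens followed by 24 text tokens, hidden size 64.
q = k = v = torch.randn(32, 64)
is_visual = torch.arange(32) < 8
out = edvt_attention(q, k, v, is_visual)
print(out.shape)  # torch.Size([32, 64])
```

Because the text-to-visual scores never pass through RoPE, they do not decay as the answer lengthens, which is the property the abstract credits with reducing hallucination.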
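The sequential visual projector can be sketched in the same spirit, though the abstract only states that the previous frame assists the projection of the current one into language space. The cross-attention fusion below, as well as all module names and dimensions, are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class SequentialVisualProjector(nn.Module):
    """Hypothetical sketch: project each frame into LM space, conditioning on
    the tokens produced for the previous frame so temporal context carries
    forward (the fusion choice is ours, not necessarily the paper's)."""

    def __init__(self, vis_dim: int = 1024, lm_dim: int = 4096, n_heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(vis_dim, lm_dim)    # frame features -> LM token space
        self.fuse = nn.MultiheadAttention(lm_dim, n_heads, batch_first=True)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        """frames: (num_frames, tokens_per_frame, vis_dim) -> (N, lm_dim)."""
        prev, outputs = None, []
        for frame in frames:                      # process frames in order
            cur = self.proj(frame).unsqueeze(0)   # (1, tokens_per_frame, lm_dim)
            if prev is not None:
                # Current tokens attend to the previous frame's projected tokens
                # (the "assistance" step described in the abstract).
                attended, _ = self.fuse(cur, prev, prev)
                cur = cur + attended
            outputs.append(cur.squeeze(0))
            prev = cur
        # Since temporal context accumulates, fewer tokens per frame could be
        # kept in practice; here all tokens are returned for simplicity.
        return torch.cat(outputs, dim=0)


# Toy usage: 8 frames, 16 visual tokens per frame, ViT feature dim 1024.
tokens = SequentialVisualProjector()(torch.randn(8, 16, 1024))
print(tokens.shape)  # torch.Size([128, 4096])
```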
Related papers
- ViLLa: Video Reasoning Segmentation with Large Language Model [48.75470418596875]
We propose a new video segmentation task - video reasoning segmentation.
The task is designed to output tracklets of segmentation masks given a complex input text query.
We present ViLLa: Video reasoning segmentation with a Large Language Model.
arXiv Detail & Related papers (2024-07-18T17:59:17Z) - Long Context Transfer from Language to Vision [74.78422371545716]
Video sequences offer valuable temporal information, but existing large multimodal models (LMMs) fall short in understanding extremely long videos.
In this paper, we approach this problem from the perspective of the language model.
By simply extrapolating the context length of the language backbone, we enable LMMs to comprehend orders of magnitude more visual tokens without any video training.
arXiv Detail & Related papers (2024-06-24T17:58:06Z) - OmniVid: A Generative Framework for Universal Video Understanding [133.73878582161387]
We seek to unify the output space of video understanding tasks by using language as labels and additionally introducing time and box tokens.
This enables us to address various types of video tasks, including classification, captioning, and localization.
We demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results.
arXiv Detail & Related papers (2024-03-26T17:59:24Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models [66.40252169137447]
We present LLaMA-VID, a novel method to tackle the token generation challenge in Vision Language Models (VLMs) for video and image understanding.
LLaMA-VID addresses this issue by representing each frame with two distinct tokens, namely context token and content token.
This dual-token strategy significantly reduces the overload of long videos while preserving critical information.
arXiv Detail & Related papers (2023-11-28T18:53:43Z) - Phenaki: Variable Length Video Generation From Open Domain Textual Description [21.610541668826006]
Phenaki is a model capable of realistic video synthesis given a sequence of textual prompts.
A new model for learning video representation compresses the video into a small representation of discrete tokens.
To the best of our knowledge, this is the first time a paper studies generating videos from time-variable prompts.
arXiv Detail & Related papers (2022-10-05T17:18:28Z) - TVLT: Textless Vision-Language Transformer [89.31422264408002]
We present the Textless Vision-Language Transformer (TVLT), where homogeneous transformer blocks take raw visual and audio inputs.
TVLT attains performance comparable to its text-based counterpart on various multimodal tasks.
Our findings suggest the possibility of learning compact and efficient visual-linguistic representations from low-level visual and audio signals.
arXiv Detail & Related papers (2022-09-28T15:08:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.