Video-LLaVA: Learning United Visual Representation by Alignment Before
Projection
- URL: http://arxiv.org/abs/2311.10122v2
- Date: Tue, 21 Nov 2023 14:37:30 GMT
- Title: Video-LLaVA: Learning United Visual Representation by Alignment Before
Projection
- Authors: Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan
- Abstract summary: We introduce Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other.
Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks across 5 image question-answering datasets and 4 image benchmark toolkits.
Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos.
- Score: 28.39885771124003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Large Vision-Language Model (LVLM) has enhanced the performance of
various downstream tasks in visual-language understanding. Most existing
approaches encode images and videos into separate feature spaces, which are
then fed as inputs to large language models. However, due to the lack of
unified tokenization for images and videos, namely misalignment before
projection, it becomes challenging for a Large Language Model (LLM) to learn
multi-modal interactions from several poor projection layers. In this work, we
unify visual representation into the language feature space to advance the
foundational LLM towards a unified LVLM. As a result, we establish a simple but
robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images
and videos, mutually enhancing each other. Video-LLaVA achieves superior
performance on a broad range of 9 image benchmarks across 5 image
question-answering datasets and 4 image benchmark toolkits. Additionally, our
Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on
MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive
experiments demonstrate that Video-LLaVA mutually benefits images and videos
within a unified visual representation, outperforming models designed
specifically for images or videos. We aim for this work to provide modest
insights into the multi-modal inputs for the LLM.
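The "alignment before projection" idea in the abstract can be sketched as follows: because image and video features are first aligned into one shared visual space (e.g. via LanguageBind-style pre-aligned encoders), a single projection layer can map both modalities into the LLM's embedding space. This is an illustrative sketch only; the dimensions, function names, and random features are assumptions standing in for real encoder outputs, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D_VISION = 1024  # dim of the shared (pre-aligned) visual feature space
D_LLM = 4096     # dim of the LLM token-embedding space

# Hypothetical pre-aligned encoders: both modalities emit tokens in the
# SAME D_VISION space, so no per-modality projection is needed.
def encode_image(image):  # image: (H, W, 3)
    return rng.standard_normal((256, D_VISION))  # 256 patch tokens

def encode_video(video):  # video: (T, H, W, 3)
    t = video.shape[0]
    return rng.standard_normal((t * 256, D_VISION))  # tokens per frame

# One shared projection suffices for images and videos alike.
W_proj = rng.standard_normal((D_VISION, D_LLM)) * 0.02

def project(visual_tokens):
    return visual_tokens @ W_proj  # -> (N, D_LLM), ready for the LLM

img_tokens = project(encode_image(np.zeros((224, 224, 3))))
vid_tokens = project(encode_video(np.zeros((8, 224, 224, 3))))
assert img_tokens.shape[1] == vid_tokens.shape[1] == D_LLM
```

The contrast with "misalignment before projection" is that separate, unaligned feature spaces would force the LLM to reconcile two different projection layers; aligning first lets a single projection carry both modalities.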
Related papers
- Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models [29.825619120260164]
This paper addresses the challenge by leveraging the visual commonalities between images and videos to evolve image-LVLMs into video-LVLMs.
We present a cost-effective video-LVLM that enhances model architecture, introduces innovative training strategies, and identifies the most effective types of video instruction data.
arXiv Detail & Related papers (2024-06-12T09:22:45Z) - Dense Connector for MLLMs [89.50595155217108]
We introduce the Dense Connector - a plug-and-play vision-language connector that significantly enhances existing MLLMs.
Our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well.
arXiv Detail & Related papers (2024-05-22T16:25:03Z) - PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straight-forward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging because of the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - E-ViLM: Efficient Video-Language Model via Masked Video Modeling with
Semantic Vector-Quantized Tokenizer [5.7254320553764]
E-ViLM is able to learn expressive representations from Video-Language corpus and generalize well to extensive Video-Language tasks.
Our model reaches 39.3% Top-1 accuracy on the MSRVTT benchmark, retaining 91.4% of the accuracy of the state-of-the-art larger VL architecture.
arXiv Detail & Related papers (2023-11-28T22:57:17Z) - VLAB: Enhancing Video Language Pre-training by Feature Adapting and
Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
arXiv Detail & Related papers (2023-05-22T15:54:22Z) - Boosting Video Representation Learning with Multi-Faceted Integration [112.66127428372089]
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
Existing datasets mostly label only one of the facets for model training, resulting in the video representation that biases to only one facet depending on the training dataset.
We propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets for learning a representation that could reflect the full spectrum of video content.
arXiv Detail & Related papers (2022-01-11T16:14:23Z) - VALUE: A Multi-Task Benchmark for Video-and-Language Understanding
Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.