E-ViLM: Efficient Video-Language Model via Masked Video Modeling with
Semantic Vector-Quantized Tokenizer
- URL: http://arxiv.org/abs/2311.17267v1
- Date: Tue, 28 Nov 2023 22:57:17 GMT
- Title: E-ViLM: Efficient Video-Language Model via Masked Video Modeling with
Semantic Vector-Quantized Tokenizer
- Authors: Jacob Zhiyuan Fang, Skyler Zheng, Vasu Sharma, Robinson Piramuthu
- Abstract summary: E-ViLM is able to learn expressive representations from Video-Language corpus and generalize well to extensive Video-Language tasks.
Our model reaches $39.3$% Top-$1$ accuracy on the MSRVTT benchmark, retaining $91.4$% of the accuracy of a state-of-the-art larger VL architecture.
- Score: 5.7254320553764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To build scalable models for challenging real-world tasks, it is important to
learn from diverse, multi-modal data in various forms (e.g., videos, text, and
images). Many existing works have focused on leveraging large but cumbersome
cross-modal architectures. Despite their effectiveness, such large architectures
inevitably hinder deployment in real-world applications, so building a lightweight
VL architecture and an efficient learning schema is of great practical value. In this
paper, we propose an Efficient Video-Language Model (dubbed E-ViLM) and a masked video
modeling (MVM) schema, assisted with a semantic vector-quantized tokenizer. In
particular, our E-ViLM learns to reconstruct the semantic labels of masked
video regions, produced by the pre-trained vector-quantized tokenizer, which
discretizes the continuous visual signals into labels. We show that with our
simple MVM task and regular VL pre-training objectives, our E-ViLM, despite its
compactness, is able to learn expressive representations from Video-Language
corpora and generalize well to a wide range of Video-Language tasks, including
video question answering, text-to-video retrieval, etc. Moreover, our E-ViLM
achieves clear efficiency gains, delivering competitive performance at faster
inference speed: it reaches $39.3$% Top-$1$ accuracy on the MSRVTT benchmark,
retaining $91.4$% of the accuracy of a state-of-the-art larger VL architecture
with only $15$% of the parameters and $94.8$% fewer GFLOPs. We also
provide extensive ablative studies that validate the effectiveness of our
proposed learning schema for E-ViLM.
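To make the MVM objective concrete, below is a minimal sketch, assuming a generic PyTorch setup: a frozen, pre-trained vector-quantized tokenizer discretizes continuous patch features into codebook indices (the "semantic labels"), and the compact video encoder is trained to predict the labels of the masked regions. All module names, dimensions, and the mask ratio are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of masked video modeling (MVM) with a semantic
# vector-quantized tokenizer, as described in the abstract above.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticVQTokenizer(nn.Module):
    """Stand-in for the pre-trained, frozen vector-quantized tokenizer."""

    def __init__(self, feat_dim: int = 768, codebook_size: int = 8192):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, feat_dim)

    @torch.no_grad()
    def forward(self, patch_feats: torch.Tensor) -> torch.Tensor:
        # patch_feats: (B, N, D) continuous visual features for N video patches.
        flat = patch_feats.reshape(-1, patch_feats.size(-1))         # (B*N, D)
        dists = torch.cdist(flat, self.codebook.weight)              # (B*N, K)
        return dists.argmin(dim=-1).view(patch_feats.shape[:2])      # (B, N) labels


def mvm_loss(video_encoder: nn.Module,
             label_head: nn.Module,
             tokenizer: SemanticVQTokenizer,
             patch_feats: torch.Tensor,
             mask_ratio: float = 0.5) -> torch.Tensor:
    """Cross-entropy between predicted and tokenizer-assigned labels on masked patches."""
    B, N, _ = patch_feats.shape
    labels = tokenizer(patch_feats)                                  # (B, N), no gradient
    mask = torch.rand(B, N, device=patch_feats.device) < mask_ratio  # True = masked
    corrupted = patch_feats.masked_fill(mask.unsqueeze(-1), 0.0)     # zero out masked patches
    logits = label_head(video_encoder(corrupted))                    # (B, N, codebook_size)
    return F.cross_entropy(logits[mask], labels[mask])
```

Here video_encoder stands for the lightweight video backbone and label_head for a linear classifier over the codebook (e.g., nn.Linear(768, 8192)); per the abstract, such an MVM loss is used alongside the regular VL pre-training objectives.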
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy of 0.77% on the ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- Fewer Tokens and Fewer Videos: Extending Video Understanding Abilities in Large Vision-Language Models [29.825619120260164]
This paper addresses the challenge by leveraging the visual commonalities between images and videos to evolve image-LVLMs into video-LVLMs.
We present a cost-effective video-LVLM that enhances model architecture, introduces innovative training strategies, and identifies the most effective types of video instruction data.
arXiv Detail & Related papers (2024-06-12T09:22:45Z)
- ST-LLM: Large Language Models Are Effective Temporal Learners [58.79456373423189]
Large Language Models (LLMs) have showcased impressive capabilities in text comprehension and generation.
How to effectively encode and understand videos in video-based dialogue systems remains to be solved.
We propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside LLM.
arXiv Detail & Related papers (2024-03-30T10:11:26Z)
- Tuning Large Multimodal Models for Videos using Reinforcement Learning from AI Feedback [38.708690624594794]
Video and text multimodal alignment remains challenging, primarily due to the limited volume and quality of multimodal instruction-tuning data.
We present a novel alignment strategy that employs a multimodal AI system to oversee itself, called Reinforcement Learning from AI Feedback (RLAIF).
Specifically, we propose context-aware reward modeling by providing detailed video descriptions as context during the generation of preference feedback.
arXiv Detail & Related papers (2024-02-06T06:27:40Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the need to model video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation [122.63617171522316]
Large Language Models (LLMs) are the dominant models for generative tasks in language.
We introduce MAGVIT-v2, a video tokenizer designed to generate concise and expressive tokens for both videos and images.
arXiv Detail & Related papers (2023-10-09T14:10:29Z)
- GPT4Image: Can Large Pre-trained Models Help Vision Models on Perception Tasks? [51.22096780511165]
We present a new learning paradigm in which the knowledge extracted from large pre-trained models is utilized to help models such as CNNs and ViTs learn enhanced representations.
We feed detailed descriptions into a pre-trained encoder to extract text embeddings with rich semantic information that encodes the content of images.
arXiv Detail & Related papers (2023-06-01T14:02:45Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models [67.31684040281465]
We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification.
In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow and audio spectrogram.
arXiv Detail & Related papers (2022-07-15T17:59:11Z)
- Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions [31.4943447481144]
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream tasks.
Our model achieves new state-of-the-art results in 10 understanding tasks and 2 more novel text-to-visual generation tasks.
arXiv Detail & Related papers (2021-11-19T17:36:01Z)