VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling
- URL: http://arxiv.org/abs/2111.12681v1
- Date: Wed, 24 Nov 2021 18:31:20 GMT
- Title: VIOLET : End-to-End Video-Language Transformers with Masked Visual-token
Modeling
- Authors: Tsu-Jui Fu and Linjie Li and Zhe Gan and Kevin Lin and William Yang
Wang and Lijuan Wang and Zicheng Liu
- Abstract summary: A great challenge in video-language (VidL) modeling lies in the disconnection between fixed video representations extracted from image/video understanding models and downstream VidL data.
We present VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video transformer to explicitly model the temporal dynamics of video inputs.
- Score: 88.30109041658618
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A great challenge in video-language (VidL) modeling lies in the disconnection
between fixed video representations extracted from image/video understanding
models and downstream VidL data. Recent studies try to mitigate this
disconnection via end-to-end training. To make it computationally feasible,
prior works tend to "imagify" video inputs, i.e., a handful of sparsely sampled
frames are fed into a 2D CNN, followed by a simple mean-pooling or
concatenation to obtain the overall video representations. Although achieving
promising results, such simple approaches may lose temporal information that is
essential for performing downstream VidL tasks. In this work, we present
VIOLET, a fully end-to-end VIdeO-LanguagE Transformer, which adopts a video
transformer to explicitly model the temporal dynamics of video inputs. Further,
unlike previous studies that found pre-training tasks on video inputs (e.g.,
masked frame modeling) not very effective, we design a new pre-training task,
Masked Visual-token Modeling (MVM), for better video modeling. Specifically,
the original video frame patches are "tokenized" into discrete visual tokens,
and the goal is to recover the original visual tokens based on the masked
patches. Comprehensive analysis demonstrates the effectiveness of both explicit
temporal modeling via video transformer and MVM. As a result, VIOLET achieves
new state-of-the-art performance on 5 video question answering tasks and 4
text-to-video retrieval tasks.
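To make the MVM objective concrete, here is a minimal PyTorch-style sketch assuming a frozen discrete visual tokenizer (e.g., a pretrained discrete VAE) that maps frame patches to token ids; the names (MVMHead, mvm_loss, video_encoder, tokenizer) and the masking ratio are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MVMHead(nn.Module):
    """Predicts discrete visual-token ids from video-transformer patch features."""
    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, vocab_size)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (B, N_patches, hidden_dim)
        return self.classifier(patch_features)  # (B, N_patches, vocab_size) logits


def mvm_loss(video_encoder, mvm_head, tokenizer, frames, mask_ratio=0.15):
    """Mask a random subset of patches, then recover their discrete visual tokens."""
    with torch.no_grad():
        # Frozen tokenizer: frame patches -> discrete token ids, shape (B, N_patches)
        target_ids = tokenizer(frames)

    # True = masked position (random masking ratio is an assumption here)
    mask = torch.rand_like(target_ids, dtype=torch.float) < mask_ratio

    # Video transformer sees the frames with the selected patches masked out
    features = video_encoder(frames, patch_mask=mask)  # (B, N_patches, hidden_dim)
    logits = mvm_head(features)

    # Cross-entropy only over the masked positions
    return F.cross_entropy(logits[mask], target_ids[mask])
```

The key point is that the reconstruction target is a discrete token id per masked patch, so the objective is a standard classification cross-entropy rather than a pixel-level regression.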
Related papers
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its temporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is capable of both comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity.
This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods.
Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
- Masked Motion Encoding for Self-Supervised Video Representation Learning [84.24773072241945]
We present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues.
Motivated by the fact that humans are able to recognize an action by tracking objects' position and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions.
Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details.
arXiv Detail & Related papers (2022-10-12T11:19:55Z)
- All in One: Exploring Unified Video-Language Pre-training [44.22059872694995]
We introduce an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations.
The code and pretrained model have been released in https://github.com/showlab/all-in-one.
arXiv Detail & Related papers (2022-03-14T17:06:30Z)
- Show Me What and Tell Me How: Video Synthesis via Multimodal Conditioning [36.85533835408882]
This work presents a multimodal video generation framework that benefits from text and images provided jointly or separately.
We propose a new video token trained with self-learning and an improved mask-prediction algorithm for sampling video tokens.
Our framework can incorporate various visual modalities, such as segmentation masks, drawings, and partially occluded images.
arXiv Detail & Related papers (2022-03-04T21:09:13Z)
- Video Transformers: A Survey [42.314208650554264]
We study the contributions and trends for adapting Transformers to model video data.
Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones.
Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches.
arXiv Detail & Related papers (2022-01-16T07:31:55Z)
- Align and Prompt: Video-and-Language Pre-training with Entity Prompts [111.23364631136339]
Video-and-language pre-training has shown promising improvements on various downstream tasks.
We propose Align and Prompt: an efficient and effective video-and-language pre-training framework with better cross-modal alignment.
Our code and pre-trained models will be released.
arXiv Detail & Related papers (2021-12-17T15:55:53Z)
- VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning [82.09856883441044]
Video understanding relies on perceiving the global content and modeling its internal connections.
We propose a block-wise strategy where we mask neighboring video tokens in both spatial and temporal domains (a rough sketch follows this list).
We also add an augmentation-free contrastive learning method to further capture global content.
arXiv Detail & Related papers (2021-06-21T16:48:19Z)
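For the block-wise masking described in the VIMPAC entry above, the following sketch shows one way to mask a contiguous spatio-temporal block of video-token positions; the function name, block sizes, and single-block sampling are assumptions for illustration, not the authors' procedure.

```python
import torch

def blockwise_mask(T: int, H: int, W: int, t_block: int = 2, s_block: int = 4):
    """Mask one contiguous spatio-temporal block of video-token positions.

    T, H, W are the numbers of token positions along time, height, and width.
    Returns a boolean (T, H, W) mask where True marks a masked token.
    Block sizes and single-block sampling here are illustrative assumptions.
    """
    mask = torch.zeros(T, H, W, dtype=torch.bool)
    t0 = torch.randint(0, T - t_block + 1, (1,)).item()
    h0 = torch.randint(0, H - s_block + 1, (1,)).item()
    w0 = torch.randint(0, W - s_block + 1, (1,)).item()
    mask[t0:t0 + t_block, h0:h0 + s_block, w0:w0 + s_block] = True
    return mask

# Example: an 8 x 14 x 14 grid of video tokens
m = blockwise_mask(8, 14, 14)
print(m.float().mean().item())  # fraction of tokens covered by the block
```

Masking neighbors jointly in space and time prevents the model from trivially copying a masked token from an adjacent frame or patch, which is the motivation given for the block-wise strategy.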
This list is automatically generated from the titles and abstracts of the papers in this site.