Long-Form Video-Language Pre-Training with Multimodal Temporal
Contrastive Learning
- URL: http://arxiv.org/abs/2210.06031v1
- Date: Wed, 12 Oct 2022 09:08:27 GMT
- Title: Long-Form Video-Language Pre-Training with Multimodal Temporal
Contrastive Learning
- Authors: Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, Jianlong Fu
- Abstract summary: Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks, achieving new state-of-the-art performance.
- Score: 39.80936685227549
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale video-language pre-training has shown significant improvement in
video-language understanding tasks. Previous studies of video-language
pretraining mainly focus on short-form videos (i.e., within 30 seconds) and
sentences, leaving long-form video-language pre-training rarely explored.
Directly learning representation from long-form videos and language may benefit
many long-form video-language understanding tasks. However, it is challenging
due to the difficulty of modeling long-range relationships and the heavy
computational burden caused by more frames. In this paper, we introduce a
Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a
large-scale long-form video and paragraph dataset constructed from an existing
public dataset. To effectively capture the rich temporal dynamics and to better
align video and language in an efficient end-to-end manner, we introduce two
novel designs in our LF-VILA model. We first propose a Multimodal Temporal
Contrastive (MTC) loss to learn the temporal relation across different
modalities by encouraging fine-grained alignment between long-form videos and
paragraphs. Second, we propose a Hierarchical Temporal Window Attention (HTWA)
mechanism to effectively capture long-range dependency while reducing
computational cost in Transformer. We fine-tune the pre-trained LF-VILA model
on seven downstream long-form video-language understanding tasks of
paragraph-to-video retrieval and long-form video question-answering, and
achieve new state-of-the-art performance. Specifically, our model achieves a
16.1% relative improvement on the ActivityNet paragraph-to-video retrieval task
and a 2.4% improvement on the How2QA task. We release our code, dataset, and
pre-trained models at https://github.com/microsoft/XPretrain.
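The abstract names two mechanisms, MTC and HTWA, without spelling out their form. As a rough illustration only, the sketch below shows one plausible instantiation of a multimodal temporal contrastive loss: clip-level and sentence-level embeddings at matching temporal positions are treated as positives under a symmetric InfoNCE objective. The function name, tensor shapes, and temperature value are assumptions for this example, not the exact LF-VILA formulation (the authors' implementation is in the released code).

```python
# Hypothetical sketch of a clip/sentence temporal contrastive loss in the
# spirit of MTC; shapes and names are illustrative, not LF-VILA's exact loss.
import torch
import torch.nn.functional as F

def temporal_contrastive_loss(clip_emb, sent_emb, temperature=0.07):
    """clip_emb, sent_emb: (batch, num_segments, dim) embeddings of video
    clips and the paragraph sentences temporally aligned with them."""
    b, t, d = clip_emb.shape
    v = F.normalize(clip_emb.reshape(b * t, d), dim=-1)
    s = F.normalize(sent_emb.reshape(b * t, d), dim=-1)
    logits = v @ s.t() / temperature                     # pairwise similarities
    targets = torch.arange(b * t, device=logits.device)  # matched pairs lie on the diagonal
    # symmetric InfoNCE over video-to-text and text-to-video directions
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
```

The HTWA idea can likewise be sketched as self-attention restricted to local temporal windows, with the window size growing across layers so that long-range dependencies are still captured at reduced cost. The module below is an assumed illustration of windowed temporal attention, not the paper's exact design.

```python
import torch.nn as nn

class TemporalWindowAttention(nn.Module):
    """Illustrative windowed temporal self-attention (not the exact HTWA module)."""
    def __init__(self, dim, num_heads, window_size):
        super().__init__()
        self.window_size = window_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (batch, num_frames, dim); num_frames assumed divisible by window_size
        b, t, d = x.shape
        w = self.window_size
        windows = x.reshape(b * t // w, w, d)          # split frames into local windows
        out, _ = self.attn(windows, windows, windows)  # attend only within each window
        return out.reshape(b, t, d)
```

Stacking such layers with increasing window sizes (for example 4, 8, then 16 frames) gives a hierarchy of temporal receptive fields while keeping the attention cost well below full self-attention over all frames.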
Related papers
- Contrastive Language Video Time Pre-training [12.876308881183371]
We introduce LAVITI, a novel approach to learning language, video, and temporal representations in long-form videos via contrastive learning.
Our model employs a set of learnable moment queries to decode clip-level visual, language, and temporal features.
We validated our method on CharadesEgo action recognition, achieving state-of-the-art results.
arXiv Detail & Related papers (2024-06-04T02:48:59Z)
- VidLA: Video-Language Alignment at Scale [48.665918882615195]
We propose VidLA, an approach for video-language alignment at scale.
Our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks.
arXiv Detail & Related papers (2024-03-21T22:36:24Z)
- Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling its temporal dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z)
- TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding [20.16000249533665]
TESTA condenses video semantics by adaptively aggregating similar frames, as well as similar patches within each frame.
Building upon TESTA, we introduce a pre-trained video-language model equipped with a divided space-time token aggregation module in each video block.
We evaluate our model on five datasets for paragraph-to-video retrieval and long-form VideoQA tasks.
arXiv Detail & Related papers (2023-10-29T16:25:32Z)
- HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training [49.52679453475878]
We propose a Temporal-Aware video-language pre-training framework, HiTeA, for modeling cross-modal alignment between moments and texts.
We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks.
arXiv Detail & Related papers (2022-12-30T04:27:01Z)
- LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling [48.283659682112926]
We propose LiteVL, which adapts a pre-trained image-language model BLIP into a video-text model directly on downstream tasks.
We also propose a non-parametric pooling mechanism to adaptively reweight the fine-grained video embedding conditioned on the text.
arXiv Detail & Related papers (2022-10-21T13:03:49Z)
- Revisiting the "Video" in Video-Language Understanding [56.15777956496518]
We propose the atemporal probe (ATP), a new model for video-language analysis.
We characterize the limitations and potential of current video-language benchmarks.
We show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
arXiv Detail & Related papers (2022-06-03T17:57:33Z)
- Understanding Chinese Video and Language via Contrastive Multimodal Pre-Training [79.88705563918413]
We propose a novel video-language understanding framework named VICTOR, which stands for VIdeo-language understanding via Contrastive mulTimOdal pRe-training.
VICTOR is trained on a large-scale Chinese video-language dataset, including over 10 million complete videos with corresponding high-quality textual descriptions.
arXiv Detail & Related papers (2021-04-19T15:58:45Z)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.