EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the
Backbone
- URL: http://arxiv.org/abs/2307.05463v2
- Date: Sat, 19 Aug 2023 03:38:55 GMT
- Title: EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the
Backbone
- Authors: Shraman Pramanick, Yale Song, Sayan Nag, Kevin Qinghong Lin, Hardik
Shah, Mike Zheng Shou, Rama Chellappa, and Pengchuan Zhang
- Abstract summary: Video-language pre-training can generalize to various vision and language tasks.
Existing egocentric video-language pre-training frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning.
The new generation of egocentric video-language pre-training, EgoVLPv2, incorporates cross-modal fusion directly into the video and language backbones.
- Score: 67.13773226242242
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video-language pre-training (VLP) has become increasingly important due to
its ability to generalize to various vision and language tasks. However,
existing egocentric VLP frameworks utilize separate video and language encoders
and learn task-specific cross-modal information only during fine-tuning,
limiting the development of a unified system. In this work, we introduce the
second generation of egocentric video-language pre-training (EgoVLPv2), a
significant improvement from the previous generation, by incorporating
cross-modal fusion directly into the video and language backbones. EgoVLPv2
learns strong video-text representation during pre-training and reuses the
cross-modal attention modules to support different downstream tasks in a
flexible and efficient manner, reducing fine-tuning costs. Moreover, our
proposed fusion in the backbone strategy is more lightweight and
compute-efficient than stacking additional fusion-specific layers. Extensive
experiments on a wide range of VL tasks demonstrate the effectiveness of
EgoVLPv2 by achieving consistent state-of-the-art performance over strong
baselines across all downstream tasks. Our project page can be found at
https://shramanpramanick.github.io/EgoVLPv2/.
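
To make the fusion-in-the-backbone idea concrete, here is a minimal PyTorch sketch of a gated cross-attention step inserted into an encoder layer, with a switch that disables fusion so the same weights can also act as a dual encoder. All names (GatedCrossAttentionLayer, the fuse flag, the gate parameter) are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class GatedCrossAttentionLayer(nn.Module):
    """One encoder layer with an optional, gated cross-attention step (illustrative only)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # starts at 0: unimodal behaviour is preserved at init
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)

    def forward(self, x, context=None, fuse=True):
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h)[0]                       # ordinary self-attention
        if fuse and context is not None:                         # cross-modal fusion inside the backbone
            x = x + torch.tanh(self.gate) * self.cross_attn(self.norm2(x), context, context)[0]
        return x + self.ffn(self.norm3(x))

# The same layer serves both modes:
layer = GatedCrossAttentionLayer(dim=768)
video_tokens = torch.randn(2, 196, 768)           # e.g. patch tokens from a video encoder
text_tokens = torch.randn(2, 32, 768)             # e.g. token embeddings from a language encoder
dual = layer(video_tokens, fuse=False)            # dual-encoder mode (cheap retrieval)
fused = layer(video_tokens, context=text_tokens)  # fusion-encoder mode (grounding, QA)

Because the gate is initialized at zero, enabling fusion does not disturb the pre-trained unimodal pathway, which is one common way to keep such fusion lightweight; the exact switching mechanism in EgoVLPv2 may differ in detail.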
Related papers
- NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks.
We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities.
We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
- 3rd Place Solution for MeViS Track in CVPR 2024 PVUW workshop: Motion Expression guided Video Segmentation [13.622700558266658]
We propose using frozen pre-trained vision-language models (VLM) as backbones, with a specific emphasis on enhancing cross-modal feature interaction.
Firstly, we use a frozen convolutional CLIP backbone to generate feature-aligned vision and text features, alleviating the issue of domain gap.
Secondly, we add more cross-modal feature fusion in the pipeline to enhance the utilization of multi-modal information.
arXiv Detail & Related papers (2024-06-07T11:15:03Z)
- VILA: On Pre-training for Visual Language Models [74.08039416548209]
We study the design options for VLM pre-training through step-by-step controllable comparisons.
We build VILA, a Visual Language model family that consistently outperforms the state-of-the-art models.
arXiv Detail & Related papers (2023-12-12T18:58:18Z)
- RC3: Regularized Contrastive Cross-lingual Cross-modal Pre-training [84.23022072347821]
We propose a regularized cross-lingual visio-textual contrastive learning objective that constrains the representation proximity of weakly-aligned visio-textual inputs.
Experiments on 5 downstream multi-modal tasks across 6 languages demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2023-05-13T14:41:05Z)
- Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning [39.80936685227549]
Large-scale video-language pre-training has shown significant improvement in video-language understanding tasks.
We introduce a Long-Form VIdeo-LAnguage pre-training model (LF-VILA) and train it on a large-scale long-form video and paragraph dataset.
We fine-tune the model on seven downstream long-form video-language understanding tasks and achieve new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-12T09:08:27Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD).
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions [31.4943447481144]
We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream tasks.
Our model achieves new state-of-the-art results in 10 understanding tasks and 2 more novel text-to-visual generation tasks.
arXiv Detail & Related papers (2021-11-19T17:36:01Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components: two single-modal encoders, a cross encoder, and a decoder, all built on the Transformer backbone (an illustrative sketch of this layout is given below).
We develop two pre-training strategies, stage-by-stage pre-training (StagedP) and enhanced video representation (EnhancedV), to make the training process of UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
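
To visualize the four-component layout described in the UniVL entry above, here is a hedged PyTorch skeleton. The class name UnifiedVideoLanguageModel, the layer counts, and the vocabulary size are illustrative assumptions, not UniVL's actual implementation.

import torch
import torch.nn as nn

class UnifiedVideoLanguageModel(nn.Module):
    """Skeleton of a UniVL-style model: two single-modal encoders, a cross encoder, a decoder."""
    def __init__(self, dim: int = 768, heads: int = 12, vocab_size: int = 30522):
        super().__init__()
        def encoder(num_layers):
            layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers)
        self.video_encoder = encoder(6)    # single-modal video encoder
        self.text_encoder = encoder(6)     # single-modal text encoder
        self.cross_encoder = encoder(2)    # joint encoder over concatenated video/text tokens
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)   # generation head (e.g. captioning)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, video_feats, text_embeds, target_embeds):
        v = self.video_encoder(video_feats)                         # (B, Tv, D)
        t = self.text_encoder(text_embeds)                          # (B, Tt, D)
        joint = self.cross_encoder(torch.cat([v, t], dim=1))        # understanding: fused representation
        logits = self.lm_head(self.decoder(target_embeds, joint))   # generation: decode conditioned on the fusion
        return v, t, joint, logits

model = UnifiedVideoLanguageModel()
v, t = torch.randn(2, 16, 768), torch.randn(2, 20, 768)    # video features and text token embeddings
y = torch.randn(2, 10, 768)                                # shifted target embeddings for the decoder
_, _, joint, logits = model(v, t, y)                       # logits: (2, 10, vocab_size)

Keeping the single-modal encoders separate lets retrieval stop at v and t while understanding and generation tasks use the cross encoder and decoder; EgoVLPv2 pushes this reuse motif further by moving the fusion into the backbones themselves.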