Advancing High-Resolution Video-Language Representation with Large-Scale
Video Transcriptions
- URL: http://arxiv.org/abs/2111.10337v1
- Date: Fri, 19 Nov 2021 17:36:01 GMT
- Title: Advancing High-Resolution Video-Language Representation with Large-Scale
Video Transcriptions
- Authors: Hongwei Xue, Tiankai Hang, Yanhong Zeng, Yuchong Sun, Bei Liu, Huan
Yang, Jianlong Fu, Baining Guo
- Abstract summary: We study joint video and language (VL) pre-training to enable cross-modality learning and benefit plentiful downstream tasks.
Our model achieves new state-of-the-art results in 10 VL understanding tasks and 2 more novel text-to-visual generation tasks.
- Score: 31.4943447481144
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study joint video and language (VL) pre-training to enable cross-modality
learning and benefit plentiful downstream VL tasks. Existing works either
extract low-quality video features or learn limited text embedding, while
neglecting that high-resolution videos and diversified semantics can
significantly improve cross-modality learning. In this paper, we propose a
novel High-resolution and Diversified VIdeo-LAnguage pre-training model
(HD-VILA) for many visual tasks. In particular, we collect a large dataset with
two distinct properties: 1) the first high-resolution dataset including 371.5k
hours of 720p videos, and 2) the most diversified dataset covering 15 popular
YouTube categories. To enable VL pre-training, we jointly optimize the HD-VILA
model by a hybrid Transformer that learns rich spatiotemporal features, and a
multimodal Transformer that enforces interactions of the learned video features
with diversified texts. Our pre-training model achieves new state-of-the-art
results in 10 VL understanding tasks and 2 more novel text-to-visual generation
tasks. For example, we outperform SOTA models with relative increases of 38.5%
R@1 in the zero-shot MSR-VTT text-to-video retrieval task, and 53.6% on the
high-resolution LSMDC dataset. The learned VL embedding is also effective in
generating visually pleasing and semantically relevant results in
text-to-visual manipulation and super-resolution tasks.
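The headline R@1 metric measures how often the correct video is the top-ranked match for a text query under cosine similarity of the learned embeddings. A minimal evaluation sketch (the function name and embedding setup are illustrative, not taken from the paper):

```python
import numpy as np

def recall_at_k(text_emb: np.ndarray, video_emb: np.ndarray, k: int = 1) -> float:
    """Recall@K for text-to-video retrieval.

    Row i of each matrix is assumed to hold the embedding of the i-th
    text/video pair, so the ground-truth match for text i is video i.
    """
    # L2-normalize so the dot product equals cosine similarity.
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    sims = t @ v.T                                   # (n_texts, n_videos)
    # Rank videos for each text query, highest similarity first.
    top_k = np.argsort(-sims, axis=1)[:, :k]
    hits = (top_k == np.arange(len(t))[:, None]).any(axis=1)
    return float(hits.mean())
```

A relative increase of 38.5% R@1 means this fraction improves by that factor over the prior SOTA, not by 38.5 absolute points.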
Related papers
- Video Instruction Tuning With Synthetic Data [84.64519990333406]
We create a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K.
This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA.
By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM.
arXiv Detail & Related papers (2024-10-03T17:36:49Z)
- PLLaVA : Parameter-free LLaVA Extension from Images to Videos for Video Dense Captioning [78.23573511641548]
Vision-language pre-training has significantly elevated performance across a wide range of image-language applications.
Yet, the pre-training process for video-related tasks demands exceptionally large computational and data resources.
This paper investigates a straight-forward, highly efficient, and resource-light approach to adapting an existing image-language pre-trained model for video understanding.
arXiv Detail & Related papers (2024-04-25T19:29:55Z)
- E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer [5.7254320553764]
E-ViLM is able to learn expressive representations from Video-Language corpus and generalize well to extensive Video-Language tasks.
Our model reaches $39.3$% Top-$1$ accuracy on the MSRVTT benchmark, retaining $91.4$% of the accuracy of state-of-the-art larger VL architecture.
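E-ViLM's semantic vector-quantized tokenizer discretizes continuous video features by nearest-neighbor lookup in a learned codebook. A generic sketch of that lookup step (names and shapes are assumptions; in the paper the codebook is learned end-to-end):

```python
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map continuous features to discrete token ids.

    features: (n, d) patch/frame features; codebook: (K, d) learned vectors.
    Returns, for each feature, the index of the closest codebook vector
    under squared Euclidean distance.
    """
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2, expanded for vectorization.
    d2 = (
        (features ** 2).sum(axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + (codebook ** 2).sum(axis=1)
    )
    return d2.argmin(axis=1)
```

The resulting token ids serve as discrete prediction targets for masked video modeling.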
arXiv Detail & Related papers (2023-11-28T22:57:17Z)
- VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending [78.1399386935455]
Large-scale image-text contrastive pre-training models, such as CLIP, have been demonstrated to effectively learn high-quality multimodal representations.
We propose a novel video-text pre-training method dubbed VLAB: Video Language pre-training by feature Adapting and Blending.
VLAB transfers CLIP representations to video pre-training tasks and develops unified video multimodal models for a wide range of video-text tasks.
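CLIP-style models such as the one VLAB builds on are trained with a symmetric contrastive (InfoNCE) objective over matched image/text or video/text pairs. A minimal NumPy sketch, assuming paired rows and a fixed temperature (both are simplifications of the actual training setup):

```python
import numpy as np

def clip_style_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss used in CLIP-style contrastive pre-training.

    Matched video/text pairs share a row index; all other pairs in the
    batch serve as negatives.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (v @ t.T) / temperature        # (n, n) similarity matrix
    labels = np.arange(len(v))

    def xent(lg):
        lg = lg - lg.max(axis=1, keepdims=True)   # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the video-to-text and text-to-video directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Minimizing this loss pulls each pair's embeddings together while pushing apart all mismatched pairs in the batch.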
arXiv Detail & Related papers (2023-05-22T15:54:22Z)
- Learning Video Representations from Large Language Models [31.11998135196614]
We introduce LaViLa, a new approach to learning video-language representations by leveraging Large Language Models (LLMs)
We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators.
Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text.
arXiv Detail & Related papers (2022-12-08T18:59:59Z)
- Enabling Multimodal Generation on CLIP via Vision-Language Knowledge Distillation [79.72299298976525]
We propose to augment a vision-language pre-training model with a textual pre-trained language model (PLM) via vision-language knowledge distillation (VLKD)
Experiments show that the resulting model has strong zero-shot performance on multimodal generation tasks, such as open-ended visual question answering and image captioning.
The original textual language understanding and generation ability of the PLM is maintained after VLKD, which makes our model versatile for both multimodal and unimodal tasks.
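VLKD transfers knowledge between the vision-language model and the PLM; the paper's exact objective aligns representations across models. As a simplified stand-in, classic logit-level distillation with temperature-softened teacher targets looks like:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over temperature-softened distributions,
    the standard objective for logit-level knowledge distillation."""
    p = softmax(teacher_logits, T)          # soft teacher targets
    log_q = np.log(softmax(student_logits, T))
    # The T^2 factor rescales gradients to match the hard-label loss.
    return (T ** 2) * (p * (np.log(p) - log_q)).sum(axis=-1).mean()
```

The higher the temperature T, the more the teacher's secondary class probabilities ("dark knowledge") influence the student.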
arXiv Detail & Related papers (2022-03-12T09:33:37Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study for advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- UniVL: A Unified Video and Language Pre-Training Model for Multimodal Understanding and Generation [76.12027504427708]
This paper proposes UniVL: a Unified Video and Language pre-training model for both multimodal understanding and generation.
It comprises four components, including two single-modal encoders, a cross encoder, and a decoder with the Transformer backbone.
We develop two pre-training strategies, stage by stage pre-training (StagedP) and enhanced video representation (EnhancedV) to make the training process of the UniVL more effective.
arXiv Detail & Related papers (2020-02-15T10:03:25Z)
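UniVL's four components can be pictured as a simple dataflow: the two single-modal encoders feed the cross encoder, whose fused output drives the decoder. In the toy sketch below each component is collapsed to a single linear map purely to show the wiring (all names and shapes are invented, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16

# Stand-ins for the four UniVL components, each reduced to one matrix.
W_video, W_text, W_cross, W_dec = (rng.normal(size=(d, d)) for _ in range(4))

def video_encoder(frames):        # single-modal encoder 1
    return frames @ W_video

def text_encoder(tokens):         # single-modal encoder 2
    return tokens @ W_text

def cross_encoder(v, t):          # fuses the two modalities
    return np.concatenate([v, t], axis=0) @ W_cross

def decoder(fused):               # generates from the fused representation
    return fused @ W_dec

frames = rng.normal(size=(8, d))  # 8 video-frame features
tokens = rng.normal(size=(5, d))  # 5 text-token features
out = decoder(cross_encoder(video_encoder(frames), text_encoder(tokens)))
```

In the actual model each map is a Transformer stack, and the decoder supports generation tasks such as captioning.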
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.