HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training
- URL: http://arxiv.org/abs/2005.00200v2
- Date: Tue, 29 Sep 2020 20:37:17 GMT
- Title: HERO: Hierarchical Encoder for Video+Language Omni-representation
Pre-training
- Authors: Linjie Li, Yen-Chun Chen, Yu Cheng, Zhe Gan, Licheng Yu, Jingjing Liu
- Abstract summary: We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where the local context of a video frame is captured by a Cross-modal Transformer and global video context by a Temporal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
- Score: 75.55823420847759
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present HERO, a novel framework for large-scale video+language
omni-representation learning. HERO encodes multimodal inputs in a hierarchical
structure, where local context of a video frame is captured by a Cross-modal
Transformer via multimodal fusion, and global video context is captured by a
Temporal Transformer. In addition to standard Masked Language Modeling (MLM)
and Masked Frame Modeling (MFM) objectives, we design two new pre-training
tasks: (i) Video-Subtitle Matching (VSM), where the model predicts both global
and local temporal alignment; and (ii) Frame Order Modeling (FOM), where the
model predicts the right order of shuffled video frames. HERO is jointly
trained on HowTo100M and large-scale TV datasets to gain deep understanding of
complex social dynamics with multi-character interactions. Comprehensive
experiments demonstrate that HERO achieves new state of the art on multiple
benchmarks over Text-based Video/Video-moment Retrieval, Video Question
Answering (QA), Video-and-language Inference and Video Captioning tasks across
different domains. We also introduce two new challenging benchmarks How2QA and
How2R for Video QA and Retrieval, collected from diverse video content over
multimodalities.
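To make the hierarchy concrete, below is a minimal, hypothetical sketch (in PyTorch) of the two-level layout the abstract describes: a frame-level Cross-modal Transformer fuses frame features with subtitle token embeddings, a clip-level Temporal Transformer then contextualizes the fused frame embeddings across the whole video, and a Frame Order Modeling (FOM) head predicts the original index of each (possibly shuffled) frame. All module names, layer sizes, and the exact fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a hierarchical video+text encoder, assuming pre-extracted
# frame features and subtitle token embeddings of a shared dimension.
import torch
import torch.nn as nn


class HierarchicalVideoTextEncoder(nn.Module):
    def __init__(self, dim=768, n_heads=12, n_local=6, n_temporal=3, max_frames=100):
        super().__init__()
        local_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        # Local (cross-modal) transformer: fuses frames with subtitle tokens.
        self.cross_modal = nn.TransformerEncoder(local_layer, n_local)
        # Global (temporal) transformer: contextualizes fused frames over the clip.
        self.temporal = nn.TransformerEncoder(temporal_layer, n_temporal)
        self.frame_pos = nn.Embedding(max_frames, dim)
        # Frame Order Modeling head: predicts the original index of each frame.
        self.fom_head = nn.Linear(dim, max_frames)

    def forward(self, frame_feats, subtitle_embs):
        # frame_feats:   (B, T, dim) visual features, one vector per frame
        # subtitle_embs: (B, L, dim) token embeddings of the aligned subtitles
        fused = self.cross_modal(torch.cat([frame_feats, subtitle_embs], dim=1))
        frames = fused[:, : frame_feats.size(1)]            # keep the frame slots
        pos = self.frame_pos(torch.arange(frames.size(1), device=frames.device))
        video_ctx = self.temporal(frames + pos)             # global video context
        return video_ctx, self.fom_head(video_ctx)          # FOM logits per frame
```

In the paper's full pre-training setup the same backbone would also feed MLM, MFM, and VSM heads; the sketch only shows FOM (trainable with a cross-entropy loss against the unshuffled frame indices) because it is the most architecture-specific of the four objectives.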
Related papers
- InternVideo2: Scaling Foundation Models for Multimodal Video Understanding [51.129913789991924]
InternVideo2 is a new family of video foundation models (FM) that achieve state-of-the-art results in video recognition, video-speech tasks, and video-centric tasks.
Our core design is a progressive training approach that unifies masked video modeling, cross-modal contrastive learning, and next-token prediction, scaling the video encoder up to 6B parameters.
arXiv Detail & Related papers (2024-03-22T17:57:42Z) - Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization [52.63845811751936]
Video pre-training is challenging due to the difficulty of modeling video dynamics.
In this paper, we address such limitations in video pre-training with an efficient video decomposition.
Our framework is both capable of comprehending and generating image and video content, as demonstrated by its performance across 13 multimodal benchmarks.
arXiv Detail & Related papers (2024-02-05T16:30:49Z) - Self-Chained Image-Language Model for Video Localization and Question
Answering [66.86740990630433]
We propose the Self-Chained Video Localization-Answering (SeViLA) framework to tackle both temporal localization and question answering on videos.
The SeViLA framework consists of two modules, Localizer and Answerer, both of which are parameter-efficiently fine-tuned from BLIP-2.
arXiv Detail & Related papers (2023-05-11T17:23:00Z) - LAVENDER: Unifying Video-Language Understanding as Masked Language
Modeling [102.42424022921243]
Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks.
arXiv Detail & Related papers (2022-06-14T20:43:25Z) - Dense-Caption Matching and Frame-Selection Gating for Temporal
Localization in VideoQA [96.10612095576333]
We propose a video question answering model which effectively integrates multi-modal input sources and finds the temporally relevant information to answer questions.
Our model also comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass more relevant information to the classifier.
We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin.
arXiv Detail & Related papers (2020-05-13T16:35:27Z)
This list is automatically generated from the titles and abstracts of the papers on this site.