LAVENDER: Unifying Video-Language Understanding as Masked Language
Modeling
- URL: http://arxiv.org/abs/2206.07160v1
- Date: Tue, 14 Jun 2022 20:43:25 GMT
- Title: LAVENDER: Unifying Video-Language Understanding as Masked Language
Modeling
- Authors: Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu,
Lijuan Wang
- Abstract summary: Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks.
Experiments show that this unified framework achieves competitive performance on 14 VidL benchmarks.
- Score: 102.42424022921243
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Unified vision-language frameworks have greatly advanced in recent years,
most of which adopt an encoder-decoder architecture to unify image-text tasks
as sequence-to-sequence generation. However, existing video-language (VidL)
models still require task-specific designs in model architecture and training
objectives for each task. In this work, we explore a unified VidL framework
LAVENDER, where Masked Language Modeling (MLM) is used as the common interface
for all pre-training and downstream tasks. Such unification leads to a
simplified model architecture, where only a lightweight MLM head, instead of a
decoder with many more parameters, is needed on top of the multimodal encoder.
Surprisingly, experimental results show that this unified framework achieves
competitive performance on 14 VidL benchmarks, covering video question
answering, text-to-video retrieval and video captioning. Extensive analyses
further demonstrate the advantage of LAVENDER over existing VidL methods in:
(i) supporting all downstream tasks with just a single set of parameter values
when multi-task finetuned; (ii) few-shot generalization on various downstream
tasks; and (iii) enabling zero-shot evaluation on video question answering
tasks. Code is available at https://github.com/microsoft/LAVENDER.
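To make the unified interface concrete, below is a minimal, illustrative sketch (not the released LAVENDER code) of a multimodal encoder topped with a lightweight MLM head, in which both pre-training and a downstream task such as video question answering reduce to predicting the token behind [MASK]. The module layout, hidden size, vocabulary size, and frame-feature dimension are all assumptions made for illustration.

    import torch
    import torch.nn as nn

    VOCAB_SIZE, HIDDEN = 30522, 768  # assumed BERT-like vocabulary and width

    class MaskedLMVidL(nn.Module):
        """Illustrative multimodal encoder + lightweight MLM head (not the official model)."""
        def __init__(self):
            super().__init__()
            self.text_embed = nn.Embedding(VOCAB_SIZE, HIDDEN)
            self.video_proj = nn.Linear(2048, HIDDEN)  # assumed per-frame feature size
            layer = nn.TransformerEncoderLayer(d_model=HIDDEN, nhead=12, batch_first=True)
            self.encoder = nn.TransformerEncoder(layer, num_layers=12)
            # The "lightweight MLM head": one projection back to the vocabulary,
            # instead of a full sequence-to-sequence decoder.
            self.mlm_head = nn.Linear(HIDDEN, VOCAB_SIZE)

        def forward(self, video_feats, text_ids):
            # video_feats: (B, T, 2048) frame features; text_ids: (B, L) token ids,
            # where some text positions hold the [MASK] token id.
            tokens = torch.cat([self.video_proj(video_feats),
                                self.text_embed(text_ids)], dim=1)
            hidden = self.encoder(tokens)
            # Vocabulary logits for every text position; only masked positions
            # are supervised during MLM pre-training.
            return self.mlm_head(hidden[:, video_feats.size(1):])

    # Video QA as MLM: append [MASK] to the question ("what is the man doing? [MASK]")
    # and read the predicted answer token from the logits at that position.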
Related papers
- The Surprising Effectiveness of Multimodal Large Language Models for Video Moment Retrieval [36.516226519328015]
Video-language tasks necessitate spatial and temporal comprehension and require significant compute.
This work demonstrates the surprising effectiveness of leveraging image-text pretrained MLLMs for moment retrieval.
We achieve a new state-of-the-art in moment retrieval on the widely used benchmarks Charades-STA, QVHighlights, and ActivityNet Captions.
arXiv Detail & Related papers (2024-06-26T06:59:09Z)
- VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks [89.24440488456405]
VisionLLM v2 is an end-to-end generalist multimodal large language model (MLLM).
It unifies visual perception, understanding, and generation within a single framework.
arXiv Detail & Related papers (2024-06-12T16:44:50Z)
- VideoLLM: Modeling Video Sequence with Large Language Models [70.32832021713864]
Existing video understanding models are often task-specific and lack a comprehensive capability of handling diverse tasks.
We propose a novel framework called VideoLLM that leverages the sequence reasoning capabilities of pre-trained LLMs.
VideoLLM incorporates a carefully designed Modality and Semantic Translator, which converts inputs from various modalities into a unified token sequence.
arXiv Detail & Related papers (2023-05-22T17:51:22Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- HERO: Hierarchical Encoder for Video+Language Omni-representation Pre-training [75.55823420847759]
We present HERO, a novel framework for large-scale video+language omni-representation learning.
HERO encodes multimodal inputs in a hierarchical structure, where local context of a video frame is captured by a Cross-modal Transformer.
HERO is jointly trained on HowTo100M and large-scale TV datasets to gain deep understanding of complex social dynamics with multi-character interactions.
arXiv Detail & Related papers (2020-05-01T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.