VLM: Task-agnostic Video-Language Model Pre-training for Video
Understanding
- URL: http://arxiv.org/abs/2105.09996v1
- Date: Thu, 20 May 2021 19:13:27 GMT
- Title: VLM: Task-agnostic Video-Language Model Pre-training for Video
Understanding
- Authors: Hu Xu, Gargi Ghosh, Po-Yao Huang, Prahal Arora, Masoumeh Aminzadeh,
Christoph Feichtenhofer, Florian Metze, Luke Zettlemoyer
- Abstract summary: We present a task-agnostic multi-modal pre-training approach that can accept either video or text input, or both for a variety of end tasks.
Experimental results show strong performance across a wider range of tasks than any previous method, often outperforming task-specific pre-training.
- Score: 78.28397557433544
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a simplified, task-agnostic multi-modal pre-training approach that
can accept either video or text input, or both for a variety of end tasks.
Existing pre-training approaches are task-specific: they adopt either a single
cross-modal encoder that requires both modalities, which limits their use for
retrieval-style end tasks, or more complex multitask learning with two unimodal
encoders, which limits early cross-modal fusion. We instead introduce new pretraining masking
schemes that better mix across modalities (e.g. by forcing masks for text to
predict the closest video embeddings) while also maintaining separability (e.g.
unimodal predictions are sometimes required, without using all the input).
Experimental results show strong performance across a wider range of tasks than
any previous method, often outperforming task-specific pre-training.
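The masking schemes are the paper's central technical idea, so a small sketch may help make them concrete. The snippet below is a minimal, illustrative PyTorch version of the two behaviours described in the abstract: masked text positions are targeted at the closest video-clip embedding (mixing modalities), while a separate unimodal pass masks text without any video input (preserving separability). Tensor shapes, the MASK_ID constant, and the function names are assumptions for illustration; this is not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

# Illustrative sketch only (assumed shapes and names, not the released code):
#   text_ids:  (batch, text_len)      integer token ids
#   video_emb: (batch, clips, dim)    pre-extracted video clip embeddings
#   emb_table: (vocab, dim)           the text token embedding table

MASK_ID = 0  # hypothetical [MASK] token id


def cross_modal_masking(text_ids, emb_table, video_emb, mask_prob=0.15):
    """Mask random text tokens; the target for each masked position is the
    video clip embedding closest to that token's original embedding, so the
    encoder must mix the two modalities to reconstruct the signal."""
    mask = torch.rand(text_ids.shape, device=text_ids.device) < mask_prob
    masked_ids = text_ids.masked_fill(mask, MASK_ID)

    tok = F.normalize(emb_table[text_ids], dim=-1)        # (B, T, D)
    vid = F.normalize(video_emb, dim=-1)                  # (B, C, D)
    nearest = (tok @ vid.transpose(1, 2)).argmax(dim=-1)  # (B, T) clip index
    targets = torch.gather(
        video_emb, 1,
        nearest.unsqueeze(-1).expand(-1, -1, video_emb.size(-1)),
    )                                                     # (B, T, D)
    return masked_ids, mask, targets


def unimodal_masking(text_ids, mask_prob=0.15):
    """Plain masked language modelling on text alone (no video input), which
    keeps the shared encoder usable for single-modality end tasks."""
    mask = torch.rand(text_ids.shape, device=text_ids.device) < mask_prob
    return text_ids.masked_fill(mask, MASK_ID), mask
```

In a full model, the masked positions would be fed through a shared encoder and the returned targets would enter a regression or contrastive loss; the exact objective and encoder are assumptions here, not details given in the abstract.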
Related papers
- VIMI: Grounding Video Generation through Multi-modal Instruction [89.90065445082442]
Existing text-to-video diffusion models rely solely on text-only encoders for their pretraining.
We construct a large-scale multimodal prompt dataset by employing retrieval methods to pair in-context examples with the given text prompts.
We finetune the model from the first stage on three video generation tasks, incorporating multi-modal instructions.
arXiv Detail & Related papers (2024-07-08T18:12:49Z) - MultiGPrompt for Multi-Task Pre-Training and Prompting on Graphs [33.2696184519275]
MultiGPrompt is a novel multi-task pre-training and prompting framework for graph representation learning.
We propose a dual-prompt mechanism consisting of composed and open prompts to leverage task-specific and global pre-training knowledge.
arXiv Detail & Related papers (2023-11-28T02:36:53Z) - TransPrompt v2: A Transferable Prompting Framework for Cross-task Text
Classification [37.824031151922604]
We propose TransPrompt v2, a novel transferable prompting framework for few-shot learning across similar or distant text classification tasks.
For learning across similar tasks, we employ a multi-task meta-knowledge acquisition (MMA) procedure to train a meta-learner.
For learning across distant tasks, we inject the task type descriptions into the prompt, and capture the intra-type and inter-type prompt embeddings.
arXiv Detail & Related papers (2023-08-29T04:16:57Z) - Learning Easily Updated General Purpose Text Representations with
Adaptable Task-Specific Prefixes [22.661527526471996]
Fine-tuning a large pre-trained language model for each downstream task incurs a significant computational burden.
We propose a prefix-based method to learn fixed text representations from source tasks.
arXiv Detail & Related papers (2023-05-22T21:31:03Z) - OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist
Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z) - Vision-Language Pre-Training for Multimodal Aspect-Based Sentiment
Analysis [25.482853330324748]
Multimodal Aspect-Based Sentiment Analysis (MABSA) has attracted increasing attention in recent years.
Previous approaches either (i) use separately pre-trained visual and textual models, which ignore cross-modal alignment, or (ii) use vision-language models pre-trained with general pre-training tasks.
We propose a task-specific Vision-Language Pre-training framework for MABSA (VLP-MABSA), which is a unified multimodal encoder-decoder architecture for all the pre-training and downstream tasks.
arXiv Detail & Related papers (2022-04-17T08:44:00Z) - Unified Multimodal Pre-training and Prompt-based Tuning for
Vision-Language Understanding and Generation [86.26522210882699]
We propose a unified multimodal pre-training approach for both vision-language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z) - Multi-Task Learning with Sequence-Conditioned Transporter Networks [67.57293592529517]
We aim to solve multi-task learning through the lens of sequence-conditioning and weighted sampling.
First, we propose MultiRavens, a new suite of benchmarks aimed at compositional tasks, which allows defining custom task combinations.
Second, we propose a vision-based end-to-end system architecture, Sequence-Conditioned Transporter Networks, which augments Goal-Conditioned Transporter Networks with sequence-conditioning and weighted sampling.
arXiv Detail & Related papers (2021-09-15T21:19:11Z) - Temporally Correlated Task Scheduling for Sequence Learning [143.70523777803723]
In many applications, a sequence learning task is usually associated with multiple temporally correlated auxiliary tasks.
We introduce a learnable scheduler to sequence learning, which can adaptively select auxiliary tasks for training.
Our method significantly improves the performance of simultaneous machine translation and stock trend forecasting.
arXiv Detail & Related papers (2020-07-10T10:28:54Z)