Unifying Architectures, Tasks, and Modalities Through a Simple
Sequence-to-Sequence Learning Framework
- URL: http://arxiv.org/abs/2202.03052v1
- Date: Mon, 7 Feb 2022 10:38:21 GMT
- Title: Unifying Architectures, Tasks, and Modalities Through a Simple
Sequence-to-Sequence Learning Framework
- Authors: Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li,
Jianxin Ma, Chang Zhou, Jingren Zhou, Hongxia Yang
- Abstract summary: We propose OFA, a unified multimodal pretrained model that unifies modalities (i.e., cross-modality, vision, language) and tasks (e.g., image generation, visual grounding, image captioning, image classification, text generation, etc.).
OFA achieves new state-of-the-arts on a series of multimodal tasks, including image captioning (COCO test CIDEr: 149.6), text-to-image generation (COCO test FID: 10.5), VQA (test-std acc.: 80.02), and SNLI-VE (test acc.: 90.20).
- Score: 83.82026345508334
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this work, we pursue a unified paradigm for multimodal pretraining to
break the scaffolds of complex task/modality-specific customization. We propose
OFA, a unified multimodal pretrained model that unifies modalities (i.e.,
cross-modality, vision, language) and tasks (e.g., image generation, visual
grounding, image captioning, image classification, text generation, etc.) to a
simple sequence-to-sequence learning framework based on the encoder-decoder
architecture. OFA performs pretraining and finetuning with task instructions
and introduces no extra task-specific layers for finetuning. Experimental
results show that OFA achieves new state-of-the-arts on a series of multimodal
tasks, including image captioning (COCO test CIDEr: 149.6), text-to-image
generation (COCO test FID: 10.5), VQA (test-std acc.: 80.02), SNLI-VE (test
acc.: 90.20), and referring expression comprehension (RefCOCO / RefCOCO+ /
RefCOCOg test acc.: 92.93 / 90.10 / 85.20). Through extensive analyses, we
demonstrate that OFA reaches comparable performance with uni-modal pretrained
models (e.g., BERT, MAE, MoCo v3, SimCLR v2, etc.) in uni-modal tasks,
including NLU, NLG, and image classification, and it effectively transfers to
unseen tasks and domains. Code shall be released soon at
http://github.com/OFA-Sys/OFA
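
To make the unified interface concrete, here is a minimal, self-contained sketch (not the released OFA code) of the idea the abstract describes: every task is posed as an instruction sequence, optionally concatenated with image features, and one encoder-decoder produces the answer tokens with no task-specific heads. The toy vocabulary, model sizes, and instruction wording below are illustrative assumptions.

```python
import torch
import torch.nn as nn

VOCAB = {"<pad>": 0, "<bos>": 1, "<eos>": 2}  # toy shared vocabulary (in OFA it also covers location and image-code tokens)

def tok(text):
    """Map whitespace-separated words to ids, growing the toy vocabulary as needed."""
    return [VOCAB.setdefault(w, len(VOCAB)) for w in text.lower().split()]

class UnifiedSeq2Seq(nn.Module):
    """One encoder-decoder shared by all tasks; no task-specific heads are added."""
    def __init__(self, vocab_size=1000, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.patch_proj = nn.Linear(768, d_model)  # projects image patch features into the token space
        self.transformer = nn.Transformer(d_model, num_encoder_layers=3,
                                          num_decoder_layers=3, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, instruction_ids, image_patches, target_ids):
        # Encoder input = embedded instruction tokens concatenated with projected image patches.
        src = torch.cat([self.embed(instruction_ids), self.patch_proj(image_patches)], dim=1)
        out = self.transformer(src, self.embed(target_ids))
        return self.lm_head(out)  # logits over the shared vocabulary

model = UnifiedSeq2Seq()
image = torch.randn(1, 196, 768)  # e.g. 14x14 patch features from a vision backbone

# The same weights serve every task; only the natural-language instruction changes.
tasks = {
    "captioning": "what does the image describe ?",
    "vqa":        "what color is the car in the image ?",
    "grounding":  "which region does the text ' red car ' describe ?",
}
for name, instruction in tasks.items():
    ids = torch.tensor([tok(instruction)])
    logits = model(ids, image, torch.tensor([[VOCAB["<bos>"]]]))
    print(name, logits.shape)  # (1, 1, 1000): next-token prediction for the answer
```

In the full model the decoder is run autoregressively, and location and image tokens share this output space as well, which is how referring expression comprehension and image generation fit the same sequence-to-sequence framework.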
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
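A rough sketch of what patch-level mutual attention between support and query samples can look like (an illustration under assumed shapes, not the paper's exact method): both images are split into patch tokens and each set attends to the other, so episode-specific features are emphasized before comparison.

```python
import torch
import torch.nn as nn

dim = 256
attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

support = torch.randn(1, 196, dim)  # 14x14 patch tokens from a support image (assumed shapes)
query = torch.randn(1, 196, dim)    # patch tokens from a query image

# Query patches attend to support patches, and vice versa ("mutual" attention).
query_enriched, _ = attn(query, support, support)
support_enriched, _ = attn(support, query, query)

# Episode-specific comparison, e.g. cosine similarity of pooled, enriched features.
sim = nn.functional.cosine_similarity(query_enriched.mean(1), support_enriched.mean(1))
print("support/query similarity:", sim.item())
```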
arXiv Detail & Related papers (2024-05-06T02:02:57Z)
- Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning [23.846476546733406]
In-context learning provides a new perspective for multi-task modeling for vision and NLP.
We propose Skeleton-in-Context, an effective framework for in-context skeleton sequence modeling.
Our model achieves state-of-the-art multi-task performance and even outperforms single-task methods on certain tasks.
arXiv Detail & Related papers (2023-12-06T18:59:44Z)
- MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks [59.09343552273045]
We propose a decoder-only model for multimodal tasks, which is surprisingly effective at jointly learning these disparate vision-language tasks.
We demonstrate that joint learning of these diverse objectives is simple, effective, and maximizes the weight-sharing of the model across these tasks.
Our model achieves the state of the art on image-text and text-image retrieval, video question answering and open-vocabulary detection tasks, outperforming much larger and more extensively trained foundational models.
arXiv Detail & Related papers (2023-03-29T16:42:30Z)
- OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models [72.8156832931841]
Generalist models are capable of performing diverse multi-modal tasks in a task-agnostic way within a single model.
We release a generalist model learning system, OFASys, built on top of a declarative task interface named multi-modal instruction.
arXiv Detail & Related papers (2022-12-08T17:07:09Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate it as a few-shot reinforcement learning problem where a task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
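As a rough illustration of the subtask-graph abstraction mentioned above (a sketch under my own assumptions, not the paper's implementation): each task is a set of subtasks with precondition edges, and an agent may only execute a subtask once its preconditions are complete. The web-navigation-style subtask names below are invented for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class SubtaskGraph:
    preconditions: dict  # subtask name -> set of subtasks it depends on
    completed: set = field(default_factory=set)

    def eligible(self):
        """Subtasks whose preconditions are all satisfied and which are not yet done."""
        return [s for s, pre in self.preconditions.items()
                if pre <= self.completed and s not in self.completed]

    def execute(self, subtask):
        assert subtask in self.eligible(), f"{subtask} is not eligible yet"
        self.completed.add(subtask)

# Example task: log in before searching, search before checkout (invented names).
graph = SubtaskGraph({"login": set(), "search": {"login"}, "checkout": {"search"}})
while graph.eligible():
    nxt = graph.eligible()[0]
    graph.execute(nxt)
    print("executed:", nxt)
```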
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Grad2Task: Improved Few-shot Text Classification Using Gradients for Task Representation [24.488427641442694]
We propose a novel conditional neural process-based approach for few-shot text classification.
Our key idea is to represent each task using gradient information from a base model.
Our approach outperforms traditional fine-tuning, sequential transfer learning, and state-of-the-art meta learning approaches.
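A hedged sketch of the gradient-as-task-representation idea (my reading of the summary, not the authors' code): run a task's support set through a base model and use statistics of the resulting gradients as that task's embedding. The stand-in base network and the choice of per-parameter gradient norms are assumptions.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained text encoder plus classification head (assumption for illustration).
base = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2))

def task_embedding(support_x, support_y):
    """Characterize one few-shot task by gradient statistics of the base model on its support set."""
    loss = nn.functional.cross_entropy(base(support_x), support_y)
    grads = torch.autograd.grad(loss, list(base.parameters()))
    # Per-parameter gradient norms give a compact task signature (an assumed, simplified choice).
    return torch.stack([g.norm() for g in grads])

# Two toy "tasks", each with an 8-example support set of 128-dimensional features.
for task_id in range(2):
    x, y = torch.randn(8, 128), torch.randint(0, 2, (8,))
    print(f"task {task_id} embedding:", task_embedding(x, y))
```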
arXiv Detail & Related papers (2022-01-27T15:29:30Z)
- Unified Multimodal Pre-training and Prompt-based Tuning for Vision-Language Understanding and Generation [86.26522210882699]
We propose Unified multimodal pre-training for both Vision-Language understanding and generation.
The proposed UniVL is capable of handling both understanding tasks and generative tasks.
Our experiments show that there is a trade-off between understanding tasks and generation tasks while using the same model.
arXiv Detail & Related papers (2021-12-10T14:59:06Z)
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks [73.63892022944198]
We present a generic perception architecture named Uni-Perceiver.
It processes a variety of modalities and tasks with unified modeling and shared parameters.
Results show that our pre-trained model without any tuning can achieve reasonable performance even on novel tasks.
arXiv Detail & Related papers (2021-12-02T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.