Transformer is All You Need: Multimodal Multitask Learning with a
Unified Transformer
- URL: http://arxiv.org/abs/2102.10772v1
- Date: Mon, 22 Feb 2021 04:45:06 GMT
- Title: Transformer is All You Need: Multimodal Multitask Learning with a
Unified Transformer
- Authors: Ronghang Hu, Amanpreet Singh
- Abstract summary: We propose a Unified Transformer model to simultaneously learn the most prominent tasks across different domains.
Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task with a shared decoder, followed by task-specific output heads.
The entire model is jointly trained end-to-end with losses from each task.
- Score: 24.870827400461682
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose UniT, a Unified Transformer model to simultaneously learn the most
prominent tasks across different domains, ranging from object detection to
language understanding and multimodal reasoning. Based on the transformer
encoder-decoder architecture, our UniT model encodes each input modality with
an encoder and makes predictions on each task with a shared decoder over the
encoded input representations, followed by task-specific output heads. The
entire model is jointly trained end-to-end with losses from each task. Compared
to previous efforts on multi-task learning with transformers, we share the same
model parameters across all tasks instead of separately fine-tuning task-specific
models and handle a much higher variety of tasks across different domains. In
our experiments, we learn 7 tasks jointly over 8 datasets, achieving comparable
performance to well-established prior work on each domain under the same
supervision with a compact set of model parameters. Code will be released in
MMF at https://mmf.sh.
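To make the architecture in the abstract concrete, below is a minimal PyTorch-style sketch of the pattern it describes: one encoder per input modality, a single shared transformer decoder over the concatenated encoded inputs, and task-specific output heads. The module choices, dimensions, query count, and task names are illustrative assumptions for this sketch, not the released MMF implementation.

```python
import torch
import torch.nn as nn


class UniTStyleModel(nn.Module):
    """Sketch of the pattern described in the abstract: per-modality encoders,
    a shared transformer decoder, and task-specific output heads."""

    def __init__(self, d_model=256, nhead=8, num_queries=100, task_output_dims=None):
        super().__init__()
        # Hypothetical tasks and output sizes, purely for illustration.
        task_output_dims = task_output_dims or {"vqa": 3129, "visual_entailment": 3}

        # One encoder per input modality (plain transformer encoders stand in
        # for the modality-specific encoders used in the paper).
        def make_encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)

        self.image_encoder = make_encoder()
        self.text_encoder = make_encoder()

        # Shared decoder applied over the concatenated encoded inputs.
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        self.queries = nn.Embedding(num_queries, d_model)

        # Task-specific output heads over the decoder hidden states.
        self.heads = nn.ModuleDict(
            {task: nn.Linear(d_model, dim) for task, dim in task_output_dims.items()}
        )

    def forward(self, task, image_feats=None, text_feats=None):
        # image_feats / text_feats: (batch, seq_len, d_model) projected features.
        encoded = []
        if image_feats is not None:
            encoded.append(self.image_encoder(image_feats))
        if text_feats is not None:
            encoded.append(self.text_encoder(text_feats))
        memory = torch.cat(encoded, dim=1)

        batch = memory.size(0)
        tgt = self.queries.weight.unsqueeze(0).expand(batch, -1, -1)
        decoded = self.decoder(tgt, memory)  # (batch, num_queries, d_model)
        return self.heads[task](decoded)     # task-specific predictions
```

Under this sketch, the joint end-to-end training described in the abstract amounts to sampling a batch from one task's dataset per iteration, running the corresponding head, and back-propagating that task's loss through the shared encoder and decoder parameters.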
Related papers
- Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction [126.34551436845133] (2023-08-10)
  CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
  We present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers with shared gating for multi-task learning of dense prediction.
- An Efficient General-Purpose Modular Vision Model via Multi-Task Heterogeneous Training [79.78201886156513] (2023-06-29)
  We present a model that can perform multiple vision tasks and can be adapted efficiently to other downstream tasks.
  Our approach achieves results comparable to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
- InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding [11.608682595506354] (2023-06-08)
  Multi-task scene understanding aims to design models that can simultaneously predict several scene understanding tasks with one versatile model.
  Previous studies typically process multi-task features in a more local way and thus cannot effectively learn spatially global and cross-task interactions.
  We propose an Inverted Pyramid multi-task Transformer capable of modeling cross-task interaction among the spatial features of different tasks in a global context.
- DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction [40.447092963041236] (2023-01-09)
  We present a novel MTL model that combines the merits of deformable CNNs and query-based Transformers.
  Our method, named DeMT, is based on a simple and effective encoder-decoder architecture.
  Our model uses fewer GFLOPs and significantly outperforms competitive Transformer- and CNN-based models.
- Improving Cross-task Generalization of Unified Table-to-text Models with Compositional Task Configurations [63.04466647849211] (2022-12-17)
  Existing methods typically encode task information by prefixing a simple dataset name to the encoder input.
  We propose compositional task configurations, a set of prompts prepended to the encoder input to improve cross-task generalization; a minimal sketch of this idea follows the list below.
  We show that this not only allows the model to better learn knowledge shared across tasks during training, but also lets us control the model by composing new configurations.
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115] (2022-05-17)
  We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
  Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
- VUT: Versatile UI Transformer for Multi-Modal Multi-Task User Interface Modeling [11.569380762858815] (2021-12-10)
  VUT is a Versatile UI Transformer that takes multimodal input and simultaneously accomplishes five distinct tasks with the same model.
  Our model consists of a multimodal Transformer encoder that jointly encodes UI images and structures, and that performs UI object detection when UI structures are absent from the input.
- Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks [73.63892022944198] (2021-12-02)
  We present a generic perception architecture named Uni-Perceiver.
  It processes a variety of modalities and tasks with unified modeling and shared parameters.
  Results show that our pre-trained model can achieve reasonable performance even on novel tasks without any tuning.
- PolyViT: Co-training Vision Transformers on Images, Videos and Audio [80.0913507142036] (2021-11-25)
  We present PolyViT, a model co-trained on images, audio, and video.
  By co-training different tasks on a single modality, we are able to improve the accuracy of each individual task.
  We show that co-training is simple and practical to implement.