MulT: An End-to-End Multitask Learning Transformer
- URL: http://arxiv.org/abs/2205.08303v1
- Date: Tue, 17 May 2022 13:03:18 GMT
- Title: MulT: An End-to-End Multitask Learning Transformer
- Authors: Deblina Bhattacharjee, Tong Zhang, Sabine Süsstrunk and Mathieu Salzmann
- Abstract summary: We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
- Score: 66.52419626048115
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We propose an end-to-end Multitask Learning Transformer framework, named
MulT, to simultaneously learn multiple high-level vision tasks, including depth
estimation, semantic segmentation, reshading, surface normal estimation, 2D
keypoint detection, and edge detection. Based on the Swin transformer model,
our framework encodes the input image into a shared representation and makes
predictions for each vision task using task-specific transformer-based decoder
heads. At the heart of our approach is a shared attention mechanism modeling
the dependencies across the tasks. We evaluate our model on several multitask
benchmarks, showing that our MulT framework outperforms both the state-of-the-art
multitask convolutional neural network models and all the respective single-task
transformer models. Our experiments further highlight the benefits of
sharing attention across all the tasks, and demonstrate that our MulT model is
robust and generalizes well to new domains. Our project website is at
https://ivrl.github.io/MulT/.
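The abstract describes a shared encoder feeding task-specific transformer decoder heads, with attention shared across tasks and joint end-to-end training. The sketch below is a minimal illustration of that overall structure, not the official MulT code: the Swin backbone is replaced by a toy patch embedding, the transformer decoder heads by a single shared attention layer plus per-task linear projections, and the task names, dimensions, and losses are placeholders chosen for readability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToySharedAttentionMultitask(nn.Module):
    """Illustrative stand-in for the MulT structure, not the released model."""

    def __init__(self, tasks, dim=96, heads=4):
        super().__init__()
        self.tasks = list(tasks)
        # Placeholder for the Swin encoder: a single 4x4 patch embedding.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=4, stride=4)
        # One attention module whose output is reused by every task head,
        # standing in for attention shared across tasks.
        self.shared_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Task-specific heads (the paper uses transformer-based decoders).
        self.heads = nn.ModuleDict({t: nn.Linear(dim, 1) for t in self.tasks})

    def forward(self, image):
        feats = self.patch_embed(image)              # (B, dim, H/4, W/4)
        b, c, h, w = feats.shape
        tokens = feats.flatten(2).transpose(1, 2)    # (B, N, dim) shared representation
        attended, _ = self.shared_attn(tokens, tokens, tokens)  # shared by all tasks
        return {
            t: self.heads[t](attended).transpose(1, 2).reshape(b, 1, h, w)
            for t in self.tasks
        }

tasks = ["depth", "segmentation", "normals"]
model = ToySharedAttentionMultitask(tasks)
preds = model(torch.randn(2, 3, 64, 64))
# Joint end-to-end training: sum the per-task losses (dummy targets here).
targets = {t: torch.randn_like(preds[t]) for t in tasks}
loss = sum(F.mse_loss(preds[t], targets[t]) for t in tasks)
loss.backward()
```

Because the image is encoded once and every head consumes the same attended features, a single backward pass on the summed loss updates the shared parameters with gradients from all tasks at the same time.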
Related papers
- A Multitask Deep Learning Model for Classification and Regression of Hyperspectral Images: Application to the large-scale dataset [44.94304541427113]
We propose a multitask deep learning model to perform multiple classification and regression tasks simultaneously on hyperspectral images.
We validated our approach on a large hyperspectral dataset called TAIGA.
A comprehensive qualitative and quantitative analysis of the results shows that the proposed method significantly outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-23T11:14:54Z) - MTP: Advancing Remote Sensing Foundation Model via Multi-Task Pretraining [73.81862342673894]
Foundation models have reshaped the landscape of Remote Sensing (RS) by enhancing various image interpretation tasks.
However, transferring the pretrained models to downstream tasks may encounter a task discrepancy, because pretraining is formulated as image classification or object discrimination.
We conduct multi-task supervised pretraining on the SAMRS dataset, encompassing semantic segmentation, instance segmentation, and rotated object detection.
Our models are finetuned on various RS downstream tasks, such as scene classification, horizontal and rotated object detection, semantic segmentation, and change detection.
arXiv Detail & Related papers (2024-03-20T09:17:22Z) - GiT: Towards Generalist Vision Transformer through Universal Language Interface [94.33443158125186]
This paper proposes a simple yet effective framework, called GiT, that is simultaneously applicable to various vision tasks using only a vanilla ViT.
GiT is a multi-task visual model, jointly trained across five representative benchmarks without task-specific fine-tuning.
arXiv Detail & Related papers (2024-03-14T13:47:41Z) - Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
arXiv Detail & Related papers (2023-12-19T18:53:01Z) - Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z) - An Efficient General-Purpose Modular Vision Model via Multi-Task
Heterogeneous Training [79.78201886156513]
We present a model that can perform multiple vision tasks and can be adapted to other downstream tasks efficiently.
Our approach achieves comparable results to single-task state-of-the-art models and demonstrates strong generalization on downstream tasks.
arXiv Detail & Related papers (2023-06-29T17:59:57Z) - InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene
Understanding [11.608682595506354]
Multi-task scene understanding aims to design a single versatile model that can simultaneously predict several scene understanding tasks.
Previous studies typically process multi-task features locally, and thus cannot effectively learn spatially global and cross-task interactions.
We propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context.
arXiv Detail & Related papers (2023-06-08T00:28:22Z) - Visual Transformer for Object Detection [0.0]
We consider the use of self-attention for a discriminative visual task, object detection, as an alternative to convolutions.
Our approach leads to consistent improvements in object detection on COCO across many different models and scales.
arXiv Detail & Related papers (2022-06-01T06:13:09Z) - Generative Modeling for Multi-task Visual Learning [40.96212750592383]
We consider a novel problem of learning a shared generative model that is useful across various visual perception tasks.
We propose a general multi-task oriented generative modeling framework, by coupling a discriminative multi-task network with a generative network.
Our framework consistently outperforms state-of-the-art multi-task approaches.
arXiv Detail & Related papers (2021-06-25T03:42:59Z) - Transformer is All You Need: Multimodal Multitask Learning with a
Unified Transformer [24.870827400461682]
We propose a Unified Transformer model to simultaneously learn the most prominent tasks across different domains.
Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task.
The entire model is jointly trained end-to-end with losses from each task.
arXiv Detail & Related papers (2021-02-22T04:45:06Z)