Inverted Pyramid Multi-task Transformer for Dense Scene Understanding
- URL: http://arxiv.org/abs/2203.07997v1
- Date: Tue, 15 Mar 2022 15:29:08 GMT
- Title: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding
- Authors: Hanrong Ye and Dan Xu
- Abstract summary: We propose a novel end-to-end Inverted Pyramid multi-task Transformer (InvPT) to perform simultaneous modeling of spatial positions and multiple tasks in a unified framework.
InvPT presents an efficient UP-Transformer block to learn multi-task feature interaction at gradually increased resolutions.
Our method achieves superior multi-task performance on both the NYUD-v2 and PASCAL-Context datasets and significantly outperforms previous state-of-the-art methods.
- Score: 11.608682595506354
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-task dense scene understanding is a thriving research domain that requires simultaneous perception and reasoning on a series of correlated tasks with pixel-wise prediction. Because they rely heavily on convolution operations, most existing works are limited to modeling local context, whereas learning interactions and performing inference in a global spatial and multi-task context is critical for this problem. In this paper, we propose a novel end-to-end Inverted Pyramid multi-task Transformer (InvPT) to perform simultaneous modeling of spatial positions and multiple tasks in a unified framework. To the best of our knowledge, this is the first work to explore designing a transformer structure for multi-task dense prediction in scene understanding. Moreover, while a higher spatial resolution is widely demonstrated to be remarkably beneficial for dense prediction, it is very challenging for existing transformers to go deeper at higher resolutions because the cost of self-attention grows rapidly with spatial size. InvPT presents an efficient UP-Transformer block that learns multi-task feature interaction at gradually increased resolutions, and also incorporates effective self-attention message passing and multi-scale feature aggregation to produce task-specific predictions at high resolution. Our method achieves superior multi-task performance on the NYUD-v2 and PASCAL-Context datasets and significantly outperforms the previous state of the art. Code and trained models will be publicly available.
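To make the inverted-pyramid idea concrete, here is a minimal PyTorch-style sketch of what one UP-Transformer-like stage could look like: feature maps from all tasks are flattened into a single token sequence, mixed with global self-attention so every spatial position of every task can interact, and then upsampled before the next stage. The class name, dimensions, and upsampling scheme are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of an UP-Transformer-style stage (not the authors' code).
# Tokens from all tasks attend to each other jointly, then each task's map is
# spatially upsampled so the next stage operates at a higher resolution.
import torch
import torch.nn as nn

class UpTransformerStage(nn.Module):
    def __init__(self, dim: int, num_heads: int = 4, num_tasks: int = 3):
        super().__init__()
        self.num_tasks = num_tasks
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Assumed inter-stage transition: 2x spatial upsampling with
        # channel reduction, so resolution grows as depth increases.
        self.up = nn.Sequential(
            nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
            nn.Conv2d(dim, dim // 2, kernel_size=3, padding=1),
        )

    def forward(self, feats):
        # feats: one (B, C, H, W) map per task, all at the same resolution.
        b, c, h, w = feats[0].shape
        tokens = torch.cat([f.flatten(2).transpose(1, 2) for f in feats], dim=1)
        # Joint self-attention: cross-task, globally spatial message passing.
        q = self.norm(tokens)
        mixed = tokens + self.attn(q, q, q, need_weights=False)[0]
        outs = []
        for t in range(self.num_tasks):
            chunk = mixed[:, t * h * w:(t + 1) * h * w, :]
            outs.append(self.up(chunk.transpose(1, 2).reshape(b, c, h, w)))
        return outs

# Usage: three task feature maps at 8x8 become 16x16 with halved channels.
stage = UpTransformerStage(dim=64)
outs = stage([torch.randn(2, 64, 8, 8) for _ in range(3)])
print(outs[0].shape)  # torch.Size([2, 32, 16, 16])
```

Stacking such stages yields the inverted pyramid: attention is cheap at the low-resolution input and the maps grow toward full-resolution task predictions.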
Related papers
- A Multitask Deep Learning Model for Classification and Regression of Hyperspectral Images: Application to the large-scale dataset [44.94304541427113]
We propose a multitask deep learning model to perform multiple classification and regression tasks simultaneously on hyperspectral images.
We validated our approach on a large hyperspectral dataset called TAIGA.
A comprehensive qualitative and quantitative analysis of the results shows that the proposed method significantly outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-23T11:14:54Z) - RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z) - Task Indicating Transformer for Task-conditional Dense Predictions [16.92067246179703]
We introduce a novel task-conditional framework called Task Indicating Transformer (TIT) to tackle this challenge.
Our approach designs a Mix Task Adapter module within the transformer block, which incorporates a Task Indicating Matrix through matrix decomposition.
We also propose a Task Gate Decoder module that harnesses a Task Indicating Vector and gating mechanism to facilitate adaptive multi-scale feature refinement.
arXiv Detail & Related papers (2024-03-01T07:06:57Z) - Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z) - ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer [91.43066633305662]
We propose a novel ComPlementary transformer, ComPtr, for diverse bi-source dense prediction tasks.
ComPtr treats different inputs equally and builds an efficient dense interaction model in the form of sequence-to-sequence on top of the transformer.
arXiv Detail & Related papers (2023-07-23T15:17:45Z) - InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding [11.608682595506354]
Multi-task scene understanding aims to design models that can simultaneously predict several scene understanding tasks with one versatile model.
Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions.
We propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context.
arXiv Detail & Related papers (2023-06-08T00:28:22Z) - MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads (a generic sketch of this shared-encoder, per-task-head pattern follows this list).
arXiv Detail & Related papers (2022-05-17T13:03:18Z) - SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction [33.29925021875922]
We propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP) for dense image prediction, consisting of Intra-level Semantic Promotion (ISP), Cross-level Decoupled Interaction (CDI), and an Attention Refinement Function (ARF).
ISP explores the semantic diversity in different receptive spaces. CDI builds global attention and interaction among different levels in a decoupled space, which also alleviates the problem of heavy computation.
Experimental results demonstrate the validity and generality of the proposed method, which outperforms the state-of-the-art by a significant margin in dense image prediction tasks.
arXiv Detail & Related papers (2021-09-18T16:29:14Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - UPDeT: Universal Multi-agent Reinforcement Learning via Policy Decoupling with Transformers [108.92194081987967]
We make the first attempt to explore a universal multi-agent reinforcement learning pipeline, designing a single architecture that can fit different tasks.
Unlike previous RNN-based models, we utilize a transformer-based model to generate a flexible policy.
The proposed model, named Universal Policy Decoupling Transformer (UPDeT), further relaxes the action restriction and makes the multi-agent task's decision process more explainable.
arXiv Detail & Related papers (2021-01-20T07:24:24Z)
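Several entries above, most explicitly MulT, share a common skeleton: one shared encoder pass produces a representation from which lightweight task-specific heads decode dense predictions. The sketch below illustrates that generic pattern only; every name, shape, and task choice is a placeholder assumption rather than any listed paper's actual code.

```python
# Generic shared-encoder / task-specific-heads skeleton (illustrative only).
import torch
import torch.nn as nn

class MultiTaskDenseModel(nn.Module):
    def __init__(self, dim: int = 64, task_channels=None):
        super().__init__()
        # Hypothetical tasks and output channel counts (placeholders).
        task_channels = task_channels or {"semseg": 40, "depth": 1, "normals": 3}
        # Shared representation: the image is encoded once for all tasks.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        # One lightweight decoder head per task, restoring input resolution.
        self.heads = nn.ModuleDict({
            name: nn.Sequential(
                nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
                nn.Conv2d(dim, out_ch, kernel_size=1),
            )
            for name, out_ch in task_channels.items()
        })

    def forward(self, x):
        shared = self.encoder(x)  # single forward pass shared by every task
        return {name: head(shared) for name, head in self.heads.items()}

# Usage: one call yields all dense predictions at input resolution.
model = MultiTaskDenseModel()
preds = model(torch.randn(2, 3, 64, 64))
print({k: tuple(v.shape) for k, v in preds.items()})
# {'semseg': (2, 40, 64, 64), 'depth': (2, 1, 64, 64), 'normals': (2, 3, 64, 64)}
```

What distinguishes the papers above is how (and where) the tasks interact: InvPT and InvPT++ insert global cross-task attention, TIT conditions the shared features on a task indicator, and MulT replaces the convolutional heads with transformer decoders.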