Inverted Pyramid Multi-task Transformer for Dense Scene Understanding
- URL: http://arxiv.org/abs/2203.07997v1
- Date: Tue, 15 Mar 2022 15:29:08 GMT
- Title: Inverted Pyramid Multi-task Transformer for Dense Scene Understanding
- Authors: Hanrong Ye and Dan Xu
- Abstract summary: We propose a novel end-to-end Inverted Pyramid multi-task Transformer (InvPT) to perform simultaneous modeling of spatial positions and multiple tasks in a unified framework.
InvPT presents an efficient UP-Transformer block to learn multi-task feature interaction at gradually increased resolutions.
Our method achieves superior multi-task performance on the NYUD-v2 and PASCAL-Context datasets, and significantly outperforms previous state-of-the-art methods.
- Score: 11.608682595506354
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Multi-task dense scene understanding is a thriving research domain that
requires simultaneous perception and reasoning on a series of correlated tasks
with pixel-wise prediction. Most existing works are severely limited to local
modeling due to their heavy reliance on convolution operations, whereas learning
interactions and inference in a global spatial-position and multi-task context
is critical for this problem. In this paper, we propose a novel end-to-end
Inverted Pyramid multi-task Transformer (InvPT) to perform simultaneous
modeling of spatial positions and multiple tasks in a unified
framework. To the best of our knowledge, this is the first work that explores
designing a transformer structure for multi-task dense prediction for scene
understanding. Moreover, it has been widely demonstrated that a higher spatial
resolution is remarkably beneficial for dense prediction, yet it is very
challenging for existing transformers to go deeper with higher resolutions due
to the huge computational complexity incurred by large spatial sizes. InvPT presents an efficient
UP-Transformer block to learn multi-task feature interaction at gradually
increased resolutions, which also incorporates effective self-attention message
passing and multi-scale feature aggregation to produce task-specific predictions
at a high resolution. Our method achieves superior multi-task performance on the
NYUD-v2 and PASCAL-Context datasets, and significantly outperforms previous
state-of-the-art methods. Code and trained models will be publicly available.
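To make the mechanism described above concrete, below is a minimal, illustrative PyTorch sketch (not the authors' released code) of a block that upsamples per-task feature maps and then refines them with joint self-attention over all spatial positions and tasks, in the spirit of the UP-Transformer block. The class name, channel sizes, and the use of plain multi-head attention (rather than the paper's efficient attention design) are assumptions made for illustration.

```python
# Illustrative sketch only -- not the InvPT implementation.
# Idea: upsample each task's feature map (the "inverted pyramid" direction),
# flatten all maps into one token sequence, and apply joint self-attention so
# information is exchanged across both spatial positions and tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UpTransformerBlockSketch(nn.Module):
    def __init__(self, dim: int, num_tasks: int, num_heads: int = 4):
        super().__init__()
        self.num_tasks = num_tasks
        self.norm = nn.LayerNorm(dim)
        # Plain multi-head attention over the concatenated tokens of all tasks;
        # the real model uses a more efficient scheme to cope with large spatial sizes.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, task_feats: list[torch.Tensor]) -> list[torch.Tensor]:
        # task_feats: one (B, C, H, W) feature map per task, all the same size.
        # 1) Move every task's features to a higher resolution.
        ups = [F.interpolate(f, scale_factor=2, mode="bilinear", align_corners=False)
               for f in task_feats]
        b, c, h, w = ups[0].shape
        # 2) Flatten each map to tokens and concatenate along the sequence axis,
        #    so attention mixes information across tasks as well as positions.
        tokens = torch.cat([u.flatten(2).transpose(1, 2) for u in ups], dim=1)  # (B, T*H*W, C)
        normed = self.norm(tokens)
        attn_out, _ = self.attn(normed, normed, normed, need_weights=False)
        tokens = tokens + self.proj(attn_out)
        # 3) Split back into per-task high-resolution feature maps.
        chunks = tokens.chunk(self.num_tasks, dim=1)
        return [t.transpose(1, 2).reshape(b, c, h, w) for t in chunks]


if __name__ == "__main__":
    block = UpTransformerBlockSketch(dim=64, num_tasks=3)
    feats = [torch.randn(2, 64, 16, 16) for _ in range(3)]
    outs = block(feats)
    print([o.shape for o in outs])  # three (2, 64, 32, 32) maps
```

Stacking a few such blocks with increasing resolution would mirror the inverted-pyramid idea; the quadratic cost of full attention at high resolution is exactly the complexity challenge the abstract notes.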
Related papers
- LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z)
- HRVMamba: High-Resolution Visual State Space Model for Dense Prediction [60.80423207808076]
State Space Models (SSMs) with efficient hardware-aware designs have demonstrated significant potential in computer vision tasks.
These models have been constrained by three key challenges: insufficient inductive bias, long-range forgetting, and low-resolution output representation.
We introduce the Dynamic Visual State Space (DVSS) block, which employs deformable convolution to mitigate the long-range forgetting problem.
We also introduce High-Resolution Visual State Space Model (HRVMamba) based on the DVSS block, which preserves high-resolution representations throughout the entire process.
arXiv Detail & Related papers (2024-10-04T06:19:29Z)
- DeepInteraction++: Multi-Modality Interaction for Autonomous Driving [80.8837864849534]
We introduce a novel modality interaction strategy that allows individual per-modality representations to be learned and maintained throughout.
DeepInteraction++ is a multi-modal interaction framework characterized by a multi-modal representational interaction encoder and a multi-modal predictive interaction decoder.
Experiments demonstrate the superior performance of the proposed framework on both 3D object detection and end-to-end autonomous driving tasks.
arXiv Detail & Related papers (2024-08-09T14:04:21Z)
- A Multitask Deep Learning Model for Classification and Regression of Hyperspectral Images: Application to the large-scale dataset [44.94304541427113]
We propose a multitask deep learning model to perform multiple classification and regression tasks simultaneously on hyperspectral images.
We validated our approach on a large hyperspectral dataset called TAIGA.
A comprehensive qualitative and quantitative analysis of the results shows that the proposed method significantly outperforms other state-of-the-art methods.
arXiv Detail & Related papers (2024-07-23T11:14:54Z)
- RepVF: A Unified Vector Fields Representation for Multi-task 3D Perception [64.80760846124858]
This paper proposes a novel unified representation, RepVF, which harmonizes the representation of various perception tasks.
RepVF characterizes the structure of different targets in the scene through a vector field, enabling a single-head, multi-task learning model.
Building upon RepVF, we introduce RFTR, a network designed to exploit the inherent connections between different tasks.
arXiv Detail & Related papers (2024-07-15T16:25:07Z)
- Task Indicating Transformer for Task-conditional Dense Predictions [16.92067246179703]
We introduce a novel task-conditional framework called Task Indicating Transformer (TIT) to tackle this challenge.
Our approach designs a Mix Task Adapter module within the transformer block, which incorporates a Task Indicating Matrix through matrix decomposition.
We also propose a Task Gate Decoder module that harnesses a Task Indicating Vector and gating mechanism to facilitate adaptive multi-scale feature refinement.
arXiv Detail & Related papers (2024-03-01T07:06:57Z)
- ComPtr: Towards Diverse Bi-source Dense Prediction Tasks via A Simple yet General Complementary Transformer [91.43066633305662]
We propose a novel ComPlementary transformer, ComPtr, for diverse bi-source dense prediction tasks.
ComPtr treats different inputs equally and builds an efficient dense interaction model in the form of sequence-to-sequence on top of the transformer.
arXiv Detail & Related papers (2023-07-23T15:17:45Z)
- InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding [11.608682595506354]
Multi-task scene understanding aims to design models that can simultaneously predict several scene understanding tasks with one versatile model.
Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions.
We propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context.
arXiv Detail & Related papers (2023-06-08T00:28:22Z)
- MulT: An End-to-End Multitask Learning Transformer [66.52419626048115]
We propose an end-to-end Multitask Learning Transformer framework, named MulT, to simultaneously learn multiple high-level vision tasks.
Our framework encodes the input image into a shared representation and makes predictions for each vision task using task-specific transformer-based decoder heads.
arXiv Detail & Related papers (2022-05-17T13:03:18Z)
- SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction [33.29925021875922]
We propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP) for dense image prediction, consisting of Intra-level Semantic Promotion (ISP), Cross-level Decoupled Interaction (CDI) and Attention Refinement Function (ARF).
ISP explores the semantic diversity across different receptive spaces. CDI builds global attention and interaction among different levels in a decoupled space, which also alleviates the problem of heavy computation.
Experimental results demonstrate the validity and generality of the proposed method, which outperforms the state-of-the-art by a significant margin in dense image prediction tasks.
arXiv Detail & Related papers (2021-09-18T16:29:14Z)