DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense
Prediction
- URL: http://arxiv.org/abs/2301.03461v2
- Date: Thu, 12 Jan 2023 03:07:54 GMT
- Title: DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense
Prediction
- Authors: Yangyang Xu and Yibo Yang and Lefei Zhang
- Abstract summary: We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer.
Our method, named DeMT, is based on a simple and effective encoder-decoder architecture.
Our model uses fewer GFLOPs and significantly outperforms current Transformer- and CNN-based competitive models.
- Score: 40.447092963041236
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolution neural networks (CNNs) and Transformers have their own advantages
and both have been widely used for dense prediction in multi-task learning
(MTL). Most of the current studies on MTL solely rely on CNN or Transformer. In
this work, we present a novel MTL model by combining both merits of deformable
CNN and query-based Transformer for multi-task learning of dense prediction.
Our method, named DeMT, is based on a simple and effective encoder-decoder
architecture (i.e., deformable mixer encoder and task-aware transformer
decoder). First, the deformable mixer encoder contains two types of operators:
the channel-aware mixing operator leveraged to allow communication among
different channels ($i.e.,$ efficient channel location mixing), and the
spatial-aware deformable operator with deformable convolution applied to
efficiently sample more informative spatial locations (i.e., deformed
features). Second, the task-aware transformer decoder consists of the task
interaction block and task query block. The former is applied to capture task
interaction features via self-attention. The latter leverages the deformed
features and task-interacted features to generate the corresponding
task-specific feature through a query-based Transformer for corresponding task
predictions. Extensive experiments on two dense image prediction datasets,
NYUD-v2 and PASCAL-Context, demonstrate that our model uses fewer GFLOPs and
significantly outperforms current Transformer- and CNN-based competitive models
on a variety of metrics. The code are available at
https://github.com/yangyangxu0/DeMT .
Related papers
- CrossMPT: Cross-attention Message-Passing Transformer for Error Correcting Codes [14.631435001491514]
We propose a novel Cross-attention Message-Passing Transformer (CrossMPT)
We show that CrossMPT significantly outperforms existing neural network-based decoders for various code classes.
Notably, CrossMPT achieves this decoding performance improvement, while significantly reducing the memory usage, complexity, inference time, and training time.
arXiv Detail & Related papers (2024-05-02T06:30:52Z) - Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction [126.34551436845133]
CNNs and Transformers have their own advantages and both have been widely used for dense prediction in multi-task learning (MTL)
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
arXiv Detail & Related papers (2023-08-10T17:37:49Z) - ConvTransSeg: A Multi-resolution Convolution-Transformer Network for
Medical Image Segmentation [14.485482467748113]
We propose a hybrid encoder-decoder segmentation model (ConvTransSeg)
It consists of a multi-layer CNN as the encoder for feature learning and the corresponding multi-level Transformer as the decoder for segmentation prediction.
Our method achieves the best performance in terms of Dice coefficient and average symmetric surface distance measures with low model complexity and memory consumption.
arXiv Detail & Related papers (2022-10-13T14:59:23Z) - Characterization of anomalous diffusion through convolutional
transformers [0.8984888893275713]
We propose a new transformer based neural network architecture for the characterization of anomalous diffusion.
Our new architecture, the Convolutional Transformer (ConvTransformer), uses a bi-layered convolutional neural network to extract features from our diffusive trajectories.
We show that the ConvTransformer is able to outperform the previous state of the art at determining the underlying diffusive regime in short trajectories.
arXiv Detail & Related papers (2022-10-10T18:53:13Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for
Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - Multimodal Fusion Transformer for Remote Sensing Image Classification [35.57881383390397]
Vision transformers (ViTs) have been trending in image classification tasks due to their promising performance when compared to convolutional neural networks (CNNs)
To achieve satisfactory performance, close to that of CNNs, transformers need fewer parameters.
We introduce a new multimodal fusion transformer (MFT) network which comprises a multihead cross patch attention (mCrossPA) for HSI land-cover classification.
arXiv Detail & Related papers (2022-03-31T11:18:41Z) - Error Correction Code Transformer [92.10654749898927]
We propose to extend for the first time the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We encode each channel's output dimension to high dimension for better representation of the bits information to be processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
arXiv Detail & Related papers (2022-03-27T15:25:58Z) - Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper which applies transformers into pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z) - Wake Word Detection with Streaming Transformers [72.66551640048405]
We show that our proposed Transformer model outperforms the baseline convolution network by 25% on average in false rejection rate at the same false alarm rate.
Our experiments on the Mobvoi wake word dataset demonstrate that our proposed Transformer model outperforms the baseline convolution network by 25%.
arXiv Detail & Related papers (2021-02-08T19:14:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.