Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction
- URL: http://arxiv.org/abs/2308.05721v4
- Date: Thu, 21 Sep 2023 09:48:32 GMT
- Title: Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction
- Authors: Yangyang Xu, Yibo Yang, Bernard Ghanem, Lefei Zhang, Du Bo, Dacheng
Tao
- Abstract summary: CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model that combines the merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
- Score: 126.34551436845133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CNNs and Transformers have their own advantages and both have been widely
used for dense prediction in multi-task learning (MTL). Most of the current
studies on MTL solely rely on CNN or Transformer. In this work, we present a
novel MTL model that combines the merits of deformable CNN and query-based
Transformer with shared gating for multi-task learning of dense prediction.
This combination offers a simple and efficient solution owing to its powerful
and flexible task-specific learning, together with lower cost, lower
complexity, and fewer parameters than traditional MTL methods. We introduce
the deformable mixer Transformer with gating (DeMTG), a simple and effective
encoder-decoder architecture that incorporates convolution and attention
mechanisms in a unified network for MTL. It is carefully designed to exploit
the advantages of each block and to provide deformable and comprehensive
features for all tasks from both local and global perspectives.
First, the deformable mixer encoder contains two types of operators: a
channel-aware mixing operator that allows communication among different
channels, and a spatial-aware deformable operator that applies deformable
convolution to efficiently sample more informative spatial locations. Second,
the task-aware gating transformer decoder performs the task-specific
predictions: a task interaction block with self-attention captures task
interaction features, and a task query block with gating attention selects the
corresponding task-specific features. Experimental results demonstrate that
the proposed DeMTG uses fewer GFLOPs and significantly outperforms current
Transformer-based and CNN-based competitive models on a variety of metrics
across three dense prediction datasets. Our code and models are available at
https://github.com/yangyangxu0/DeMTG.
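As a rough illustration of the two components described above, here is a minimal PyTorch-style sketch assuming torchvision's DeformConv2d. All module names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation; the linked repository contains the actual code.

```python
# Illustrative sketch of the DeMTG building blocks (not the authors' code).
# Assumes PyTorch and torchvision; dimensions and hyperparameters are made up.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class DeformableMixerEncoder(nn.Module):
    """Channel-aware mixing followed by spatial-aware deformable sampling."""

    def __init__(self, dim: int):
        super().__init__()
        # Channel-aware mixing: a 1x1 conv lets channels communicate.
        self.channel_mix = nn.Conv2d(dim, dim, kernel_size=1)
        # Spatial-aware deformable operator: per-location offsets for a 3x3 kernel.
        self.offset = nn.Conv2d(dim, 2 * 3 * 3, kernel_size=3, padding=1)
        self.deform = DeformConv2d(dim, dim, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.channel_mix(x)
        x = self.deform(x, self.offset(x))
        return self.norm(x).relu()


class TaskGatingDecoder(nn.Module):
    """Task interaction via self-attention, then gated task-query selection."""

    def __init__(self, dim: int, num_tasks: int, num_heads: int = 4):
        super().__init__()
        self.task_queries = nn.Parameter(torch.randn(num_tasks, dim))
        self.interaction = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.query_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Linear(dim, dim)  # gating over the attended task features

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, H*W, C) flattened encoder features shared by all tasks.
        feats, _ = self.interaction(feats, feats, feats)   # task interaction block
        q = self.task_queries.unsqueeze(0).expand(feats.size(0), -1, -1)
        task_feats, _ = self.query_attn(q, feats, feats)   # task query block
        return torch.sigmoid(self.gate(task_feats)) * task_feats  # gating attention


if __name__ == "__main__":
    enc = DeformableMixerEncoder(dim=64)
    dec = TaskGatingDecoder(dim=64, num_tasks=3)
    x = torch.randn(2, 64, 32, 32)
    f = enc(x)                                   # (2, 64, 32, 32)
    out = dec(f.flatten(2).transpose(1, 2))      # (2, 3, 64): one feature per task
    print(out.shape)
```

In the full model, the gated task-specific features would feed per-task prediction heads (e.g. segmentation, depth, surface normals); this sketch stops at the shared representation.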
Related papers
- MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction [5.8919870666241945]
We present a Multiscale Relational Transformer (MART) network for multi-agent trajectory prediction.
MART is a hypergraph transformer architecture that considers both individual and group behaviors within the transformer machinery.
In addition, we propose an Adaptive Group Estimator (AGE) designed to infer complex group relations in real-world environments.
arXiv Detail & Related papers (2024-07-31T14:31:49Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- AdaMTL: Adaptive Input-dependent Inference for Efficient Multi-Task Learning [1.4963011898406864]
We introduce AdaMTL, an adaptive framework that learns task-aware inference policies for multi-task learning models.
AdaMTL reduces the computational complexity by 43% while improving the accuracy by 1.32% compared to single-task models.
When deployed on Vuzix M4000 smart glasses, AdaMTL reduces the inference latency and the energy consumption by up to 21.8% and 37.5%, respectively.
arXiv Detail & Related papers (2023-04-17T20:17:44Z)
- DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction [40.447092963041236]
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer.
Our method, named DeMT, is based on a simple and effective encoder-decoder architecture.
Our model uses fewer GFLOPs and significantly outperforms current Transformer- and CNN-based competitive models.
arXiv Detail & Related papers (2023-01-09T16:00:15Z)
- M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design [95.41238363769892]
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly.
Current MTL regimes have to activate nearly the entire model even to just execute a single task.
We present a model-accelerator co-design framework to enable efficient on-device MTL.
arXiv Detail & Related papers (2022-10-26T15:40:24Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning [14.412066456583917]
We propose a transformer-based model for few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples.
Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal.
We extend our approach to a semi-supervised regime utilizing unlabeled samples in the support set and further improving few-shot performance.
arXiv Detail & Related papers (2022-01-11T20:15:35Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.