Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction
- URL: http://arxiv.org/abs/2308.05721v4
- Date: Thu, 21 Sep 2023 09:48:32 GMT
- Title: Deformable Mixer Transformer with Gating for Multi-Task Learning of
Dense Prediction
- Authors: Yangyang Xu, Yibo Yang, Bernard Ghanem, Lefei Zhang, Du Bo, Dacheng
Tao
- Abstract summary: CNNs and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL).
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer with shared gating for multi-task learning of dense prediction.
- Score: 126.34551436845133
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CNNs and Transformers have their own advantages and both have been widely
used for dense prediction in multi-task learning (MTL). Most of the current
studies on MTL solely rely on CNN or Transformer. In this work, we present a
novel MTL model by combining both merits of deformable CNN and query-based
Transformer with shared gating for multi-task learning of dense prediction.
This combination can offer a simple and efficient solution owing to its
powerful and flexible task-specific learning, as well as lower cost, lower
complexity, and fewer parameters than traditional MTL methods. We
introduce the deformable mixer Transformer with gating (DeMTG), a simple and
effective encoder-decoder architecture that incorporates convolution and
attention mechanisms in a unified network for MTL. It is carefully designed to
exploit the advantages of each block and to provide deformable and
comprehensive features for all tasks from both local and global perspectives.
First, the deformable mixer encoder contains two types of operators: a
channel-aware mixing operator that allows communication among different
channels, and a spatial-aware deformable operator that uses deformable
convolution to efficiently sample the more informative spatial locations.
Second, the task-aware gating transformer decoder performs the task-specific
predictions: a task interaction block integrated with self-attention captures
cross-task interaction features, and a task query block integrated with gating
attention selects the corresponding task-specific features. Further,
experimental results demonstrate that the proposed DeMTG uses fewer GFLOPs and
significantly outperforms competitive Transformer-based and CNN-based models
on a variety of metrics on
three dense prediction datasets. Our code and models are available at
https://github.com/yangyangxu0/DeMTG.
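The encoder/decoder split described in the abstract can be illustrated with a deliberately tiny pure-Python sketch. This is not the authors' implementation: the real model uses deformable convolution, multi-head self-attention, and learned task queries, whereas `channel_mix` and `gating_attention` below are hypothetical stand-ins that only show the shape of the two ideas (channel-aware mixing, and sigmoid-gated selection of task-specific features).

```python
import math

def channel_mix(pixels, mix_weights):
    """Channel-aware mixing (simplified): each output channel is a learned
    linear combination of all input channels, applied per spatial location."""
    C = len(mix_weights)  # C x C mixing matrix
    return [
        [sum(mix_weights[c][k] * px[k] for k in range(C)) for c in range(C)]
        for px in pixels  # pixels: list of per-location channel vectors
    ]

def gating_attention(task_query, shared_feats):
    """Task query block (simplified): a sigmoid gate, driven by the similarity
    between the task query and each shared feature vector, selects
    task-specific features from the shared decoder output."""
    gated = []
    for feat in shared_feats:
        score = sum(q * f for q, f in zip(task_query, feat))
        gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid gate in [0, 1]
        gated.append([gate * f for f in feat])
    return gated

# Toy usage: two spatial locations, two channels, one task query.
feats = channel_mix([[1.0, 2.0], [0.5, -1.0]], [[1.0, 0.0], [0.0, 1.0]])
task_feats = gating_attention([0.0, 0.0], feats)  # zero query -> gate = 0.5
```

With the identity mixing matrix the features pass through unchanged, and the all-zero task query gives a neutral gate of 0.5; a trained model would instead learn both the mixing matrix and the per-task queries.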
Related papers
- Pilot: Building the Federated Multimodal Instruction Tuning Framework [79.56362403673354]
Our framework integrates two stages of "adapter on adapter" into the connector of the vision encoder and the LLM.
In stage 1, we extract task-specific features and client-specific features from visual information.
In stage 2, we build the cross-task Mixture-of-Adapters(CT-MoA) module to perform cross-task interaction.
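A minimal sketch of this "adapter on adapter" idea, under the assumption that each adapter is a plain linear map and the mixture gates are already computed (the actual CT-MoA module learns its gating; the function names here are illustrative, not from the paper's code):

```python
def linear_adapter(x, w, b):
    """A minimal adapter: one linear map over a feature vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi
            for row, bi in zip(w, b)]

def mixture_of_adapters(x, adapters, gates):
    """Cross-task mixture (simplified): blend several adapters' outputs
    using one pre-computed gate weight per adapter."""
    out = [0.0] * len(x)
    for (w, b), g in zip(adapters, gates):
        y = linear_adapter(x, w, b)
        out = [o + g * yi for o, yi in zip(out, y)]
    return out

# Toy usage: blend an identity adapter and a zero adapter half-and-half.
mixed = mixture_of_adapters(
    [2.0, 4.0],
    adapters=[([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]),   # identity adapter
              ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])],  # zero adapter
    gates=[0.5, 0.5],
)
```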
arXiv Detail & Related papers (2025-01-23T07:49:24Z)
- CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability.
We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications.
Our model achieves 83.0%/84.1% top-1 with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
- MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction [5.8919870666241945]
We present a Multiscale Relational Transformer (MART) network for multi-agent trajectory prediction.
MART is a hypergraph transformer architecture that considers individual and group behaviors within the transformer machinery.
In addition, we propose an Adaptive Group Estimator (AGE) designed to infer complex group relations in real-world environments.
arXiv Detail & Related papers (2024-07-31T14:31:49Z)
- AdaMTL: Adaptive Input-dependent Inference for Efficient Multi-Task Learning [1.4963011898406864]
We introduce AdaMTL, an adaptive framework that learns task-aware inference policies for multi-task learning models.
AdaMTL reduces the computational complexity by 43% while improving the accuracy by 1.32% compared to single-task models.
When deployed on Vuzix M4000 smart glasses, AdaMTL reduces the inference latency and the energy consumption by up to 21.8% and 37.5%, respectively.
arXiv Detail & Related papers (2023-04-17T20:17:44Z)
- DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction [40.447092963041236]
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer.
Our method, named DeMT, is based on a simple and effective encoder-decoder architecture.
Our model uses fewer GFLOPs and significantly outperforms current Transformer- and CNN-based competitive models.
arXiv Detail & Related papers (2023-01-09T16:00:15Z)
- M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design [95.41238363769892]
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly.
Current MTL regimes have to activate nearly the entire model even to just execute a single task.
We present a model-accelerator co-design framework to enable efficient on-device MTL.
arXiv Detail & Related papers (2022-10-26T15:40:24Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning [14.412066456583917]
We propose a transformer-based model for few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples.
Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal.
We extend our approach to a semi-supervised regime utilizing unlabeled samples in the support set and further improving few-shot performance.
arXiv Detail & Related papers (2022-01-11T20:15:35Z)
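The HyperTransformer entry above describes generating CNN weights directly from support samples. A deliberately tiny pure-Python sketch of that hypernetwork idea follows, with the large simplifying assumption that mean pooling plus one fixed generator matrix stands in for the paper's transformer, and that the target model is a single linear scorer; all names are illustrative:

```python
def generate_weights(support_embs, generator):
    """Hypernetwork-style weight generation (simplified): pool the
    support-set embeddings, then map the pooled vector through a fixed
    'generator' matrix to produce the target model's weights."""
    n = len(support_embs)
    dim = len(support_embs[0])
    pooled = [sum(e[d] for e in support_embs) / n for d in range(dim)]
    # Each generator row produces one weight of the target model.
    return [sum(g * p for g, p in zip(row, pooled)) for row in generator]

def classify(x, weights):
    """Target model: a single linear score from the generated weights."""
    return sum(w * xi for w, xi in zip(weights, x))

# Toy usage: two support embeddings, identity generator.
w = generate_weights([[1.0, 3.0], [3.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
score = classify([1.0, 1.0], w)
```

The point of the construction is that `w` depends only on the support set, so a new few-shot task yields a new classifier without gradient updates to the target model.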
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences.