Related papers: DeGMix: Efficient Multi-Task Dense Prediction with Deformable and Gating Mixer

URL: http://arxiv.org/abs/2308.05721v5
Date: Sun, 02 Nov 2025 07:24:38 GMT
Title: DeGMix: Efficient Multi-Task Dense Prediction with Deformable and Gating Mixer
Authors: Yangyang Xu, Yibo Yang, Bernard Ghanem, Lefei Zhang, Bo Du, Jun Zhu
Abstract summary: We present an efficient multi-task dense prediction with deformable and gating mixer (DeGMix). The proposed DeGMix uses fewer GFLOPs and significantly outperforms current Transformer-based and CNN-based competitive models.
Score: 129.61363098633782
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Convolutional neural networks (CNNs) and Transformers have their own advantages, and both have been widely used for dense prediction in multi-task learning (MTL). Existing studies typically employ either CNNs (which effectively capture local spatial patterns) or Transformers (which capture long-range dependencies) independently, but integrating their strengths may yield more robust models. In this work, we present an efficient MTL model that combines the adaptive capabilities of deformable CNN and query-based Transformer with shared gating for MTL of dense prediction. This combination may offer a simple and efficient solution owing to its powerful and flexible task-specific learning and the advantages of lower cost, less complexity, and fewer parameters than traditional MTL methods. We introduce an efficient multi-task dense prediction model with a deformable and gating mixer (DeGMix). First, the deformable mixer encoder contains two types of operators: the channel-aware mixing operator, which allows communication among different channels, and the spatial-aware deformable operator, which applies deformable convolution to efficiently sample more informative spatial locations. Second, the task-aware gating transformer decoder performs task-specific predictions: the task interaction block, integrated with self-attention, captures task interaction features, and the task query block, integrated with gating attention, dynamically selects the corresponding task-specific features. Furthermore, the experimental results demonstrate that the proposed DeGMix uses fewer GFLOPs and significantly outperforms current Transformer-based and CNN-based competitive models on a variety of metrics on three dense prediction datasets. Our code and models are available at https://github.com/yangyangxu0/DeMTG.
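Illustrative sketch (not from the paper): the abstract describes a channel-aware mixing operator and a task query block with gating attention. The pure-Python toy below shows one plausible reading of those two ideas on a handful of tokens. The function names (channel_mix, gated_task_query), the sigmoid gate computed from the task query, and all weights are assumptions made for illustration; the deformable convolution step is omitted entirely, and the authors' repository remains the reference for the actual design.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_mix(tokens, weight):
    """Apply the same linear map to every token, so channels exchange
    information while spatial positions stay independent.
    tokens: N x C_in list of lists, weight: C_in x C_out list of lists."""
    c_out = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(c_out)]
        for tok in tokens
    ]


def gated_task_query(query, tokens, gate_weight):
    """A single task query attends over all tokens (scaled dot-product),
    then a sigmoid gate derived from the query rescales each channel.
    The gating form is an assumption, not the paper's exact formulation."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in tokens]
    attn = softmax(scores)
    attended = [sum(a * tok[c] for a, tok in zip(attn, tokens)) for c in range(d)]
    gate = [sigmoid(sum(query[i] * gate_weight[i][c] for i in range(d))) for c in range(d)]
    return [g * v for g, v in zip(gate, attended)], attn


# Toy input: three spatial tokens with two channels each (made-up values).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mix_weight = [[0.5, 0.5], [0.5, -0.5]]
mixed = channel_mix(tokens, mix_weight)

# One hypothetical task query and an identity gate projection.
query = [1.0, 0.0]
gate_weight = [[1.0, 0.0], [0.0, 1.0]]
out, attn = gated_task_query(query, mixed, gate_weight)
```

In this toy, attn holds the attention weights the task query assigns to each token, and out is the gated task-specific feature. A real decoder would use learned projections, multiple heads, and one query set per task.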

Related papers

Parameter Aware Mamba Model for Multi-task Dense Prediction [69.94454603308196]
We introduce a novel decoder-based framework, Parameter Aware Mamba Model (PAMM), specifically designed for dense prediction in the multi-task learning setting. It features dual state space parameter experts that integrate and set task-specific parameter priors, capturing the intrinsic properties of each task. We employ the Multi-Directional Hilbert Scanning method to construct multi-angle feature sequences, thereby enhancing the sequence model's perceptual capabilities for 2D data.
arXiv Detail & Related papers (2025-11-18T13:48:00Z)
Controllable-LPMoE: Adapting to Challenging Object Segmentation via Dynamic Local Priors from Mixture-of-Experts [16.21786310193235]
We propose a novel dynamic priors-based fine-tuning paradigm with fewer trainable parameters, dubbed Controllable-LPMoE. We construct a lightweight dynamic mixed local priors extractor that captures diverse local priors from input images through heterogeneous convolutions. We also design a bi-directional interaction adapter that employs cosine-aligned deformable attention and channel-oriented adaptive scale enhancement.
arXiv Detail & Related papers (2025-10-24T03:03:59Z)
Resolving Token-Space Gradient Conflicts: Token Space Manipulation for Transformer-Based Multi-Task Learning [44.601029688423914]
Multi-Task Learning (MTL) enables multiple tasks to be learned within a shared network, but differences in objectives across tasks can cause negative transfer. We propose Dynamic Token Modulation and Expansion (DTME-MTL), a framework applicable to any transformer-based MTL architecture.
arXiv Detail & Related papers (2025-07-10T07:13:22Z)
Learnable Multi-Scale Wavelet Transformer: A Novel Alternative to Self-Attention [0.0]
Learnable Multi-Scale Wavelet Transformer (LMWT) is a novel architecture that replaces the standard dot-product self-attention. We present the detailed mathematical formulation of the learnable Haar wavelet module and its integration into the transformer framework. Our results indicate that the LMWT achieves competitive performance while offering substantial computational advantages.
arXiv Detail & Related papers (2025-04-08T22:16:54Z)
Pilot: Building the Federated Multimodal Instruction Tuning Framework [79.56362403673354]
Our framework integrates two stages of "adapter on adapter" into the connector of the vision encoder and the LLM. In stage 1, we extract task-specific features and client-specific features from visual information. In stage 2, we build the cross-task Mixture-of-Adapters(CT-MoA) module to perform cross-task interaction.
arXiv Detail & Related papers (2025-01-23T07:49:24Z)
CAS-ViT: Convolutional Additive Self-attention Vision Transformers for Efficient Mobile Applications [73.80247057590519]
Vision Transformers (ViTs) mark a revolutionary advance in neural networks with their token mixer's powerful global context capability. We introduce CAS-ViT: Convolutional Additive Self-attention Vision Transformers, to achieve a balance between efficiency and performance in mobile applications. Our model achieves 83.0%/84.1% top-1 with only 12M/21M parameters on ImageNet-1K.
arXiv Detail & Related papers (2024-08-07T11:33:46Z)
MART: MultiscAle Relational Transformer Networks for Multi-agent Trajectory Prediction [5.8919870666241945]
We present a MultiscAle Relational Transformer (MART) network for multi-agent trajectory prediction. MART is a hypergraph transformer architecture to consider individual and group behaviors in transformer machinery. In addition, we propose an Adaptive Group Estimator (AGE) designed to infer complex group relations in real-world environments.
arXiv Detail & Related papers (2024-07-31T14:31:49Z)
Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter. We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another. Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
AdaMTL: Adaptive Input-dependent Inference for Efficient Multi-Task Learning [1.4963011898406864]
We introduce AdaMTL, an adaptive framework that learns task-aware inference policies for multi-task learning models. AdaMTL reduces the computational complexity by 43% while improving the accuracy by 1.32% compared to single-task models. When deployed on Vuzix M4000 smart glasses, AdaMTL reduces the inference latency and the energy consumption by up to 21.8% and 37.5%, respectively.
arXiv Detail & Related papers (2023-04-17T20:17:44Z)
DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction [40.447092963041236]
We present a novel MTL model by combining both merits of deformable CNN and query-based Transformer. Our method, named DeMT, is based on a simple and effective encoder-decoder architecture. Our model uses fewer GFLOPs and significantly outperforms current Transformer- and CNN-based competitive models.
arXiv Detail & Related papers (2023-01-09T16:00:15Z)
M$^3$ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design [95.41238363769892]
Multi-task learning (MTL) encapsulates multiple learned tasks in a single model and often lets those tasks learn better jointly. Current MTL regimes have to activate nearly the entire model even to just execute a single task. We present a model-accelerator co-design framework to enable efficient on-device MTL.
arXiv Detail & Related papers (2022-10-26T15:40:24Z)
Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects. The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
HyperTransformer: Model Generation for Supervised and Semi-Supervised Few-Shot Learning [14.412066456583917]
We propose a transformer-based model for few-shot learning that generates weights of a convolutional neural network (CNN) directly from support samples. Our method is particularly effective for small target CNN architectures where learning a fixed universal task-independent embedding is not optimal. We extend our approach to a semi-supervised regime utilizing unlabeled samples in the support set and further improving few-shot performance.
arXiv Detail & Related papers (2022-01-11T20:15:35Z)
Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.