Multi-Task Learning with Multi-Query Transformer for Dense Prediction
- URL: http://arxiv.org/abs/2205.14354v4
- Date: Fri, 7 Apr 2023 17:58:55 GMT
- Title: Multi-Task Learning with Multi-Query Transformer for Dense Prediction
- Authors: Yangyang Xu, Xiangtai Li, Haobo Yuan, Yibo Yang, Lefei Zhang
- Abstract summary: We propose a simple pipeline named Multi-Query Transformer (MQTransformer) to facilitate the reasoning among multiple tasks.
Instead of modeling the dense per-pixel context among different tasks, we seek a task-specific proxy to perform cross-task reasoning.
Experimental results show that the proposed method is effective and achieves state-of-the-art results.
- Score: 38.476408482050815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous multi-task dense prediction studies developed complex pipelines such
as multi-modal distillations in multiple stages or searching for task
relational contexts for each task. The core insight behind these methods is to
maximize the mutual effects among tasks. Inspired by the recent query-based
Transformers, we propose a simple pipeline named Multi-Query Transformer
(MQTransformer) that is equipped with multiple queries from different tasks to
facilitate the reasoning among multiple tasks and simplify the cross-task
interaction pipeline. Instead of modeling the dense per-pixel context among
different tasks, we seek a task-specific proxy to perform cross-task reasoning
via multiple queries where each query encodes the task-related context. The
MQTransformer is composed of three key components: shared encoder, cross-task
query attention module and shared decoder. We first model each task with a
task-relevant query. Then both the task-specific feature output by the feature
extractor and the task-relevant query are fed into the shared encoder, thus
encoding the task-relevant query from the task-specific feature. Secondly, we
design a cross-task query attention module to reason about the dependencies among
multiple task-relevant queries; this enables the module to focus solely on
query-level interaction. Finally, we use a shared decoder to gradually refine
the image features with the reasoned query features from different tasks.
Extensive experimental results on two dense prediction datasets (NYUD-v2 and
PASCAL-Context) show that the proposed method is effective and
achieves state-of-the-art results. Code and models are available at
https://github.com/yangyangxu0/MQTransformer.
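
To make the three components concrete, below is a minimal PyTorch-style sketch of the pipeline described in the abstract: learnable task-relevant queries, a shared encoder that encodes each query from its task-specific feature, a cross-task query attention step over the concatenated queries, and a shared decoder that refines the image features with the reasoned queries. It is a reading aid only; the module names, query count, and attention layout are assumptions rather than the released implementation (see the repository above for the authors' code).

```python
# Minimal sketch of the MQTransformer idea (assumed layout, not the authors' code):
# per-task learnable queries -> shared encoder -> cross-task query attention -> shared decoder.
import torch
import torch.nn as nn


class MQTransformerSketch(nn.Module):
    def __init__(self, num_tasks: int, num_queries: int = 16, dim: int = 256, heads: int = 8):
        super().__init__()
        # One set of learnable task-relevant queries per task.
        self.task_queries = nn.Parameter(torch.randn(num_tasks, num_queries, dim))
        # Shared encoder: queries attend to each task's feature map (cross-attention).
        self.shared_encoder = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-task query attention: queries of all tasks attend to each other.
        self.cross_task_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Shared decoder: image features attend back to the reasoned queries.
        self.shared_decoder = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, task_feats):
        # task_feats: list of per-task feature maps, each (B, HW, dim) after flattening.
        B = task_feats[0].shape[0]
        encoded = []
        for t, feat in enumerate(task_feats):
            q = self.task_queries[t].unsqueeze(0).expand(B, -1, -1)
            q, _ = self.shared_encoder(q, feat, feat)   # encode query from the task feature
            encoded.append(q)
        # Concatenate queries of all tasks and reason over their dependencies only.
        all_q = torch.cat(encoded, dim=1)               # (B, T * num_queries, dim)
        all_q, _ = self.cross_task_attn(all_q, all_q, all_q)
        # Refine each task's image features with the reasoned queries.
        outputs = []
        for feat in task_feats:
            refined, _ = self.shared_decoder(feat, all_q, all_q)
            outputs.append(refined)                     # (B, HW, dim) per task
        return outputs
```

The key design choice the abstract emphasizes is visible here: cross-task reasoning happens only at the query level (over T * num_queries tokens), not over dense per-pixel features.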
Related papers
- DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding [7.470587868134298]
Point scene understanding is a challenging task that processes real-world scene point clouds.
Recent state-of-the-art methods first segment each object and then process each one independently in multiple stages for the different sub-tasks.
We propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation.
arXiv Detail & Related papers (2024-03-25T05:22:34Z)
- Task Indicating Transformer for Task-conditional Dense Predictions [16.92067246179703]
We introduce a novel task-conditional framework called Task Indicating Transformer (TIT) to tackle this challenge.
Our approach designs a Mix Task Adapter module within the transformer block, which incorporates a Task Indicating Matrix through matrix decomposition.
We also propose a Task Gate Decoder module that harnesses a Task Indicating Vector and gating mechanism to facilitate adaptive multi-scale feature refinement.
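
Based only on the two sentences above, a rough sketch of how a Task Indicating Matrix obtained by matrix decomposition could modulate a shared adapter might look as follows; the class name `MixTaskAdapterSketch`, the low-rank factorization, and all shapes are assumptions, not the TIT implementation.

```python
# Rough sketch: each task modulates a shared adapter with a low-rank task matrix
# built from two small task-specific factors (assumed design, not TIT's code).
import torch
import torch.nn as nn


class MixTaskAdapterSketch(nn.Module):
    def __init__(self, num_tasks: int, dim: int = 256, rank: int = 16):
        super().__init__()
        # Low-rank factors U_t (dim x rank) and V_t (rank x dim) per task.
        self.U = nn.Parameter(torch.randn(num_tasks, dim, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(num_tasks, rank, dim) * 0.02)
        self.shared = nn.Linear(dim, dim)  # shared adapter weights

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # x: (B, N, dim) tokens from a transformer block.
        task_matrix = self.U[task_id] @ self.V[task_id]   # (dim, dim) task indicating matrix
        return self.shared(x) + x @ task_matrix           # shared path + task-specific path
```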
arXiv Detail & Related papers (2024-03-01T07:06:57Z)
- TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts [11.608682595506354]
Recent models consider directly decoding task-specific features from one shared task-generic feature.
Since the input feature is fully shared and each task decoder also shares its decoding parameters across input samples, the feature decoding process is static.
We propose TaskExpert, a novel multi-task mixture-of-experts model that enables learning multiple representative task-generic feature spaces.
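
A minimal mixture-of-experts sketch of the dynamic decoding idea summarized above: several expert projections act on the shared task-generic feature, and a per-task gate assembles a sample-dependent, task-specific feature. Expert count, gating design, and shapes are assumptions, not TaskExpert's actual architecture.

```python
# Mixture-of-experts sketch (assumed design): per-task gating weights combine
# expert projections of the shared feature, so decoding is no longer static.
import torch
import torch.nn as nn


class TaskMoESketch(nn.Module):
    def __init__(self, num_tasks: int, num_experts: int = 4, dim: int = 256):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        # One gating head per task, producing sample-dependent expert weights.
        self.gates = nn.ModuleList([nn.Linear(dim, num_experts) for _ in range(num_tasks)])

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # x: (B, N, dim) shared task-generic tokens.
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, N, dim, E)
        weights = torch.softmax(self.gates[task_id](x), dim=-1)          # (B, N, E)
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)           # (B, N, dim)
```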
arXiv Detail & Related papers (2023-07-28T06:00:57Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a reinforcement learning problem in which each task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads.
We explore various attention-based contexts, such as global and local, in the multi-task setting.
We propose an Adaptive Task-Relational Context module, which samples from the pool of all available contexts for each task pair.
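
A loose sketch of pairwise cross-task context in this spirit is given below: for every (target, source) task pair, a cross-attention block lets the target feature query context from the source feature. The context-type sampling that the paper actually searches over (e.g. global vs. local) is omitted, and everything in the sketch is an assumption.

```python
# Pairwise cross-task context sketch (assumed design; ATRC's context-type search omitted).
import torch
import torch.nn as nn


class PairwiseContextSketch(nn.Module):
    def __init__(self, num_tasks: int, dim: int = 256, heads: int = 4):
        super().__init__()
        # One cross-attention block per (target, source) task pair.
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_tasks * num_tasks)
        ])
        self.num_tasks = num_tasks

    def forward(self, feats):
        # feats: list of (B, HW, dim) per-task features from the task-specific heads.
        refined = []
        for t, f_t in enumerate(feats):
            ctx = f_t
            for s, f_s in enumerate(feats):
                attn = self.attn[t * self.num_tasks + s]
                out, _ = attn(f_t, f_s, f_s)   # target queries context from source
                ctx = ctx + out
            refined.append(ctx)
        return refined
```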
arXiv Detail & Related papers (2021-04-28T16:45:56Z)
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [24.870827400461682]
We propose a Unified Transformer model to simultaneously learn the most prominent tasks across different domains.
Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task.
The entire model is jointly trained end-to-end with losses from each task.
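
A condensed sketch of such a unified setup, under assumed dimensions and layer counts: one encoder per input modality, a shared decoder driven by per-task query embeddings, a small head per task, and task losses simply summed for joint end-to-end training.

```python
# UniT-style sketch (assumed shapes and layer counts, not the released model).
import torch
import torch.nn as nn


class UnifiedTransformerSketch(nn.Module):
    def __init__(self, task_out_dims: dict, dim: int = 256, heads: int = 8):
        super().__init__()
        # One encoder per modality.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        # Shared decoder driven by per-task query embeddings.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.task_queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(1, 8, dim)) for t in task_out_dims})
        self.heads = nn.ModuleDict({t: nn.Linear(dim, d) for t, d in task_out_dims.items()})

    def forward(self, image_tokens, text_tokens, task: str):
        # image_tokens / text_tokens: (B, N, dim) pre-embedded inputs.
        memory = torch.cat([self.image_encoder(image_tokens),
                            self.text_encoder(text_tokens)], dim=1)
        q = self.task_queries[task].expand(image_tokens.shape[0], -1, -1)
        return self.heads[task](self.decoder(q, memory))


# Hypothetical usage: UnifiedTransformerSketch({"vqa": 3129, "detection": 91})
```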
arXiv Detail & Related papers (2021-02-22T04:45:06Z)
- CompositeTasking: Understanding Images by Spatial Composition of Tasks [85.95743368954233]
CompositeTasking is the fusion of multiple, spatially distributed tasks.
The proposed network takes as input a pair of an image and a set of pixel-wise dense task requests, and makes the task-related prediction for each pixel.
It not only offers us a compact network for multi-tasking, but also allows for task-editing.
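
A tiny sketch of that interface: the network receives an image plus a pixel-wise task map and emits, at every pixel, a prediction conditioned on the task requested there. The conditioning mechanism (adding a task embedding to the features) is an assumption made for illustration.

```python
# Spatial task-conditioning sketch (assumed conditioning mechanism).
import torch
import torch.nn as nn


class CompositeTaskingSketch(nn.Module):
    def __init__(self, num_tasks: int, channels: int = 64, out_channels: int = 16):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.task_embed = nn.Embedding(num_tasks, channels)
        self.head = nn.Conv2d(channels, out_channels, kernel_size=1)

    def forward(self, image: torch.Tensor, task_map: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); task_map: (B, H, W) integer (long) task id per pixel.
        feat = self.backbone(image)
        cond = self.task_embed(task_map).permute(0, 3, 1, 2)   # (B, C, H, W) task code per pixel
        return self.head(feat + cond)                           # per-pixel, task-dependent prediction
```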
arXiv Detail & Related papers (2020-12-16T15:47:02Z)
- MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning [82.62433731378455]
We show that tasks with high affinity at a certain scale are not guaranteed to retain this behaviour at other scales.
We propose a novel architecture, namely MTI-Net, that builds upon this finding.
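
A very compact sketch of the multi-scale interaction idea: task features are fused separately at each scale, and the result of a coarser scale is upsampled into the next finer one, so affinities are not assumed to transfer across scales. Layer choices and shapes are assumptions, not MTI-Net's actual blocks.

```python
# Multi-scale task interaction sketch (assumed layers; the real MTI-Net uses
# per-task distillation modules at every scale).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleInteractionSketch(nn.Module):
    def __init__(self, num_tasks: int, channels: int = 256, num_scales: int = 3):
        super().__init__()
        # One lightweight interaction block per scale.
        self.interact = nn.ModuleList(
            [nn.Conv2d(channels * num_tasks, channels, kernel_size=1) for _ in range(num_scales)])

    def forward(self, per_scale_task_feats):
        # per_scale_task_feats: list (coarse -> fine) of lists of per-task (B, C, H, W) maps.
        prev = None
        fused = []
        for scale, task_feats in enumerate(per_scale_task_feats):
            x = torch.cat(task_feats, dim=1)                  # stack tasks along channels
            x = self.interact[scale](x)
            if prev is not None:                              # propagate coarser-scale result
                x = x + F.interpolate(prev, size=x.shape[-2:], mode="bilinear", align_corners=False)
            prev = x
            fused.append(x)
        return fused
```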
arXiv Detail & Related papers (2020-01-19T21:02:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.