Multi-Task Learning with Multi-Query Transformer for Dense Prediction
- URL: http://arxiv.org/abs/2205.14354v4
- Date: Fri, 7 Apr 2023 17:58:55 GMT
- Title: Multi-Task Learning with Multi-Query Transformer for Dense Prediction
- Authors: Yangyang Xu, Xiangtai Li, Haobo Yuan, Yibo Yang, Lefei Zhang
- Abstract summary: We propose a simple pipeline named Multi-Query Transformer (MQTransformer) to facilitate the reasoning among multiple tasks.
Instead of modeling the dense per-pixel context among different tasks, we seek a task-specific proxy to perform cross-task reasoning.
Experimental results show that the proposed method is effective and achieves state-of-the-art results.
- Score: 38.476408482050815
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Previous multi-task dense prediction studies developed complex pipelines such
as multi-modal distillations in multiple stages or searching for task
relational contexts for each task. The core insight behind these methods is to
maximize the mutual effects among tasks. Inspired by the recent query-based
Transformers, we propose a simple pipeline named Multi-Query Transformer
(MQTransformer) that is equipped with multiple queries from different tasks to
facilitate the reasoning among multiple tasks and simplify the cross-task
interaction pipeline. Instead of modeling the dense per-pixel context among
different tasks, we seek a task-specific proxy to perform cross-task reasoning
via multiple queries where each query encodes the task-related context. The
MQTransformer is composed of three key components: shared encoder, cross-task
query attention module and shared decoder. We first model each task with a
task-relevant query. Then both the task-specific feature output by the feature
extractor and the task-relevant query are fed into the shared encoder, thus
encoding the task-relevant query from the task-specific feature. Secondly, we
design a cross-task query attention module to reason about the dependencies among
multiple task-relevant queries; this enables the module to focus solely on
query-level interaction. Finally, we use a shared decoder to gradually refine
the image features with the reasoned query features from different tasks.
Extensive experimental results on two dense prediction datasets (NYUD-v2 and
PASCAL-Context) show that the proposed method is effective and
achieves state-of-the-art results. Code and models are available at
https://github.com/yangyangxu0/MQTransformer.
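
To make the three components concrete, below is a minimal PyTorch-style sketch of the pipeline described in the abstract: learnable task-relevant queries, a shared encoder that encodes each query from its task-specific feature, a cross-task query attention step over the concatenated queries, and a shared decoder that refines the image features with the reasoned queries. It is a reading aid only; the module names, query count, and attention layout are assumptions rather than the released implementation (see the repository above for the authors' code).

```python
# Minimal sketch of the MQTransformer idea (assumed layout, not the authors' code):
# per-task learnable queries -> shared encoder -> cross-task query attention -> shared decoder.
import torch
import torch.nn as nn


class MQTransformerSketch(nn.Module):
    def __init__(self, num_tasks: int, num_queries: int = 16, dim: int = 256, heads: int = 8):
        super().__init__()
        # One set of learnable task-relevant queries per task.
        self.task_queries = nn.Parameter(torch.randn(num_tasks, num_queries, dim))
        # Shared encoder: queries attend to each task's feature map (cross-attention).
        self.shared_encoder = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Cross-task query attention: queries of all tasks attend to each other.
        self.cross_task_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Shared decoder: image features attend back to the reasoned queries.
        self.shared_decoder = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, task_feats):
        # task_feats: list of per-task feature maps, each (B, HW, dim) after flattening.
        B = task_feats[0].shape[0]
        encoded = []
        for t, feat in enumerate(task_feats):
            q = self.task_queries[t].unsqueeze(0).expand(B, -1, -1)
            q, _ = self.shared_encoder(q, feat, feat)   # encode query from the task feature
            encoded.append(q)
        # Concatenate queries of all tasks and reason over their dependencies only.
        all_q = torch.cat(encoded, dim=1)               # (B, T * num_queries, dim)
        all_q, _ = self.cross_task_attn(all_q, all_q, all_q)
        # Refine each task's image features with the reasoned queries.
        outputs = []
        for feat in task_feats:
            refined, _ = self.shared_decoder(feat, all_q, all_q)
            outputs.append(refined)                     # (B, HW, dim) per task
        return outputs
```

The key design choice the abstract emphasizes is visible here: cross-task reasoning happens only at the query level (over T * num_queries tokens), not over dense per-pixel features.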
Related papers
- DOCTR: Disentangled Object-Centric Transformer for Point Scene Understanding [7.470587868134298]
Point scene understanding is a challenging task that processes real-world scene point clouds.
Recent state-of-the-art methods first segment each object and then process each one independently in multiple stages for the different sub-tasks.
We propose a novel Disentangled Object-Centric TRansformer (DOCTR) that explores object-centric representation.
arXiv Detail & Related papers (2024-03-25T05:22:34Z)
- Task Indicating Transformer for Task-conditional Dense Predictions [16.92067246179703]
We introduce a novel task-conditional framework called Task Indicating Transformer (TIT) to tackle this challenge.
Our approach designs a Mix Task Adapter module within the transformer block, which incorporates a Task Indicating Matrix through matrix decomposition.
We also propose a Task Gate Decoder module that harnesses a Task Indicating Vector and gating mechanism to facilitate adaptive multi-scale feature refinement.
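
Based only on the two sentences above, a rough sketch of how a Task Indicating Matrix obtained by matrix decomposition could modulate a shared adapter might look as follows; the class name `MixTaskAdapterSketch`, the low-rank factorization, and all shapes are assumptions, not the TIT implementation.

```python
# Rough sketch: each task modulates a shared adapter with a low-rank task matrix
# built from two small task-specific factors (assumed design, not TIT's code).
import torch
import torch.nn as nn


class MixTaskAdapterSketch(nn.Module):
    def __init__(self, num_tasks: int, dim: int = 256, rank: int = 16):
        super().__init__()
        # Low-rank factors U_t (dim x rank) and V_t (rank x dim) per task.
        self.U = nn.Parameter(torch.randn(num_tasks, dim, rank) * 0.02)
        self.V = nn.Parameter(torch.randn(num_tasks, rank, dim) * 0.02)
        self.shared = nn.Linear(dim, dim)  # shared adapter weights

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # x: (B, N, dim) tokens from a transformer block.
        task_matrix = self.U[task_id] @ self.V[task_id]   # (dim, dim) task indicating matrix
        return self.shared(x) + x @ task_matrix           # shared path + task-specific path
```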
arXiv Detail & Related papers (2024-03-01T07:06:57Z)
- TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts [11.608682595506354]
Recent models consider directly decoding task-specific features from one shared task-generic feature.
Since the input feature is fully shared and each task decoder also shares its decoding parameters across input samples, the feature decoding process is static.
We propose TaskExpert, a novel multi-task mixture-of-experts model that enables learning multiple representative task-generic feature spaces.
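
A minimal mixture-of-experts sketch of the dynamic decoding idea summarized above: several expert projections act on the shared task-generic feature, and a per-task gate assembles a sample-dependent, task-specific feature. Expert count, gating design, and shapes are assumptions, not TaskExpert's actual architecture.

```python
# Mixture-of-experts sketch (assumed design): per-task gating weights combine
# expert projections of the shared feature, so decoding is no longer static.
import torch
import torch.nn as nn


class TaskMoESketch(nn.Module):
    def __init__(self, num_tasks: int, num_experts: int = 4, dim: int = 256):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        # One gating head per task, producing sample-dependent expert weights.
        self.gates = nn.ModuleList([nn.Linear(dim, num_experts) for _ in range(num_tasks)])

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        # x: (B, N, dim) shared task-generic tokens.
        expert_out = torch.stack([e(x) for e in self.experts], dim=-1)   # (B, N, dim, E)
        weights = torch.softmax(self.gates[task_id](x), dim=-1)          # (B, N, E)
        return (expert_out * weights.unsqueeze(2)).sum(dim=-1)           # (B, N, dim)
```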
arXiv Detail & Related papers (2023-07-28T06:00:57Z)
- MINOTAUR: Multi-task Video Grounding From Multimodal Queries [70.08973664126873]
We present a single, unified model for tackling query-based video understanding in long-form videos.
In particular, our model can address all three tasks of the Ego4D Episodic Memory benchmark.
arXiv Detail & Related papers (2023-02-16T04:00:03Z)
- Fast Inference and Transfer of Compositional Task Structures for Few-shot Task Generalization [101.72755769194677]
We formulate few-shot task generalization as a reinforcement learning problem in which each task is characterized by a subtask graph.
Our multi-task subtask graph inferencer (MTSGI) first infers the common high-level task structure in terms of the subtask graph from the training tasks.
Our experiment results on 2D grid-world and complex web navigation domains show that the proposed method can learn and leverage the common underlying structure of the tasks for faster adaptation to the unseen tasks.
arXiv Detail & Related papers (2022-05-25T10:44:25Z)
- Exploring Relational Context for Multi-Task Dense Prediction [76.86090370115]
We consider a multi-task environment for dense prediction tasks, represented by a common backbone and independent task-specific heads.
We explore various attention-based contexts, such as global and local, in the multi-task setting.
We propose an Adaptive Task-Relational Context module, which samples from the pool of all available contexts for each task pair.
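
A loose sketch of pairwise cross-task context in this spirit is given below: for every (target, source) task pair, a cross-attention block lets the target feature query context from the source feature. The context-type sampling that the paper actually searches over (e.g. global vs. local) is omitted, and everything in the sketch is an assumption.

```python
# Pairwise cross-task context sketch (assumed design; ATRC's context-type search omitted).
import torch
import torch.nn as nn


class PairwiseContextSketch(nn.Module):
    def __init__(self, num_tasks: int, dim: int = 256, heads: int = 4):
        super().__init__()
        # One cross-attention block per (target, source) task pair.
        self.attn = nn.ModuleList([
            nn.MultiheadAttention(dim, heads, batch_first=True)
            for _ in range(num_tasks * num_tasks)
        ])
        self.num_tasks = num_tasks

    def forward(self, feats):
        # feats: list of (B, HW, dim) per-task features from the task-specific heads.
        refined = []
        for t, f_t in enumerate(feats):
            ctx = f_t
            for s, f_s in enumerate(feats):
                attn = self.attn[t * self.num_tasks + s]
                out, _ = attn(f_t, f_s, f_s)   # target queries context from source
                ctx = ctx + out
            refined.append(ctx)
        return refined
```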
arXiv Detail & Related papers (2021-04-28T16:45:56Z)
- Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer [24.870827400461682]
We propose a Unified Transformer model to simultaneously learn the most prominent tasks across different domains.
Based on the transformer encoder-decoder architecture, our UniT model encodes each input modality with an encoder and makes predictions on each task.
The entire model is jointly trained end-to-end with losses from each task.
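
A condensed sketch of such a unified setup, under assumed dimensions and layer counts: one encoder per input modality, a shared decoder driven by per-task query embeddings, a small head per task, and task losses simply summed for joint end-to-end training.

```python
# UniT-style sketch (assumed shapes and layer counts, not the released model).
import torch
import torch.nn as nn


class UnifiedTransformerSketch(nn.Module):
    def __init__(self, task_out_dims: dict, dim: int = 256, heads: int = 8):
        super().__init__()
        # One encoder per modality.
        self.image_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        # Shared decoder driven by per-task query embeddings.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.task_queries = nn.ParameterDict(
            {t: nn.Parameter(torch.randn(1, 8, dim)) for t in task_out_dims})
        self.heads = nn.ModuleDict({t: nn.Linear(dim, d) for t, d in task_out_dims.items()})

    def forward(self, image_tokens, text_tokens, task: str):
        # image_tokens / text_tokens: (B, N, dim) pre-embedded inputs.
        memory = torch.cat([self.image_encoder(image_tokens),
                            self.text_encoder(text_tokens)], dim=1)
        q = self.task_queries[task].expand(image_tokens.shape[0], -1, -1)
        return self.heads[task](self.decoder(q, memory))


# Hypothetical usage: UnifiedTransformerSketch({"vqa": 3129, "detection": 91})
```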
arXiv Detail & Related papers (2021-02-22T04:45:06Z)
- CompositeTasking: Understanding Images by Spatial Composition of Tasks [85.95743368954233]
CompositeTasking is the fusion of multiple, spatially distributed tasks.
The proposed network takes as input a pair of an image and a set of pixel-wise dense task requests, and makes the task-related prediction for each pixel.
It not only offers us a compact network for multi-tasking, but also allows for task-editing.
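
A tiny sketch of that interface: the network receives an image plus a pixel-wise task map and emits, at every pixel, a prediction conditioned on the task requested there. The conditioning mechanism (adding a task embedding to the features) is an assumption made for illustration.

```python
# Spatial task-conditioning sketch (assumed conditioning mechanism).
import torch
import torch.nn as nn


class CompositeTaskingSketch(nn.Module):
    def __init__(self, num_tasks: int, channels: int = 64, out_channels: int = 16):
        super().__init__()
        self.backbone = nn.Conv2d(3, channels, kernel_size=3, padding=1)
        self.task_embed = nn.Embedding(num_tasks, channels)
        self.head = nn.Conv2d(channels, out_channels, kernel_size=1)

    def forward(self, image: torch.Tensor, task_map: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); task_map: (B, H, W) integer (long) task id per pixel.
        feat = self.backbone(image)
        cond = self.task_embed(task_map).permute(0, 3, 1, 2)   # (B, C, H, W) task code per pixel
        return self.head(feat + cond)                           # per-pixel, task-dependent prediction
```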
arXiv Detail & Related papers (2020-12-16T15:47:02Z)
- MTI-Net: Multi-Scale Task Interaction Networks for Multi-Task Learning [82.62433731378455]
We show that tasks with high affinity at a certain scale are not guaranteed to retain this behaviour at other scales.
We propose a novel architecture, namely MTI-Net, that builds upon this finding.
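
A very compact sketch of the multi-scale interaction idea: task features are fused separately at each scale, and the result of a coarser scale is upsampled into the next finer one, so affinities are not assumed to transfer across scales. Layer choices and shapes are assumptions, not MTI-Net's actual blocks.

```python
# Multi-scale task interaction sketch (assumed layers; the real MTI-Net uses
# per-task distillation modules at every scale).
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleInteractionSketch(nn.Module):
    def __init__(self, num_tasks: int, channels: int = 256, num_scales: int = 3):
        super().__init__()
        # One lightweight interaction block per scale.
        self.interact = nn.ModuleList(
            [nn.Conv2d(channels * num_tasks, channels, kernel_size=1) for _ in range(num_scales)])

    def forward(self, per_scale_task_feats):
        # per_scale_task_feats: list (coarse -> fine) of lists of per-task (B, C, H, W) maps.
        prev = None
        fused = []
        for scale, task_feats in enumerate(per_scale_task_feats):
            x = torch.cat(task_feats, dim=1)                  # stack tasks along channels
            x = self.interact[scale](x)
            if prev is not None:                              # propagate coarser-scale result
                x = x + F.interpolate(prev, size=x.shape[-2:], mode="bilinear", align_corners=False)
            prev = x
            fused.append(x)
        return fused
```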
arXiv Detail & Related papers (2020-01-19T21:02:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.