Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation
- URL: http://arxiv.org/abs/2307.12239v2
- Date: Thu, 27 Jul 2023 18:46:42 GMT
- Title: Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation
- Authors: Yiming Cui, Linjie Yang, Haichao Yu
- Abstract summary: Transformer-based detection and segmentation methods use a list of learned detection queries to retrieve information from the transformer network.
We empirically find that random convex combinations of the learned queries still perform well for the corresponding models.
We propose to learn a convex combination with dynamic coefficients based on the high-level semantics of the image.
- Score: 37.24532930188581
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based detection and segmentation methods use a list of learned
detection queries to retrieve information from the transformer network and
learn to predict the location and category of one specific object from each
query. We empirically find that random convex combinations of the learned
queries still perform well for the corresponding models. We then propose to
learn a convex combination with dynamic coefficients based on the high-level
semantics of the image. The generated dynamic queries, named modulated queries,
better capture the prior over object locations and categories across different
images. Equipped with our modulated queries, a wide range of DETR-based models
achieve consistent and superior performance across multiple tasks including
object detection, instance segmentation, panoptic segmentation, and video
instance segmentation.
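The core mechanism is compact enough to sketch. Below is a minimal, hypothetical PyTorch sketch of how modulated queries could be produced: a small head predicts per-query coefficients from pooled image features, a softmax enforces convexity (non-negative weights summing to one), and each modulated query is a convex combination of the learned base queries. All module names, dimensions, and the pooling choice are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ModulatedQueries(nn.Module):
    """Hypothetical sketch of dynamic convex query combination (names illustrative)."""

    def __init__(self, num_queries: int, hidden_dim: int, num_bases: int):
        super().__init__()
        # Learned base queries, analogous to the static queries in DETR.
        self.base_queries = nn.Parameter(torch.randn(num_bases, hidden_dim))
        # Predicts one coefficient vector (over the bases) per output query.
        self.coef_head = nn.Linear(hidden_dim, num_queries * num_bases)
        self.num_queries = num_queries
        self.num_bases = num_bases

    def forward(self, image_feats: torch.Tensor) -> torch.Tensor:
        # image_feats: (batch, hidden_dim), e.g. globally pooled backbone
        # features standing in for the "high-level semantics of the image".
        logits = self.coef_head(image_feats)
        logits = logits.view(-1, self.num_queries, self.num_bases)
        # Softmax makes each combination convex: weights >= 0 and sum to 1.
        coefs = logits.softmax(dim=-1)
        # (batch, num_queries, num_bases) @ (num_bases, hidden_dim)
        return coefs @ self.base_queries  # (batch, num_queries, hidden_dim)
```

The resulting tensor would replace the static query embeddings fed to a DETR-style decoder; because the combination is convex, the modulated queries stay within the convex hull of the learned ones, which the abstract's random-combination observation suggests is a well-behaved region.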
Related papers
- Language-aware Multiple Datasets Detection Pretraining for DETRs [4.939595148195813]
We propose a framework for utilizing Multiple datasets to pretrain DETR-like detectors, termed METR.
It converts the typical multi-classification in object detection into binary classification by introducing a pre-trained language model.
We show METR achieves extraordinary results under both multi-task joint training and the pretrain-and-finetune paradigm.
arXiv Detail & Related papers (2023-04-07T10:34:04Z)
- FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors [37.38250825377456]
We take a different perspective on video object detection. In detail, we improve the quality of the queries for Transformer-based models via feature aggregation.
On the challenging ImageNet VID benchmark, when integrated with our proposed modules, the current state-of-the-art Transformer-based object detectors can be improved by more than 2.4% on mAP and 4.2% on AP50.
arXiv Detail & Related papers (2023-03-15T02:14:56Z)
- Learning Equivariant Segmentation with Instance-Unique Querying [47.52528819153683]
We devise a new training framework that boosts query-based models through discriminative query embedding learning.
Our algorithm uses the queries to retrieve the corresponding instances from the whole training dataset.
On top of four well-known query-based models, our training algorithm provides significant performance gains.
arXiv Detail & Related papers (2022-10-03T13:14:00Z)
- Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z)
- Fusing Local Similarities for Retrieval-based 3D Orientation Estimation of Unseen Objects [70.49392581592089]
We tackle the task of estimating the 3D orientation of previously unseen objects from monocular images.
We follow a retrieval-based strategy and prevent the network from learning object-specific features.
Our experiments on the LineMOD, LineMOD-Occluded, and T-LESS datasets show that our method yields a significantly better generalization to unseen objects than previous works.
arXiv Detail & Related papers (2022-03-16T08:53:00Z)
- Visual Transformers with Primal Object Queries for Multi-Label Image Classification [32.63955272381003]
We propose the usage of primal object queries that are only provided at the start of the transformer decoder stack.
The proposed transformer model with primal object queries improves the state-of-the-art class-wise F1 metric by 2.1% and 1.8%.
arXiv Detail & Related papers (2021-12-10T12:29:07Z)
- Vision-Language Transformer and Query Generation for Referring Segmentation [39.01244764840372]
We reformulate referring segmentation as a direct attention problem.
We build a network with an encoder-decoder attention mechanism architecture that "queries" the given image with the language expression.
Our approach is lightweight and achieves new state-of-the-art performance consistently on three referring segmentation datasets.
arXiv Detail & Related papers (2021-08-12T07:24:35Z)
- Visual Composite Set Detection Using Part-and-Sum Transformers [74.26037922682355]
We present a new approach, denoted Part-and-Sum detection Transformer (PST), to perform end-to-end composite set detection.
PST achieves state-of-the-art results among single-stage models, while nearly matching the results of custom-designed two-stage models.
arXiv Detail & Related papers (2021-05-05T16:31:32Z)
- Learning to Compose Hypercolumns for Visual Correspondence [57.93635236871264]
We introduce a novel approach to visual correspondence that dynamically composes effective features by leveraging relevant layers conditioned on the images to match.
The proposed method, dubbed Dynamic Hyperpixel Flow, learns to compose hypercolumn features on the fly by selecting a small number of relevant layers from a deep convolutional neural network.
arXiv Detail & Related papers (2020-07-21T04:03:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.