Visual Transformers with Primal Object Queries for Multi-Label Image Classification
- URL: http://arxiv.org/abs/2112.05485v1
- Date: Fri, 10 Dec 2021 12:29:07 GMT
- Title: Visual Transformers with Primal Object Queries for Multi-Label Image Classification
- Authors: Vacit Oguz Yazici, Joost van de Weijer, Longlong Yu
- Abstract summary: We propose the usage of primal object queries that are provided only at the start of the transformer decoder stack.
The proposed transformer model with primal object queries improves the state-of-the-art class-wise F1 metric by 2.1% and 1.8% on MS-COCO and NUS-WIDE respectively.
- Score: 32.63955272381003
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-label image classification is about predicting a set of class labels
that can be considered as orderless sequential data. Transformers process
sequential data as a whole and are therefore inherently good at set
prediction. The first vision-based transformer model, which was proposed for
the object detection task, introduced the concept of object queries. Object
queries are learnable positional encodings that are used by attention modules
in decoder layers to decode the object classes or bounding boxes using the
regions of interest in an image. However, inputting the same set of object
queries to different decoder layers hinders training: it results in lower
performance and delayed convergence. In this paper, we propose the usage of
primal object queries that are provided only at the start of the transformer
decoder stack. In addition, we improve the mixup technique proposed for
multi-label classification. The proposed transformer model with primal object
queries improves the state-of-the-art class-wise F1 metric by 2.1% and 1.8%,
and speeds up convergence by 79.0% and 38.6%, on the MS-COCO and NUS-WIDE
datasets respectively.
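To make the core idea concrete, here is a minimal PyTorch-style sketch (module and hyperparameter names are illustrative assumptions, not the authors' released code). In a DETR-style decoder the learned query embeddings are re-injected as positional encodings at every decoder layer; with primal object queries they are fed once, as the input to the first layer, and intermediate layers see only the previous layer's output:

```python
import torch
import torch.nn as nn

class PrimalQueryDecoder(nn.Module):
    """Sketch: object queries enter only at the start of the decoder stack,
    instead of being re-injected at every decoder layer as in DETR."""

    def __init__(self, num_queries=80, d_model=256, num_layers=6, nhead=8):
        super().__init__()
        # Learnable "primal" object queries, one per potential label.
        self.queries = nn.Parameter(torch.randn(num_queries, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.classifier = nn.Linear(d_model, 1)  # one logit per query

    def forward(self, image_tokens):  # (B, N, d_model) encoder output
        b = image_tokens.size(0)
        tgt = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries are consumed once as the decoder input; later layers
        # receive only the previous layer's output, not the raw queries.
        hs = self.decoder(tgt, image_tokens)      # (B, num_queries, d_model)
        return self.classifier(hs).squeeze(-1)    # (B, num_queries) logits
```

The abstract does not spell out how the mixup variant is improved, so the following only sketches the plain multi-label baseline it starts from, mixing images and multi-hot label vectors with the same Beta-sampled coefficient:

```python
def multilabel_mixup(x, y, alpha=1.0):
    """Vanilla mixup on images x (B, C, H, W) and multi-hot labels y (B, K).
    Baseline only; the paper's improved variant differs in unstated details."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(x.size(0))
    return lam * x + (1 - lam) * x[perm], lam * y + (1 - lam) * y[perm]
```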
Related papers
- Fusion Transformer with Object Mask Guidance for Image Forgery Analysis [9.468075384561947]
We introduce OMG-Fuser, a fusion transformer-based network designed to extract information from various forensic signals.
Our approach can operate with an arbitrary number of forensic signals and leverages object information for their analysis.
Our model is robust against traditional and novel forgery attacks and can be expanded with new signals without training from scratch.
arXiv Detail & Related papers (2024-03-18T20:20:13Z)
- Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation [37.24532930188581]
Transformer-based detection and segmentation methods use a list of learned detection queries to retrieve information from the transformer network.
We empirically find that random convex combinations of the learned queries still work well for the corresponding models.
We propose to learn a convex combination with dynamic coefficients based on the high-level semantics of the image.
arXiv Detail & Related papers (2023-07-23T06:26:27Z)
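A minimal sketch of the dynamic query combination idea above (module names and the mean-pooling choice are assumptions): coefficients are predicted from a pooled image feature and passed through a softmax so the combination stays convex:

```python
import torch
import torch.nn as nn

class DynamicQueryMixer(nn.Module):
    """Sketch: form each object query as a convex combination of learned
    base queries, with image-dependent (dynamic) coefficients."""

    def __init__(self, num_base=300, num_queries=100, d_model=256):
        super().__init__()
        self.base_queries = nn.Parameter(torch.randn(num_base, d_model))
        # One coefficient vector per output query, from a global descriptor.
        self.coeff_head = nn.Linear(d_model, num_queries * num_base)
        self.num_queries, self.num_base = num_queries, num_base

    def forward(self, image_tokens):            # (B, N, d_model)
        g = image_tokens.mean(dim=1)            # global image semantics
        w = self.coeff_head(g).view(-1, self.num_queries, self.num_base)
        w = w.softmax(dim=-1)                   # softmax => convex weights
        return w @ self.base_queries            # (B, num_queries, d_model)
```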
- Language-aware Multiple Datasets Detection Pretraining for DETRs [4.939595148195813]
We propose a framework for utilizing multiple datasets to pretrain DETR-like detectors, termed METR.
It converts the typical multi-classification in object detection into binary classification by introducing a pre-trained language model.
We show that METR achieves strong results under both multi-task joint training and the pretrain-and-finetune paradigm.
arXiv Detail & Related papers (2023-04-07T10:34:04Z)
- FAQ: Feature Aggregated Queries for Transformer-based Video Object Detectors [37.38250825377456]
We take a different perspective on video object detection: we improve the quality of queries for Transformer-based models by aggregation.
On the challenging ImageNet VID benchmark, when integrated with our proposed modules, the current state-of-the-art Transformer-based object detectors can be improved by more than 2.4% on mAP and 4.2% on AP50.
arXiv Detail & Related papers (2023-03-15T02:14:56Z)
- Location-Aware Self-Supervised Transformers [74.76585889813207]
We propose to pretrain networks for semantic segmentation by predicting the relative location of image parts.
We control the difficulty of the task by masking a subset of the reference patch features visible to those of the query.
Our experiments show that this location-aware pretraining leads to representations that transfer competitively to several challenging semantic segmentation benchmarks.
arXiv Detail & Related papers (2022-12-05T16:24:29Z)
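The relative-location pretext task above can be pictured with a small sketch (the head design and the 8-way position vocabulary are illustrative assumptions): given a query patch embedding and a reference patch embedding, a classifier predicts their relative position:

```python
import torch
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    """Sketch of a relative-location pretext task: classify where a query
    patch sits relative to a reference patch (e.g. 8 neighbour directions)."""

    def __init__(self, d_model=768, num_positions=8):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * d_model, d_model),
            nn.GELU(),
            nn.Linear(d_model, num_positions),
        )

    def forward(self, query_feat, ref_feat):   # (B, d_model) each
        return self.head(torch.cat([query_feat, ref_feat], dim=-1))
```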
- High-Quality Entity Segmentation [110.55724145851725]
CropFormer is designed to tackle the intractability of instance-level segmentation on high-resolution images.
It improves mask prediction by fusing the full image with high-resolution crops that provide finer-grained image details.
With CropFormer, we achieve a significant AP gain of $1.9$ on the challenging entity segmentation task.
arXiv Detail & Related papers (2022-11-10T18:58:22Z)
- Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation [119.51445225693382]
Few-shot semantic segmentation aims to segment the target objects in a query image given only a few annotated support images.
We introduce an intermediate prototype for mining both deterministic category information from the support and adaptive category knowledge from the query.
In each IPMT layer, we propagate the object information in both support and query features to the prototype and then use it to activate the query feature map.
arXiv Detail & Related papers (2022-10-13T06:45:07Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
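A hedged sketch of ClusTR-style content-based sparse attention (the naive k-means assignment below is an illustrative stand-in for the paper's actual clustering procedure): key and value tokens are aggregated into cluster centroids, and attention runs over the much smaller set of centroids:

```python
import torch
import torch.nn.functional as F

def clustered_attention(q, k, v, num_clusters=64, iters=3):
    """Sketch: aggregate key/value tokens into cluster centroids (naive
    k-means), then attend over C centroids instead of N original tokens."""
    B, N, D = k.shape
    idx = torch.randperm(N)[:num_clusters]
    centroids = k[:, idx]                                    # (B, C, D)
    for _ in range(iters):
        assign = (k @ centroids.transpose(1, 2)).argmax(-1)  # (B, N)
        onehot = F.one_hot(assign, num_clusters).float()     # (B, N, C)
        counts = onehot.sum(1).clamp(min=1).unsqueeze(-1)    # (B, C, 1)
        centroids = onehot.transpose(1, 2) @ k / counts      # member means
    v_c = onehot.transpose(1, 2) @ v / counts                # clustered values
    attn = (q @ centroids.transpose(1, 2) / D ** 0.5).softmax(-1)
    return attn @ v_c                                        # (B, M, D)
```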
- Query2Label: A Simple Transformer Way to Multi-Label Classification [37.206922180245265]
This paper presents a simple and effective approach to solving the multi-label classification problem.
The proposed approach leverages Transformer decoders to query the existence of a class label.
Compared with prior works, the new framework is simple, using standard Transformers and vision backbones, and effective.
arXiv Detail & Related papers (2021-07-22T17:49:25Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full attention implementation with softmax weighting and keeps only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
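A minimal sketch of such a similarity-only decoder layer (projection names are assumptions; this illustrates dropping the softmax-weighted value aggregation, not the authors' code):

```python
import torch
import torch.nn as nn

class SimilarityDecoderLayer(nn.Module):
    """Sketch: keep only the raw query-key similarity for image matching,
    with no softmax weighting and no value aggregation."""

    def __init__(self, d_model=256):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)

    def forward(self, query_tokens, gallery_tokens):
        q = self.q_proj(query_tokens)                       # (B, M, d)
        k = self.k_proj(gallery_tokens)                     # (B, N, d)
        return q @ k.transpose(1, 2) / q.size(-1) ** 0.5    # (B, M, N) scores
```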
- Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)