CAT: Cross-Attention Transformer for One-Shot Object Detection
- URL: http://arxiv.org/abs/2104.14984v1
- Date: Fri, 30 Apr 2021 13:18:53 GMT
- Title: CAT: Cross-Attention Transformer for One-Shot Object Detection
- Authors: Weidong Lin, Yuyan Deng, Yang Gao, Ning Wang, Jinghao Zhou, Lingqiao
Liu, Lei Zhang, Peng Wang
- Abstract summary: Given a query patch from a novel class, one-shot object detection aims to detect all instances of that class in a target image through semantic similarity comparison.
We present a universal Cross-Attention Transformer (CAT) module for accurate and efficient semantic similarity comparison in one-shot object detection.
- Score: 32.50786038822194
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Given a query patch from a novel class, one-shot object detection aims to
detect all instances of that class in a target image through the semantic
similarity comparison. However, due to the extremely limited guidance for the
novel class and the unseen appearance differences between query and target
instances, it is difficult to appropriately exploit their semantic
similarity and generalize well. To mitigate this problem, we present a
universal Cross-Attention Transformer (CAT) module for accurate and efficient
semantic similarity comparison in one-shot object detection. The proposed CAT
utilizes the transformer mechanism to comprehensively capture the bi-directional
correspondence between any paired pixels from the query and the target image,
which empowers us to sufficiently exploit their semantic characteristics for
accurate similarity comparison. In addition, the proposed CAT enables feature
dimensionality compression for inference speedup without performance loss.
Extensive experiments on COCO, VOC, and FSOD under one-shot settings
demonstrate the effectiveness and efficiency of our method, e.g., it surpasses
CoAE, a major baseline in this task, by 1.0% AP on COCO and runs nearly 2.5
times faster. Code will be available in the future.
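As a rough illustration of the bi-directional correspondence idea described in the abstract, the following PyTorch sketch cross-attends flattened query-patch and target-image features in both directions. All names, layer choices, and hyper-parameters are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of bi-directional cross-attention between a query patch
# and a target image, in the spirit of the CAT module (assumed design).
import torch
import torch.nn as nn

class BiDirectionalCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # target pixels attend to query pixels, and vice versa
        self.t2q = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.q2t = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_q = nn.LayerNorm(dim)

    def forward(self, query_feat, target_feat):
        # query_feat: (B, Nq, C) flattened query-patch pixels
        # target_feat: (B, Nt, C) flattened target-image pixels
        t_attended, _ = self.t2q(target_feat, query_feat, query_feat)
        q_attended, _ = self.q2t(query_feat, target_feat, target_feat)
        return (self.norm_q(query_feat + q_attended),
                self.norm_t(target_feat + t_attended))

# usage: flatten CNN feature maps to token sequences before the module
q = torch.randn(2, 16 * 16, 256)   # 16x16 query-patch features
t = torch.randn(2, 50 * 50, 256)   # 50x50 target-image features
q_out, t_out = BiDirectionalCrossAttention()(q, t)
print(q_out.shape, t_out.shape)    # (2, 256, 256) and (2, 2500, 256)
```

The dimensionality-compression speedup mentioned in the abstract would amount to projecting `dim` down before the attention layers; that detail is omitted here.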
Related papers
- Practical Video Object Detection via Feature Selection and Aggregation [18.15061460125668]
Video object detection (VOD) must contend with high across-frame variation in object appearance and diverse deterioration in some frames.
Most contemporary aggregation methods are tailored to two-stage detectors and suffer from high computational costs.
This study introduces a very simple yet potent feature selection and aggregation strategy, gaining significant accuracy at marginal computational expense.
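A hedged sketch of the generic "select then aggregate" pattern this summary describes: score candidate features from neighbouring frames, keep the most relevant ones, and fuse them into the key frame. This illustrates the strategy, not the paper's exact method.

```python
# Similarity-based feature selection and weighted aggregation (assumed form).
import torch
import torch.nn.functional as F

def select_and_aggregate(key_feat, support_feats, k=4):
    # key_feat: (C,) key-frame descriptor; support_feats: (N, C) other frames
    sims = F.cosine_similarity(support_feats, key_feat.unsqueeze(0), dim=1)
    topk = sims.topk(min(k, support_feats.size(0)))       # select best frames
    selected = support_feats[topk.indices]                # (k, C)
    weights = F.softmax(topk.values, dim=0).unsqueeze(1)  # (k, 1)
    return key_feat + (weights * selected).sum(dim=0)     # fused descriptor

fused = select_and_aggregate(torch.randn(256), torch.randn(20, 256))
```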
arXiv Detail & Related papers (2024-07-29T02:12:11Z)
- Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression [63.23578860867408]
We investigate how to integrate the evaluations of importance and sparsity scores into a single stage.
We present OFB, a cost-efficient approach that simultaneously evaluates both importance and sparsity scores.
Experiments demonstrate that OFB can achieve superior compression performance over state-of-the-art searching-based and pruning-based methods.
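One plausible reading of "evaluating importance and sparsity in a single stage" is a learnable gate whose scores simultaneously rank units and control the sparsity level; the gating form below is an assumption, not OFB's actual search procedure.

```python
# Joint importance/sparsity gate over prunable units (assumed formulation).
import torch
import torch.nn as nn

class JointScoreGate(nn.Module):
    def __init__(self, num_units, target_sparsity=0.5):
        super().__init__()
        self.scores = nn.Parameter(torch.zeros(num_units))  # importance logits
        self.target_sparsity = target_sparsity

    def forward(self, features):
        # features: (B, num_units, C); soft keep-probability per unit
        keep_prob = torch.sigmoid(self.scores)
        # one loss term drives sparsity while the scores rank importance
        sparsity_loss = (keep_prob.mean() - (1 - self.target_sparsity)).abs()
        return features * keep_prob.view(1, -1, 1), sparsity_loss
```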
arXiv Detail & Related papers (2024-03-23T13:22:36Z) - Target-aware Bi-Transformer for Few-shot Segmentation [4.3753381458828695]
Few-shot semantic segmentation (FSS) aims to use limited labeled support images to identify the segmentation of new classes of objects.
In this paper, we propose the Target-aware Bi-Transformer Network (TBTNet), which treats the support images and the query image equivalently.
A Target-aware Transformer Layer (TTL) is also designed to distill correlations and force the model to focus on foreground information.
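One simple way to make attention "target-aware", as the TTL summary suggests, is to let query pixels attend only to support pixels that the support mask marks as foreground. The sketch below illustrates that idea; it is not TBTNet's exact layer.

```python
# Foreground-masked cross-attention for few-shot segmentation (assumed form).
import torch
import torch.nn as nn

class TargetAwareAttention(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, query_feat, support_feat, support_bg_mask):
        # query_feat: (B, Nq, C); support_feat: (B, Ns, C)
        # support_bg_mask: (B, Ns) bool, True where the support pixel is
        # background, so attention ignores it and focuses on the target class
        out, _ = self.attn(query_feat, support_feat, support_feat,
                           key_padding_mask=support_bg_mask)
        return out
```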
arXiv Detail & Related papers (2023-09-18T05:28:51Z)
- Adaptive Sparse Convolutional Networks with Global Context Enhancement for Faster Object Detection on Drone Images [26.51970603200391]
This paper investigates optimizing the detection head with sparse convolution.
Sparse convolution alone, however, integrates the contextual information of tiny objects inadequately.
We propose a novel global context-enhanced adaptive sparse convolutional network.
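A rough sketch of pairing mask-gated computation with a global context branch, so features computed only at "active" pixels still see image-level context. The dense gating below is a stand-in; real sparse convolution would skip masked positions entirely for speed.

```python
# Mask-gated convolution plus a broadcast global-context branch (assumed).
import torch
import torch.nn as nn

class MaskedConvWithGlobalContext(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 3, padding=1)
        self.context = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_in, c_out, 1))

    def forward(self, x, mask):
        # x: (B, C, H, W); mask: (B, 1, H, W) in [0, 1], 1 = keep this pixel
        local = self.conv(x) * mask          # dense stand-in for sparse conv
        return local + self.context(x)       # add image-level context
```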
arXiv Detail & Related papers (2023-03-25T14:42:50Z)
- Enhancing Few-shot Image Classification with Cosine Transformer [4.511561231517167]
The Few-shot Cosine Transformer (FS-CT) computes a relational map between support and query samples.
Our method achieves competitive results on mini-ImageNet, CUB-200, and CIFAR-FS for 1-shot and 5-shot learning tasks.
Our FS-CT with cosine attention is a lightweight, simple few-shot algorithm that can be applied to a wide range of applications.
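The core ingredient the summary attributes to FS-CT is cosine attention: attention logits come from cosine similarities between queries and keys rather than raw scaled dot products. A compact sketch, with illustrative shapes and temperature:

```python
# Cosine attention: normalized dot products as attention logits.
import torch
import torch.nn.functional as F

def cosine_attention(q, k, v, temperature=10.0):
    # q: (B, Nq, C), k: (B, Nk, C), v: (B, Nk, C)
    q = F.normalize(q, dim=-1)
    k = F.normalize(k, dim=-1)
    logits = temperature * torch.bmm(q, k.transpose(1, 2))  # cosine in [-1, 1]
    return torch.bmm(F.softmax(logits, dim=-1), v)

out = cosine_attention(torch.randn(2, 5, 64), torch.randn(2, 9, 64),
                       torch.randn(2, 9, 64))
```

Bounding the logits to [-1, 1] keeps the softmax well behaved regardless of feature magnitudes, which is one reason cosine attention suits low-data regimes.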
arXiv Detail & Related papers (2022-11-13T06:03:28Z)
- ECO-TR: Efficient Correspondences Finding Via Coarse-to-Fine Refinement [80.94378602238432]
We propose an efficient structure named Efficient Correspondence Transformer (ECO-TR) that finds correspondences in a coarse-to-fine manner.
To achieve this, multiple transformer blocks are stage-wisely connected to gradually refine the predicted coordinates.
Experiments on various sparse and dense matching tasks demonstrate the superiority of our method in both efficiency and effectiveness against existing state-of-the-art methods.
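A hedged sketch of the stage-wise refinement idea: each stage predicts a residual offset that nudges the current coordinate estimate, so stacked stages progressively sharpen an initially coarse match. The MLP refiner below is a stand-in for ECO-TR's transformer blocks.

```python
# Coarse-to-fine coordinate refinement via stacked residual stages (assumed).
import torch
import torch.nn as nn

class CoarseToFineRefiner(nn.Module):
    def __init__(self, feat_dim=256, stages=3):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim + 2, 128), nn.ReLU(),
                          nn.Linear(128, 2))
            for _ in range(stages))

    def forward(self, feat, coarse_xy):
        # feat: (B, N, feat_dim) descriptors; coarse_xy: (B, N, 2) in [0, 1]
        xy = coarse_xy
        for stage in self.stages:
            xy = xy + stage(torch.cat([feat, xy], dim=-1))  # residual offset
        return xy
```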
arXiv Detail & Related papers (2022-09-25T13:05:33Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
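An illustrative sketch of modelling appearance and spatial-temporal cues jointly: each detection is embedded from its appearance feature plus an encoding of its box geometry and frame index, and a transformer encoder relates detections across a clip. This mirrors the summary, not TransSTAM itself.

```python
# Joint appearance + geometry token encoder for tracking (assumed design).
import torch
import torch.nn as nn

class JointTrackEncoder(nn.Module):
    def __init__(self, app_dim=512, dim=256):
        super().__init__()
        self.app_proj = nn.Linear(app_dim, dim)
        self.geo_proj = nn.Linear(5, dim)     # x, y, w, h, frame index
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, appearance, geometry):
        # appearance: (B, N, app_dim); geometry: (B, N, 5)
        tokens = self.app_proj(appearance) + self.geo_proj(geometry)
        return self.encoder(tokens)           # (B, N, dim) association features
```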
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- FSCE: Few-Shot Object Detection via Contrastive Proposal Encoding [14.896822373116729]
We present Few-Shot object detection via Contrastive proposal Encoding (FSCE).
FSCE is a simple yet effective approach to learning contrastive-aware object encodings that facilitate the classification of detected objects.
Our design outperforms current state-of-the-art works in any shot and all data splits, with up to +8.8% on the standard PASCAL VOC benchmark and +2.7% on the challenging COCO benchmark.
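A minimal sketch of a supervised contrastive loss over region-proposal embeddings: proposals of the same class are pulled together and others pushed apart, which is the "contrastive-aware object encoding" idea the summary describes. Simplified relative to FSCE's full formulation.

```python
# Supervised contrastive loss over proposal embeddings (simplified).
import torch
import torch.nn.functional as F

def proposal_contrastive_loss(embeddings, labels, tau=0.2):
    # embeddings: (N, C) proposal features; labels: (N,) class ids
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.t() / tau                                 # (N, N)
    mask_pos = (labels.unsqueeze(0) == labels.unsqueeze(1)).float()
    mask_pos.fill_diagonal_(0)                            # drop self-pairs
    # log-softmax over each row, excluding the self-similarity term
    log_prob = sim - torch.logsumexp(
        sim.masked_fill(torch.eye(len(z), dtype=torch.bool), float('-inf')),
        dim=1, keepdim=True)
    denom = mask_pos.sum(dim=1).clamp(min=1)
    return -(mask_pos * log_prob).sum(dim=1).div(denom).mean()

loss = proposal_contrastive_loss(torch.randn(8, 128), torch.randint(0, 3, (8,)))
```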
arXiv Detail & Related papers (2021-03-10T09:15:05Z)
- DetCo: Unsupervised Contrastive Learning for Object Detection [64.22416613061888]
Unsupervised contrastive learning has achieved great success in learning image representations with CNNs.
We present a novel contrastive learning approach, named DetCo, which fully explores the contrasts between global image and local image patches.
DetCo consistently outperforms its supervised counterpart by 1.6/1.2/1.0 AP on Mask R-CNN-C4/FPN/RetinaNet with the 1x schedule.
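An InfoNCE-style simplification of the global-local contrast the summary describes: a global image embedding is contrasted against pooled local patch embeddings from another view, with other images in the batch as negatives. Names and the pooling step are assumptions.

```python
# Global-vs-local-patch contrastive objective (simplified sketch).
import torch
import torch.nn.functional as F

def global_local_infonce(global_emb, patch_embs, tau=0.1):
    # global_emb: (B, C); patch_embs: (B, P, C) patches of the other view
    g = F.normalize(global_emb, dim=1)
    l = F.normalize(patch_embs.mean(dim=1), dim=1)  # pool patches per image
    logits = g @ l.t() / tau                        # (B, B)
    targets = torch.arange(len(g))                  # positives on the diagonal
    return F.cross_entropy(logits, targets)

loss = global_local_infonce(torch.randn(4, 128), torch.randn(4, 9, 128))
```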
arXiv Detail & Related papers (2021-02-09T12:47:20Z)
- Dense Contrastive Learning for Self-Supervised Visual Pre-Training [102.15325936477362]
We present dense contrastive learning, which implements self-supervised learning by optimizing a pairwise contrastive (dis)similarity loss at the pixel level between two views of input images.
Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only 1% slower).
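A condensed sketch of a pixel-level contrastive loss between two views: each pixel embedding in view 1 takes its most similar pixel in view 2 as the positive and all other pixels as negatives, approximating the dense (dis)similarity objective the summary describes.

```python
# Dense (pixel-level) contrastive loss between two views (condensed sketch).
import torch
import torch.nn.functional as F

def dense_contrastive_loss(f1, f2, tau=0.2):
    # f1, f2: (N, C) flattened pixel embeddings from two views of one image
    f1, f2 = F.normalize(f1, dim=1), F.normalize(f2, dim=1)
    sim = f1 @ f2.t()                        # (N, N) pixel-pair similarities
    positives = sim.argmax(dim=1)            # match pixels by max similarity
    return F.cross_entropy(sim / tau, positives)

loss = dense_contrastive_loss(torch.randn(49, 128), torch.randn(49, 128))
```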
arXiv Detail & Related papers (2020-11-18T08:42:32Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities (visible and thermal) in an adapted single-stage anchor-free base architecture.
We aim to learn pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
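A brief sketch of an anchor-free head matching the centre-and-scale formulation in the summary: a centre heatmap plus a per-pixel scale map replace direct bounding-box regression. Channel counts are illustrative.

```python
# Anchor-free centre/scale detection head (illustrative sketch).
import torch
import torch.nn as nn

class CenterScaleHead(nn.Module):
    def __init__(self, c_in=256):
        super().__init__()
        self.center = nn.Conv2d(c_in, 1, 1)   # object-centre heatmap logits
        self.scale = nn.Conv2d(c_in, 2, 1)    # log-height, log-width per pixel

    def forward(self, fused_feat):
        # fused_feat: (B, C, H, W) fused multispectral (e.g. RGB + thermal)
        return torch.sigmoid(self.center(fused_feat)), self.scale(fused_feat)
```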
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.