Spatial Transform Decoupling for Oriented Object Detection
- URL: http://arxiv.org/abs/2308.10561v2
- Date: Thu, 22 Feb 2024 07:18:57 GMT
- Title: Spatial Transform Decoupling for Oriented Object Detection
- Authors: Hongtian Yu, Yunjie Tian, Qixiang Ye, Yunfan Liu
- Abstract summary: Vision Transformers (ViTs) have achieved remarkable success in computer vision tasks.
We present a novel approach, termed Spatial Transform Decoupling (STD), providing a simple-yet-effective solution for oriented object detection with ViTs.
- Score: 43.44237345360947
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViTs) have achieved remarkable success in computer
vision tasks. However, their potential in rotation-sensitive scenarios has not
been fully explored, and this limitation may be inherently attributed to the
lack of spatial invariance in the data-forwarding process. In this study, we
present a novel approach, termed Spatial Transform Decoupling (STD), providing
a simple-yet-effective solution for oriented object detection with ViTs. Built
upon stacked ViT blocks, STD utilizes separate network branches to predict the
position, size, and angle of bounding boxes, effectively harnessing the spatial
transform potential of ViTs in a divide-and-conquer fashion. Moreover, by
aggregating cascaded activation masks (CAMs) computed upon the regressed
parameters, STD gradually enhances features within regions of interest (RoIs),
which complements the self-attention mechanism. Without bells and whistles, STD
achieves state-of-the-art performance on the benchmark datasets including
DOTA-v1.0 (82.24% mAP) and HRSC2016 (98.55% mAP), which demonstrates the
effectiveness of the proposed method. Source code is available at
https://github.com/yuhongtian17/Spatial-Transform-Decoupling.
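A minimal, hypothetical sketch of the core idea follows (this is not the authors' released implementation; the module names, the soft-mask formulation, and all hyperparameters below are assumptions): separate branches regress the position, size, and angle of each RoI, and an activation mask derived from the partially regressed box re-weights the RoI features, loosely mimicking the cascaded activation masks described in the abstract.

import torch
import torch.nn as nn

class DecoupledBoxHead(nn.Module):
    # Separate branches for position, size, and angle (divide-and-conquer).
    def __init__(self, in_dim=256, hidden=256):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))
        self.xy_branch = branch(2)     # centre offset (x, y)
        self.wh_branch = branch(2)     # size (w, h)
        self.angle_branch = branch(1)  # orientation theta

    def forward(self, roi_feat):       # roi_feat: (N, in_dim) pooled RoI features
        return (self.xy_branch(roi_feat),
                self.wh_branch(roi_feat),
                self.angle_branch(roi_feat))

def activation_mask(xy, wh, theta, size=7):
    # Soft mask over a size x size RoI grid for one rotated box, in normalised
    # RoI coordinates [-1, 1]; a sharper sigmoid gives a harder box boundary.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, size),
                            torch.linspace(-1, 1, size), indexing="ij")
    cos_t, sin_t = torch.cos(theta), torch.sin(theta)
    u = (xs - xy[0]) * cos_t + (ys - xy[1]) * sin_t    # grid coords in box frame
    v = -(xs - xy[0]) * sin_t + (ys - xy[1]) * cos_t
    inside = torch.sigmoid(10 * (wh[0] / 2 - u.abs())) * \
             torch.sigmoid(10 * (wh[1] / 2 - v.abs()))
    return inside                                      # (size, size), values in (0, 1)

# Cascaded usage (sketch): the mask from an earlier stage multiplies the RoI
# feature map seen by the next branch, progressively focusing on the RoI:
#   roi_map = roi_map * activation_mask(xy, wh, theta)

In the actual method the activation masks are aggregated across cascaded stages; see the repository linked above for the authors' implementation.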
Related papers
- TraIL-Det: Transformation-Invariant Local Feature Networks for 3D LiDAR Object Detection with Unsupervised Pre-Training [21.56675189346088]
We introduce Transformation-Invariant Local (TraIL) features and the associated TraIL-Det architecture.
TraIL features exhibit rigid transformation invariance and effectively adapt to variations in point density.
They utilize the inherent isotropic radiation of LiDAR to enhance local representation.
Our method outperforms contemporary self-supervised 3D object detection approaches in terms of mAP on KITTI.
arXiv Detail & Related papers (2024-08-25T17:59:17Z) - VST++: Efficient and Stronger Visual Saliency Transformer [74.26078624363274]
We develop an efficient and stronger VST++ model to explore global long-range dependencies.
We evaluate our model across various transformer-based backbones on RGB, RGB-D, and RGB-T SOD benchmark datasets.
arXiv Detail & Related papers (2023-10-18T05:44:49Z) - Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z) - Semantic-Constraint Matching Transformer for Weakly Supervised Object Localization [31.039698757869974]
Weakly supervised object localization (WSOL) strives to learn to localize objects with only image-level supervision.
Previous CNN-based methods suffer from partial activation issues, concentrating on the object's discriminative part instead of the entire entity scope.
We propose a novel Semantic-Constraint Matching Network (SCMN) via a transformer to converge on the divergent activation.
arXiv Detail & Related papers (2023-09-04T03:20:31Z) - Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding [39.424931953675994]
Self-supervised pretraining (SSP) has emerged as a popular technique in machine learning, enabling the extraction of meaningful feature representations without labelled data.
This study endeavours to evaluate the effectiveness of pure self-supervised learning (SSL) techniques in computer vision tasks.
arXiv Detail & Related papers (2023-08-22T13:55:57Z) - From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection [36.9781808268263]
Few-shot keypoint detection (FSKD) attempts to localize any keypoints, including novel or base keypoints, depending on the reference samples.
FSKD requires the semantically meaningful relations for keypoint similarity learning to overcome the ubiquitous noise and ambiguous local patterns.
We present a novel saliency-guided vision transformer, dubbed SalViT, for few-shot keypoint detection.
arXiv Detail & Related papers (2023-04-06T15:22:34Z) - Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z) - PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered solving vision tasks with transformers; it directly translates the image feature map into the object detection result.
The approach also generalizes to the recent transformer-based image recognition model ViT and shows consistent efficiency gains.
arXiv Detail & Related papers (2021-09-15T01:10:30Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Pose Discrepancy Spatial Transformer Based Feature Disentangling for Partial Aspect Angles SAR Target Recognition [11.552273102567048]
This letter presents a novel framework termed DistSTN for the task of synthetic aperture radar (SAR) automatic target recognition (ATR).
In contrast to the conventional SAR ATR algorithms, DistSTN considers a more challenging practical scenario for non-cooperative targets.
We develop an amortized inference scheme that enables efficient feature extraction and recognition using an encoder-decoder mechanism.
arXiv Detail & Related papers (2021-03-07T11:47:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences.