Transformers in Small Object Detection: A Benchmark and Survey of
State-of-the-Art
- URL: http://arxiv.org/abs/2309.04902v1
- Date: Sun, 10 Sep 2023 00:08:29 GMT
- Title: Transformers in Small Object Detection: A Benchmark and Survey of
State-of-the-Art
- Authors: Aref Miri Rekavandi, Shima Rashidi, Farid Boussaid, Stephen Hoefs,
Emre Akbas, Mohammed Bennamoun
- Abstract summary: Transformers consistently outperformed well-established CNN-based detectors in almost every video or image dataset.
Small objects have been identified as one of the most challenging object types in detection frameworks.
This survey presents a taxonomy of over 60 research studies on developed transformers for the task of small object detection.
- Score: 34.077422623505804
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transformers have rapidly gained popularity in computer vision, especially in
the field of object recognition and detection. Upon examining the outcomes of
state-of-the-art object detection methods, we noticed that transformers
consistently outperformed well-established CNN-based detectors in almost every
video or image dataset. While transformer-based approaches remain at the
forefront of small object detection (SOD) techniques, this paper aims to
explore the performance benefits offered by such extensive networks and
identify potential reasons for their SOD superiority. Small objects have been
identified as one of the most challenging object types in detection frameworks
due to their low visibility. We aim to investigate potential strategies that
could enhance transformers' performance in SOD. This survey presents a taxonomy
of over 60 research studies on developed transformers for the task of SOD,
spanning the years 2020 to 2023. These studies encompass a variety of detection
applications, including small object detection in generic images, aerial
images, medical images, active millimeter-wave images, underwater images, and
videos. We also compile and present a list of 12 large-scale datasets suitable
for SOD that were overlooked in previous studies and compare the performance of
the reviewed studies using popular metrics such as mean Average Precision
(mAP), Frames Per Second (FPS), number of parameters, and more. Researchers can
keep track of newer studies on our web page, which is available at
\url{https://github.com/arekavandi/Transformer-SOD}.
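The survey compares methods using mean Average Precision (mAP) and related metrics. As a refresher on what that metric involves, here is a minimal, illustrative sketch (not from the paper) of IoU-based matching and single-class average precision, without the max-envelope smoothing used by official evaluators:

```python
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def average_precision(preds, gts, iou_thr=0.5):
    """preds: list of (score, box); gts: list of boxes. One image, one class."""
    preds = sorted(preds, key=lambda p: -p[0])
    matched, tps = set(), []
    for _, box in preds:
        # Greedily match each detection (highest score first) to the
        # best unmatched ground-truth box.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j not in matched and iou(box, gt) > best_iou:
                best_iou, best_j = iou(box, gt), j
        if best_iou >= iou_thr:
            matched.add(best_j)
            tps.append(1)
        else:
            tps.append(0)
    # Accumulate area under the precision-recall curve rank by rank.
    ap, tp_count, prev_recall = 0.0, 0, 0.0
    for rank, tp in enumerate(tps, start=1):
        tp_count += tp
        recall = tp_count / len(gts)
        precision = tp_count / rank
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap
```

mAP then averages this per-class AP over all classes (and, in COCO-style evaluation, over several IoU thresholds); small-object AP restricts the ground truth to boxes below an area threshold.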
Related papers
- Bridging the Performance Gap between DETR and R-CNN for Graphical Object
Detection in Document Images [11.648151981111436]
This paper takes an important step in bridging the performance gap between DETR and R-CNN for graphical object detection.
We modify object queries in different ways: using points, using anchor boxes, and adding positive and negative noise to the anchors to boost performance.
We evaluate our approach on four graphical datasets: PubTables, TableBank, NTable and PubLayNet.
arXiv Detail & Related papers (2023-06-23T14:46:03Z)
- Object Detection with Transformers: A Review [11.255962936937744]
This paper provides a comprehensive review of 21 recently proposed advancements in the original DETR model.
We conduct a comparative analysis across various detection transformers, evaluating their performance and network architectures.
We hope that this study will ignite further interest among researchers in addressing the existing challenges and exploring the application of transformers in the object detection domain.
arXiv Detail & Related papers (2023-06-07T16:13:38Z)
- Aerial Image Object Detection With Vision Transformer Detector (ViTDet) [0.0]
Vision Transformer Detector (ViTDet) was proposed to extract multi-scale features for object detection.
ViTDet's simple design achieves good performance on natural scene images and can be easily embedded into any detector architecture.
Our results show that ViTDet can consistently outperform its convolutional neural network counterparts on horizontal bounding box (HBB) object detection.
arXiv Detail & Related papers (2023-01-28T02:25:30Z)
- Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors.
First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning.
Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
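As a rough illustration of the multi-scale idea behind MS-A (a sketch based only on the abstract above, not the authors' code), one can pool a single-scale token grid into coarser grids and let queries attend over tokens from all scales at once:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def build_multiscale_tokens(feat, scales=(1, 2)):
    """feat: (H, W, C) single-scale feature map.
    Average-pools non-overlapping s x s windows for each stride s and
    concatenates the resulting token sets into one (N_total, C) array."""
    h, w, c = feat.shape
    tokens = []
    for s in scales:
        pooled = (feat[: h // s * s, : w // s * s]
                  .reshape(h // s, s, w // s, s, c)
                  .mean(axis=(1, 3)))
        tokens.append(pooled.reshape(-1, c))
    return np.concatenate(tokens, axis=0)

def multiscale_attention(queries, feat, scales=(1, 2)):
    """Single-head attention where queries attend to tokens from all scales."""
    kv = build_multiscale_tokens(feat, scales)                    # (N, C)
    attn = softmax(queries @ kv.T / np.sqrt(queries.shape[-1]))   # (Q, N)
    return attn @ kv                                              # (Q, C)
```

The fine-scale tokens preserve the detail that small objects need, while the pooled tokens add context; the actual MS-A operation is more elaborate than this single-head sketch.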
arXiv Detail & Related papers (2023-01-06T18:52:12Z)
- An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
- Searching Intrinsic Dimensions of Vision Transformers [6.004704152622424]
We propose SiDT, a method for pruning vision transformer backbones on more complicated vision tasks like object detection.
Experiments on the CIFAR-100 and COCO datasets show that backbones with 20% or 40% of dimensions/parameters pruned can match or even exceed the performance of the unpruned models.
arXiv Detail & Related papers (2022-04-16T05:16:35Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- DA-DETR: Domain Adaptive Detection Transformer with Information Fusion [53.25930448542148]
DA-DETR is a domain adaptive object detection transformer that introduces information fusion for effective transfer from a labeled source domain to an unlabeled target domain.
We introduce a novel CNN-Transformer Blender (CTBlender) that fuses the CNN features and Transformer features ingeniously for effective feature alignment and knowledge transfer across domains.
CTBlender employs the Transformer features to modulate the CNN features across multiple scales where the high-level semantic information and the low-level spatial information are fused for accurate object identification and localization.
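The modulation described above can be sketched loosely as transformer features gating CNN features at each scale, for example via a sigmoid gate. This is an assumption-laden illustration of the general idea, not the actual DA-DETR/CTBlender implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def modulate(cnn_feats, tr_feats):
    """Fuse per-scale CNN feature maps with transformer features by gating.

    cnn_feats: list of (H_i, W_i, C) maps from the CNN backbone
               (low-level spatial detail).
    tr_feats:  list of (H_i, W_i, C) maps carrying transformer semantics
               (high-level context).
    Returns fused maps in which the semantic activations modulate the
    spatial detail at each scale.
    """
    return [c * sigmoid(t) for c, t in zip(cnn_feats, tr_feats)]
```

A gate of this form lets strong semantic responses pass CNN detail through while suppressing background activations; the paper's blender fuses the two streams with a more sophisticated mechanism.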
arXiv Detail & Related papers (2021-03-31T13:55:56Z)
- Robust and Accurate Object Detection via Adversarial Learning [111.36192453882195]
This work augments the fine-tuning stage for object detectors by exploring adversarial examples.
Our approach boosts the performance of state-of-the-art EfficientDets by +1.1 mAP on the object detection benchmark.
arXiv Detail & Related papers (2021-03-23T19:45:26Z)
- Perceiving Traffic from Aerial Images [86.994032967469]
We propose an object detection method called Butterfly Detector that is tailored to detect objects in aerial images.
We evaluate our Butterfly Detector on two publicly available UAV datasets (UAVDT and VisDrone 2019) and show that it outperforms previous state-of-the-art methods while remaining real-time.
arXiv Detail & Related papers (2020-09-16T11:37:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.