Simple Training Strategies and Model Scaling for Object Detection
- URL: http://arxiv.org/abs/2107.00057v1
- Date: Wed, 30 Jun 2021 18:41:47 GMT
- Title: Simple Training Strategies and Model Scaling for Object Detection
- Authors: Xianzhi Du, Barret Zoph, Wei-Chih Hung, Tsung-Yi Lin
- Abstract summary: We benchmark improvements on the vanilla ResNet-FPN backbone with RetinaNet and RCNN detectors.
The vanilla detectors are improved by 7.7% in accuracy while running 30% faster.
Our largest Cascade RCNN-RS models achieve 52.9% AP with a ResNet152-FPN backbone and 53.6% with a SpineNet143L backbone.
- Score: 38.27709720726833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The speed-accuracy Pareto curve of object detection systems has advanced
through a combination of better model architectures, training and inference
methods. In this paper, we methodically evaluate a variety of these techniques
to understand where most of the improvements in modern detection systems come
from. We benchmark these improvements on the vanilla ResNet-FPN backbone with
RetinaNet and RCNN detectors. The vanilla detectors are improved by 7.7% in
accuracy while running 30% faster. We further provide simple scaling
strategies to generate a family of models that form two Pareto curves, named
RetinaNet-RS and Cascade RCNN-RS. These simple rescaled detectors explore the
speed-accuracy trade-off between the one-stage RetinaNet detectors and
two-stage RCNN detectors. Our largest Cascade RCNN-RS models achieve 52.9% AP
with a ResNet152-FPN backbone and 53.6% with a SpineNet143L backbone. Finally,
we show the ResNet architecture, with three minor architectural changes,
outperforms EfficientNet as the backbone for object detection and instance
segmentation systems.
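The rescaled detector families described in the abstract form speed-accuracy Pareto curves: a model belongs on the curve only if no other model in the family is both faster and more accurate. A minimal sketch of that selection rule follows; the model names and latency/AP numbers are purely illustrative, not values from the paper.

```python
def pareto_frontier(models):
    """Return the speed-accuracy Pareto-optimal subset of a model family.

    `models` is a list of (name, latency_ms, ap) tuples. A model is on the
    frontier if no other model is at least as fast AND at least as accurate
    (excluding models with an identical latency/AP point).
    """
    frontier = []
    for name, lat, ap in models:
        dominated = any(
            other_lat <= lat and other_ap >= ap and (other_lat, other_ap) != (lat, ap)
            for _, other_lat, other_ap in models
        )
        if not dominated:
            frontier.append((name, lat, ap))
    # Sort fastest-to-slowest so the frontier reads like a Pareto curve.
    return sorted(frontier, key=lambda m: m[1])

# Hypothetical family: (name, latency in ms, COCO AP) -- illustrative numbers only.
family = [
    ("detector-small", 20.0, 42.0),
    ("detector-medium", 45.0, 48.5),
    ("detector-large", 90.0, 52.9),
    ("detector-bad", 100.0, 45.0),  # dominated: slower and less accurate than detector-medium
]
print(pareto_frontier(family))
```

Only the three non-dominated models survive; `detector-bad` is dropped because `detector-medium` is both faster and more accurate.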
Related papers
- CLRKDNet: Speeding up Lane Detection with Knowledge Distillation [4.015241891536452]
We introduce CLRKDNet, a streamlined model that balances detection accuracy with real-time performance.
Our method reduces inference time by up to 60% while maintaining detection accuracy comparable to CLRNet.
arXiv Detail & Related papers (2024-05-21T05:20:04Z)
- TCCT-Net: Two-Stream Network Architecture for Fast and Efficient Engagement Estimation via Behavioral Feature Signals [58.865901821451295]
We present a novel two-stream feature fusion "Tensor-Convolution and Convolution-Transformer Network" (TCCT-Net) architecture.
To better learn the meaningful patterns in the temporal-spatial domain, we design a "CT" stream that integrates a hybrid convolutional-transformer.
In parallel, to efficiently extract rich patterns from the temporal-frequency domain, we introduce a "TC" stream that uses Continuous Wavelet Transform (CWT) to represent information in a 2D tensor form.
arXiv Detail & Related papers (2024-04-15T06:01:48Z)
- Rock Classification Based on Residual Networks [4.256045122451066]
We propose two approaches using residual neural networks to tackle the problem of rock classification.
By modifying kernel sizes, normalization methods and composition based on ResNet34, we achieve an accuracy of 70.1% on the test dataset.
Using a backbone similar to BoTNet, which incorporates multi-head self-attention, we additionally use internal residual connections in our model.
This boosts the model's performance, achieving an accuracy of 73.7% on the test dataset.
arXiv Detail & Related papers (2024-02-19T04:45:15Z)
- Optimizing Anchor-based Detectors for Autonomous Driving Scenes [22.946814647030667]
This paper summarizes model improvements and inference-time optimizations for the popular anchor-based detectors in autonomous driving scenes.
Based on the high-performing RCNN-RS and RetinaNet-RS detection frameworks, we study a set of framework improvements to adapt the detectors to better detect small objects in crowd scenes.
arXiv Detail & Related papers (2022-08-11T22:44:59Z)
- DETR++: Taming Your Multi-Scale Detection Transformer [22.522422934209807]
DETR is a recently introduced Transformer-based detection method.
Due to the quadratic complexity of the self-attention mechanism in the Transformer, DETR is unable to incorporate multi-scale features.
We propose DETR++, a new architecture that improves detection results by 1.9% AP on MS COCO 2017, 11.5% AP on RICO icon detection, and 9.1% AP on RICO layout extraction.
arXiv Detail & Related papers (2022-06-07T02:38:31Z)
- EResFD: Rediscovery of the Effectiveness of Standard Convolution for Lightweight Face Detection [13.357235715178584]
We re-examine the effectiveness of the standard convolutional block as a lightweight backbone architecture for face detection.
We show that heavily channel-pruned standard convolution layers can achieve better accuracy and inference speed.
Our proposed detector EResFD obtained 80.4% mAP on WIDER FACE Hard subset which only takes 37.7 ms for VGA image inference on CPU.
arXiv Detail & Related papers (2022-04-04T02:30:43Z)
- Oriented R-CNN for Object Detection [61.78746189807462]
This work proposes an effective and simple oriented object detection framework, termed Oriented R-CNN.
In the first stage, we propose an oriented Region Proposal Network (oriented RPN) that directly generates high-quality oriented proposals in a nearly cost-free manner.
The second stage is oriented R-CNN head for refining oriented Regions of Interest (oriented RoIs) and recognizing them.
arXiv Detail & Related papers (2021-08-12T12:47:43Z)
- ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
We compare estimation accuracy and fidelity of the generated mixed models, statistical models with the roofline model, and a refined roofline model for evaluation.
arXiv Detail & Related papers (2021-05-07T11:39:05Z)
- Bottleneck Transformers for Visual Recognition [97.16013761605254]
We present BoTNet, a powerful backbone architecture that incorporates self-attention for vision tasks.
We present models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark.
We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
arXiv Detail & Related papers (2021-01-27T18:55:27Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scaled pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
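The final entry above describes an anchor-free detector that predicts an object's center and scale rather than regressing box corners directly. A minimal decoding sketch of that representation follows; the function name and corner-format output are illustrative assumptions, not the paper's API.

```python
def decode_center_scale(cx, cy, w, h):
    """Convert a (center, scale) prediction into a corner-format box.

    Takes a predicted object center (cx, cy) and size (w, h) and returns
    (x1, y1, x2, y2), the top-left and bottom-right corners.
    """
    return (cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0)

# A pedestrian centered at (50, 60) with width 20 and height 40.
print(decode_center_scale(50.0, 60.0, 20.0, 40.0))  # (40.0, 40.0, 60.0, 80.0)
```

Predicting centers and scales keeps the output space anchor-free: the network never has to match ground truth to a predefined anchor grid, which is one reason such heads cope better with small-scale objects.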
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.