DECO: Unleashing the Potential of ConvNets for Query-based Detection and Segmentation
- URL: http://arxiv.org/abs/2312.13735v2
- Date: Thu, 27 Feb 2025 14:58:37 GMT
- Title: DECO: Unleashing the Potential of ConvNets for Query-based Detection and Segmentation
- Authors: Xinghao Chen, Siwei Li, Yijing Yang, Yunhe Wang
- Abstract summary: We propose a novel mechanism dubbed InterConv to perform interaction between object queries and image features via convolutional layers. With the proposed InterConv, we build Detection ConvNet (DECO), which is composed of a backbone and a convolutional encoder-decoder architecture. Our DECO achieves competitive performance in terms of detection accuracy and running speed.
- Score: 22.19064240105095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer and its variants have shown great potential for various vision tasks in recent years, including image classification, object detection and segmentation. Meanwhile, recent studies also reveal that with proper architecture design, convolutional networks (ConvNets) also achieve competitive performance with transformers. However, no prior method has explored using pure convolution to build a Transformer-style decoder module, which is essential for encoder-decoder architectures like the Detection Transformer (DETR). To this end, in this paper we explore whether we can build a query-based detection and segmentation framework with ConvNets instead of sophisticated transformer architectures. We propose a novel mechanism dubbed InterConv to perform interaction between object queries and image features via convolutional layers. Equipped with the proposed InterConv, we build Detection ConvNet (DECO), which is composed of a backbone and a convolutional encoder-decoder architecture. We compare the proposed DECO against prior detectors on the challenging COCO benchmark. Despite its simplicity, our DECO achieves competitive performance in terms of detection accuracy and running speed. Specifically, with ResNet-18 and ResNet-50 backbones, our DECO achieves $40.5\%$ and $47.8\%$ AP at $66$ and $34$ FPS, respectively. The proposed method is also evaluated on the segment anything task, demonstrating similar performance and higher efficiency. We hope the proposed method brings another perspective for designing architectures for vision tasks. Code is available at https://github.com/xinghaochen/DECO and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/DECO.
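The abstract leaves InterConv at a high level. As one plausible reading of how object queries could interact with image features using only convolutions, here is a hypothetical PyTorch sketch: the queries are arranged as a small 2D grid, broadcast onto the feature map, mixed with depthwise and pointwise convolutions, and pooled back into query vectors. The module name, layer sizes, and fusion scheme are all assumptions rather than the authors' implementation; see the linked repositories for the real code.

```python
# Hypothetical sketch of convolution-based query/feature interaction in the
# spirit of InterConv; NOT the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvQueryInteraction(nn.Module):
    def __init__(self, dim=256, num_queries=100, grid=10):
        super().__init__()
        assert grid * grid == num_queries
        self.grid = grid
        # depthwise conv mixes spatial context, pointwise conv mixes channels
        self.dw = nn.Conv2d(dim, dim, 9, padding=4, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)

    def forward(self, queries, feats):
        # queries: (B, N, C); feats: (B, C, H, W)
        B, N, C = queries.shape
        q = queries.transpose(1, 2).reshape(B, C, self.grid, self.grid)
        # broadcast the query grid onto the feature map, fuse by convolution
        q_up = F.interpolate(q, size=feats.shape[-2:], mode="nearest")
        fused = self.pw(self.dw(q_up + feats))
        # pool the fused map back down to one vector per query
        q_out = F.adaptive_avg_pool2d(fused, (self.grid, self.grid))
        return q_out.flatten(2).transpose(1, 2)  # (B, N, C)
```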
Related papers
- Deep Equilibrium Object Detection [24.69829309391189]
We present a new query-based object detector (DEQDet) by designing a deep equilibrium decoder.
Our experiments demonstrate DEQDet converges faster, consumes less memory, and achieves better results than the baseline counterpart.
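To make the deep-equilibrium idea concrete: instead of stacking several distinct decoder layers, a DEQ-style decoder iterates a single shared refinement layer until the queries stop changing. The sketch below only shows the forward fixed-point loop with a stand-in `refine` module; DEQ models normally pair this with a proper root-finding solver and implicit differentiation, which this illustration omits.

```python
# Hedged sketch of a deep-equilibrium decoder forward pass: iterate one
# shared refinement layer to an approximate fixed point z* = refine(z*, x).
import torch.nn as nn

def equilibrium_decode(refine: nn.Module, queries, feats,
                       max_iter=20, tol=1e-4):
    z = queries
    for _ in range(max_iter):
        z_next = refine(z, feats)              # one shared refinement step
        if (z_next - z).norm() < tol * z.norm():
            break                              # queries have stopped changing
        z = z_next
    return z_next
```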
arXiv Detail & Related papers (2023-08-18T13:56:03Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition [158.15602882426379]
This paper does not attempt to design a state-of-the-art method for visual recognition but investigates a more efficient way to make use of convolutions to encode spatial features.
By comparing the design principles of recent convolutional neural networks (ConvNets) and Vision Transformers, we propose to simplify self-attention by leveraging a convolutional modulation operation.
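Sketched from the abstract, convolutional modulation replaces the softmax attention map with the output of a large-kernel depthwise convolution, which element-wise modulates a linear "value" branch. The kernel size and exact layer layout below are assumptions:

```python
# Convolutional modulation block, sketched from the Conv2Former abstract:
# a depthwise-conv branch gates a pointwise "value" branch elementwise.
import torch.nn as nn

class ConvModulation(nn.Module):
    def __init__(self, dim=64, kernel=11):
        super().__init__()
        self.a = nn.Sequential(                 # "attention" branch
            nn.Conv2d(dim, dim, 1),
            nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim),
        )
        self.v = nn.Conv2d(dim, dim, 1)         # "value" branch
        self.proj = nn.Conv2d(dim, dim, 1)      # output projection

    def forward(self, x):                       # x: (B, C, H, W)
        return self.proj(self.a(x) * self.v(x))
```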
arXiv Detail & Related papers (2022-11-22T01:39:45Z) - HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions [109.33112814212129]
We show that input-adaptive, long-range and high-order spatial interactions can be efficiently implemented with a convolution-based framework.
We present the Recursive Gated Convolution ($g^n$Conv) that performs high-order spatial interactions with gated convolutions.
Based on the operation, we construct a new family of generic vision backbones named HorNet.
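A simplified sketch of the recursion: each stage multiplies a projected feature by a convolutional gate, so n stages realize n-th order spatial interactions. The real $g^n$Conv uses progressive channel splits and per-order widths, which this illustration omits:

```python
# Simplified, hypothetical sketch of recursive gating in the spirit of
# g^nConv; only the gating recursion is shown, not HorNet's exact design.
import torch.nn as nn

class RecursiveGatedConv(nn.Module):
    def __init__(self, dim=64, order=3):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 7, padding=3, groups=dim)
        self.proj_in = nn.Conv2d(dim, dim, 1)
        self.mix = nn.ModuleList(nn.Conv2d(dim, dim, 1) for _ in range(order))
        self.proj_out = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        p = self.proj_in(x)
        g = self.dw(x)                 # spatial context used as the gate
        for mix in self.mix:           # each pass raises interaction order
            p = mix(p) * g
        return self.proj_out(p)
```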
arXiv Detail & Related papers (2022-07-28T17:59:02Z) - Deep Gradient Learning for Efficient Camouflaged Object Detection [152.24312279220598]
This paper introduces DGNet, a novel deep framework that exploits object gradient supervision for camouflaged object detection (COD).
Benefiting from the simple but efficient framework, DGNet outperforms existing state-of-the-art COD models by a large margin.
Results also show that the proposed DGNet performs well in polyp segmentation, defect detection, and transparent object segmentation tasks.
arXiv Detail & Related papers (2022-05-25T15:25:18Z) - A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection [59.21990697929617]
Humans tend to mine objects by learning from a group of images or several frames of video since we live in a dynamic world.
Previous approaches design separate networks for these similar tasks, which makes them difficult to apply to one another.
We introduce a unified framework to tackle these issues, termed UFO (Unified Framework for Co-Object segmentation).
arXiv Detail & Related papers (2022-03-09T13:35:19Z) - Anchor DETR: Query Design for Transformer-Based Detector [24.925317590675203]
We propose a novel query design for the transformer-based detectors.
Object queries are based on anchor points, which are widely used in CNN-based detectors.
Our design can predict multiple objects at one position, addressing the difficulty of "one region, multiple objects".
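A hedged sketch of this query design: each query embeds a 2D anchor point, and a few shared "pattern" embeddings per point let one position predict multiple objects. The MLP embedding and all dimensions below are illustrative, not the paper's exact parameterization (which, as best recalled, encodes the points sinusoidally):

```python
# Illustrative sketch of anchor-point object queries with per-point
# "patterns"; dimensions and the point embedding are assumptions.
import torch
import torch.nn as nn

class AnchorQueries(nn.Module):
    def __init__(self, num_points=300, num_patterns=3, dim=256):
        super().__init__()
        # learnable anchor points in [0, 1]^2, plus shared pattern embeddings
        self.points = nn.Parameter(torch.rand(num_points, 2))
        self.patterns = nn.Embedding(num_patterns, dim)
        self.embed = nn.Sequential(nn.Linear(2, dim), nn.ReLU(),
                                   nn.Linear(dim, dim))

    def forward(self):
        pos = self.embed(self.points)                     # (P, C)
        q = pos[:, None, :] + self.patterns.weight[None]  # (P, M, C)
        return q.flatten(0, 1)                            # (P*M, C) queries
```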
arXiv Detail & Related papers (2021-09-15T06:31:55Z) - Vision Transformer Architecture Search [64.73920718915282]
Current vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks.
We propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets.
Our searched architecture achieves $74.7\%$ top-$1$ accuracy on ImageNet, $2.5\%$ higher than the current baseline ViT architecture.
arXiv Detail & Related papers (2021-06-25T15:39:08Z) - Oriented Object Detection with Transformer [51.634913687632604]
We implement Oriented Object DEtection with TRansformer (O$^2$DETR) based on an end-to-end network.
We design a simple but highly efficient encoder for Transformer by replacing the attention mechanism with depthwise separable convolution.
Our O$^2$DETR can serve as a new benchmark in the field of oriented object detection, achieving up to a 3.85 mAP improvement over Faster R-CNN and RetinaNet.
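The encoder substitution is easy to sketch: a depthwise convolution mixes spatial positions, a pointwise convolution mixes channels, and a residual connection preserves the Transformer-block structure. Kernel size and normalization below are assumptions:

```python
# Sketch of an encoder block with attention replaced by a depthwise
# separable convolution; kernel size and norm choice are assumptions.
import torch.nn as nn

class DepthwiseSeparableEncoderBlock(nn.Module):
    def __init__(self, dim=256, kernel=7):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.pw = nn.Conv2d(dim, dim, 1)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):              # x: (B, C, H, W)
        return x + self.act(self.norm(self.pw(self.dw(x))))
```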
arXiv Detail & Related papers (2021-06-06T14:57:17Z) - Efficient DETR: Improving End-to-End Object Detector with Dense Prior [7.348184873564071]
We propose Efficient DETR, a simple and efficient pipeline for end-to-end object detection.
By taking advantage of both dense detection and sparse set detection, Efficient DETR leverages a dense prior to initialize the object containers.
Experiments conducted on MS COCO show that our method, with only 3 encoder layers and 1 decoder layer, achieves competitive performance with state-of-the-art object detection methods.
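The dense-prior idea can be sketched as follows: score every feature-map location with a dense head, keep the top-k, and reuse their features to initialize the object containers (queries). Efficient DETR also carries the corresponding box proposals over as reference points, which this sketch omits; shapes and the score head are illustrative.

```python
# Hedged sketch of dense-prior query initialization: the top-k scoring
# locations of a dense head become the initial object queries.
import torch

def init_queries_from_dense_prior(feats, score_head, k=100):
    # feats: (B, C, H, W); score_head: e.g. torch.nn.Linear(C, 1)
    B, C, H, W = feats.shape
    flat = feats.flatten(2).transpose(1, 2)    # (B, H*W, C)
    scores = score_head(flat).squeeze(-1)      # (B, H*W) objectness logits
    topk = scores.topk(k, dim=1).indices       # (B, k) best locations
    idx = topk.unsqueeze(-1).expand(-1, -1, C)
    return flat.gather(1, idx)                 # (B, k, C) initial queries
```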
arXiv Detail & Related papers (2021-04-03T06:14:24Z) - Learning Spatio-Temporal Transformer for Visual Tracking [108.11680070733598]
We present a new tracking architecture with an encoder-decoder transformer as the key component.
The whole method is end-to-end and does not need any postprocessing steps such as cosine windowing or bounding box smoothing.
The proposed tracker achieves state-of-the-art performance on five challenging short-term and long-term benchmarks, while running real-time speed, being 6x faster than Siam R-CNN.
arXiv Detail & Related papers (2021-03-31T15:19:19Z) - Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
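A minimal sketch of the sequence-to-sequence formulation: a strided convolution embeds the image as a sequence of patch tokens, a plain transformer encoder models global context at every layer, and a simple 1x1-conv head with bilinear upsampling decodes the mask. Layer counts and sizes are illustrative and far smaller than SETR's, and positional embeddings are omitted for brevity:

```python
# Toy sequence-to-sequence segmenter in the spirit of SETR; sizes are
# illustrative, positional embeddings omitted.
import torch.nn as nn
import torch.nn.functional as F

class TinySETR(nn.Module):
    def __init__(self, dim=256, patch=16, num_classes=150):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Conv2d(dim, num_classes, 1)

    def forward(self, x):                       # x: (B, 3, H, W)
        f = self.patchify(x)                    # (B, C, H/16, W/16)
        B, C, h, w = f.shape
        seq = self.encoder(f.flatten(2).transpose(1, 2))  # (B, h*w, C)
        f = seq.transpose(1, 2).reshape(B, C, h, w)
        return F.interpolate(self.head(f), scale_factor=16,
                             mode="bilinear", align_corners=False)
```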
arXiv Detail & Related papers (2020-12-31T18:55:57Z) - VarifocalNet: An IoU-aware Dense Object Detector [11.580759212782812]
We learn an IoU-aware Classification Score (IACS) as a joint representation of object presence confidence and localization accuracy.
We show that dense object detectors can achieve a more accurate ranking of candidate detections based on the IACS.
We build an IoU-aware dense object detector based on the FCOS+ATSS architecture, which we call VarifocalNet (VFNet for short).
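To make IACS concrete, here is the varifocal loss that trains it, sketched from the paper: the target q is the predicted box's IoU for positives and 0 for negatives; positives are weighted by q itself so high-quality boxes dominate, while negatives get a focal-style down-weighting. The alpha=0.75, gamma=2.0 defaults are as best recalled:

```python
# Varifocal loss sketch: trains the classifier to output an IoU-aware score.
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, q, alpha=0.75, gamma=2.0):
    # pred_logits: raw scores; q: IACS targets (IoU for pos, 0 for neg)
    p = pred_logits.sigmoid()
    pos = q > 0
    # positives weighted by their IoU target, negatives focal-style
    weight = torch.where(pos, q, alpha * p.detach().pow(gamma))
    bce = F.binary_cross_entropy_with_logits(pred_logits, q, reduction="none")
    return (weight * bce).sum()
```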
arXiv Detail & Related papers (2020-08-31T05:12:21Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
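The set-based loss hinges on a one-to-one bipartite matching between predictions and ground truth, solvable with the Hungarian algorithm. A minimal sketch using SciPy's solver follows; DETR's full matching cost also includes a generalized-IoU term and tuned weights, which this sketch omits:

```python
# Minimal bipartite matching between predictions and ground truth,
# in the spirit of DETR's set-based loss (class + L1 box cost only).
import torch
from scipy.optimize import linear_sum_assignment

def match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    # pred_logits: (N, K), pred_boxes: (N, 4); gt_labels: (M,), gt_boxes: (M, 4)
    prob = pred_logits.softmax(-1)
    cost_cls = -prob[:, gt_labels]                     # (N, M) class cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)  # (N, M) L1 box cost
    cost = (cost_cls + 5.0 * cost_box).detach().cpu().numpy()
    rows, cols = linear_sum_assignment(cost)           # Hungarian algorithm
    return rows, cols  # prediction rows[i] is matched to ground truth cols[i]
```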
arXiv Detail & Related papers (2020-05-26T17:06:38Z)