ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers
- URL: http://arxiv.org/abs/2209.05654v1
- Date: Tue, 13 Sep 2022 00:11:16 GMT
- Title: ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers
- Authors: Achin Jain, Kibok Lee, Gurumurthy Swaminathan, Hao Yang, Bernt Schiele, Avinash Ravichandran, Onkar Dabeer
- Abstract summary: ComplETR is designed to explicitly complete missing annotations in partially annotated dense scene datasets.
This reduces the need to annotate every object instance in the scene, thereby reducing annotation cost.
We show performance improvements for several popular detectors, such as Faster R-CNN, Cascade R-CNN, CenterNet2, and Deformable DETR.
- Score: 73.29057814695459
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Annotating bounding boxes for object detection is expensive, time-consuming, and error-prone. In this work, we propose a DETR-based framework called ComplETR that is designed to explicitly complete missing annotations in partially annotated dense scene datasets. This reduces the need to annotate every object instance in the scene, thereby reducing annotation cost. ComplETR augments object queries in the DETR decoder with patch information of objects in the image. Combined with a matching loss, it can effectively find objects that are similar to the input patch and complete the missing annotations. We show that our framework by itself outperforms state-of-the-art methods such as Soft Sampling and Unbiased Teacher, while it can also be used in conjunction with these methods to further improve their performance. Our framework is also agnostic to the choice of the downstream object detector; we show performance improvements for several popular detectors, such as Faster R-CNN, Cascade R-CNN, CenterNet2, and Deformable DETR, on multiple dense scene datasets.
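The query-augmentation step is the core mechanism in the abstract, so a small sketch may help. The PyTorch code below is a minimal illustration, assuming additive conditioning of learned object queries on a pooled patch embedding; the module name, the conditioning choice, and all dimensions are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of the idea in the abstract: condition DETR-style object
# queries on an embedding of an annotated example patch so the decoder can
# find similar, unannotated objects. Module name, additive conditioning,
# and all dimensions are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class PatchConditionedDecoder(nn.Module):  # hypothetical name
    def __init__(self, d_model=256, num_queries=100, num_layers=6, nhead=8):
        super().__init__()
        self.queries = nn.Embedding(num_queries, d_model)  # learned object queries
        self.patch_proj = nn.Linear(d_model, d_model)      # projects the patch embedding
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)

    def forward(self, memory, patch_embed):
        # memory: (B, HW, d_model) encoder features of the full image
        # patch_embed: (B, d_model) pooled embedding of one annotated object patch
        B = memory.size(0)
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)
        # Augment every object query with the patch information; additive
        # conditioning is one plausible choice (concatenation would also work).
        q = q + self.patch_proj(patch_embed).unsqueeze(1)
        return self.decoder(q, memory)  # (B, num_queries, d_model)

# Dummy usage: decoded queries would feed class/box heads, and a matching
# loss would pull queries toward instances similar to the input patch.
feats = torch.randn(2, 49, 256)  # fake encoder output (B, HW, C)
patch = torch.randn(2, 256)      # fake patch embedding
print(PatchConditionedDecoder()(feats, patch).shape)  # torch.Size([2, 100, 256])
```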
Related papers
- VADet: Multi-frame LiDAR 3D Object Detection using Variable Aggregation [4.33608942673382]
We propose an efficient adaptive method, which we call VADet, for variable aggregation.
VADet performs aggregation per object, with the number of frames determined by an object's observed properties, such as speed and point density.
To demonstrate its benefits, we apply VADet to three popular single-stage detectors and achieve state-of-the-art performance on a dataset.
arXiv Detail & Related papers (2024-11-20T10:36:41Z)
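The VADet entry above describes choosing an aggregation window per object from its observed speed and point density; a toy version of such a policy is sketched below. The thresholds, the linear rule, and the function name are invented for illustration and are not VADet's actual policy.

```python
# Illustrative per-object variable aggregation: fast-moving or densely
# observed objects get fewer aggregated LiDAR frames, while slow, sparse
# objects get more. The numbers here are assumptions, not VADet's.
def frames_to_aggregate(speed_mps: float, points_per_frame: int,
                        min_frames: int = 1, max_frames: int = 10) -> int:
    """Pick how many past LiDAR frames to aggregate for one object."""
    # Fast objects smear when aggregated over long windows, so cap frames low.
    if speed_mps > 10.0:
        return min_frames
    # Sparse objects benefit from more frames to densify their point cloud.
    density_bonus = max(0, (100 - points_per_frame) // 20)
    speed_penalty = int(speed_mps)  # roughly one fewer frame per m/s
    return max(min_frames,
               min(max_frames, min_frames + density_bonus + 4 - speed_penalty))

# Example: a slow, sparsely observed pedestrian vs. a fast car.
print(frames_to_aggregate(speed_mps=0.2, points_per_frame=15))    # 9 frames
print(frames_to_aggregate(speed_mps=15.0, points_per_frame=300))  # 1 frame
```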
- Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection [48.429555904690595]
We introduce spatially decoupled DETR, which includes a task-aware query generation module and a disentangled feature learning process.
We demonstrate that our approach achieves a significant improvement on the MSCOCO dataset compared to previous work.
arXiv Detail & Related papers (2023-10-24T15:54:11Z)
- Object-Centric Multiple Object Tracking [124.30650395969126]
This paper proposes a video object-centric model for multiple-object tracking pipelines.
It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module.
Benefiting from object-centric learning, we only require sparse detection labels for object localization and feature binding.
arXiv Detail & Related papers (2023-09-01T03:34:12Z)
- Correspondence Matters for Video Referring Expression Comprehension [64.60046797561455]
Video Referring Expression Comprehension (REC) aims to localize the referent objects described in a sentence to visual regions in the video frames.
Existing methods suffer from two problems: 1) inconsistent localization results across video frames; 2) confusion between the referent and contextual objects.
We propose a novel Dual Correspondence Network (dubbed DCNet) that explicitly enhances the dense associations in both the inter-frame and cross-modal manners.
arXiv Detail & Related papers (2022-07-21T10:31:39Z)
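The dense inter-frame and cross-modal associations in the DCNet entry above can be pictured as all-pairs affinity matrices. The sketch below assumes cosine similarity over token features; the shapes and the normalization are illustrative assumptions, not DCNet's actual design.

```python
# Toy version of "dense association": cosine-similarity affinity matrices
# between frame features (inter-frame) and between visual and language
# features (cross-modal). Shapes are assumptions for illustration.
import torch
import torch.nn.functional as F

def dense_affinity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """All-pairs cosine similarity between token sets: (N, C) x (M, C) -> (N, M)."""
    return F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T

frame_t  = torch.randn(196, 256)  # patch features of frame t
frame_t1 = torch.randn(196, 256)  # patch features of frame t+1
words    = torch.randn(12, 256)   # word features of the referring sentence

inter_frame = dense_affinity(frame_t, frame_t1)  # (196, 196): temporal correspondence
cross_modal = dense_affinity(frame_t, words)     # (196, 12): region-word correspondence
print(inter_frame.shape, cross_modal.shape)
```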
- Omni-DETR: Omni-Supervised Object Detection with Transformers [165.4190908259015]
We consider the problem of omni-supervised object detection, which can use unlabeled, fully labeled, and weakly labeled data.
Under this unified architecture, different types of weak labels can be leveraged to generate accurate pseudo labels.
We find that weak annotations help to improve detection performance and that a mixture of them achieves a better trade-off between annotation cost and accuracy.
arXiv Detail & Related papers (2022-03-30T06:36:09Z)
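One concrete way weak labels can vet pseudo labels, in the spirit of the Omni-DETR entry above, is to keep a teacher's confident boxes only when their class appears among the image-level tags. The filtering rule, the threshold, and the function name below are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical pseudo-label filter: keep a teacher's confident boxes only
# when their class is licensed by the image-level (weak) tags.
from typing import List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)

def filter_pseudo_labels(pseudo: List[Tuple[Box, str, float]],
                         image_tags: Set[str],
                         score_thresh: float = 0.7) -> List[Tuple[Box, str]]:
    """Keep confident pseudo boxes whose class appears in the weak tags."""
    kept = []
    for box, cls, score in pseudo:
        if score >= score_thresh and cls in image_tags:
            kept.append((box, cls))
    return kept

teacher_out = [((10, 10, 50, 80), "person", 0.91),
               ((60, 20, 90, 70), "dog", 0.85),    # class not in tags -> dropped
               ((5, 5, 30, 30), "person", 0.40)]   # low score -> dropped
print(filter_pseudo_labels(teacher_out, image_tags={"person", "car"}))
```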
- Dense Relation Distillation with Context-aware Aggregation for Few-Shot Object Detection [18.04185751827619]
Few-shot object detection is challenging since the fine-grained features of a novel object can easily be overlooked when only a few examples are available.
We propose Dense Relation Distillation with Context-aware Aggregation (DCNet) to tackle the few-shot detection problem.
arXiv Detail & Related papers (2021-03-30T05:34:49Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
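The set-based global loss in the DETR entry above rests on one-to-one bipartite matching between predictions and ground truth. The sketch below uses the Hungarian algorithm from SciPy with a plain L1 cost over box centers for brevity; DETR's actual matching cost also includes class probabilities and generalized IoU.

```python
# Bipartite matching at the heart of DETR-style set prediction: assign
# predictions one-to-one to ground-truth objects by minimizing a pairwise
# cost. L1 distance over box centers is a simplification for brevity.
import numpy as np
from scipy.optimize import linear_sum_assignment

pred_boxes = np.array([[0.2, 0.3], [0.7, 0.7], [0.5, 0.1]])  # predicted centers
gt_boxes   = np.array([[0.68, 0.72], [0.21, 0.28]])          # ground-truth centers

# cost[i, j] = L1 distance between prediction i and ground truth j
cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
pred_idx, gt_idx = linear_sum_assignment(cost)
# Unmatched predictions are supervised as "no object" in DETR's loss.
print(list(zip(pred_idx, gt_idx)))  # [(0, 1), (1, 0)]; prediction 2 unmatched
```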
- Robust Object Detection under Occlusion with Context-Aware CompositionalNets [21.303976151518125]
Compositional convolutional neural networks (CompositionalNets) have been shown to be robust at classifying occluded objects.
We propose to overcome two limitations of CompositionalNets, enabling them to detect partially occluded objects.
arXiv Detail & Related papers (2020-05-24T02:57:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.