SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object
Detection with Transformers
- URL: http://arxiv.org/abs/2303.08481v1
- Date: Wed, 15 Mar 2023 09:36:58 GMT
- Title: SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object
Detection with Transformers
- Authors: Guoqiang Jin, Fan Yang, Mingshan Sun, Ruyi Zhao, Yakun Liu, Wei Li,
Tianpeng Bao, Liwei Wu, Xingyu Zeng, Rui Zhao
- Abstract summary: We propose SeqCo-DETR, a Sequence Consistency-based self-supervised method for object DEtection with TRansformers.
Our method achieves state-of-the-art results on MS COCO (45.8 AP) and PASCAL VOC (64.1 AP), demonstrating the effectiveness of our approach.
- Score: 18.803007408124156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-supervised pre-training and transformer-based networks have
significantly improved the performance of object detection. However, most of
the current self-supervised object detection methods are built on
convolutional-based architectures. We believe that the transformers' sequence
characteristics should be considered when designing a transformer-based
self-supervised method for the object detection task. To this end, we propose
SeqCo-DETR, a novel Sequence Consistency-based self-supervised method for
object DEtection with TRansformers. SeqCo-DETR defines a simple but effective
pretext task: it minimizes the discrepancy between the transformer's output
sequences for different views of the same image, and leverages bipartite
matching to find the most relevant sequence pairs, thereby improving
sequence-level self-supervised representation learning. Furthermore, we
provide a mask-based augmentation strategy, combined with the sequence
consistency strategy, to extract more representative contextual information
about the object for the
object detection task. Our method achieves state-of-the-art results on MS COCO
(45.8 AP) and PASCAL VOC (64.1 AP), demonstrating the effectiveness of our
approach.
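A minimal sketch of the core idea, assuming PyTorch and SciPy; `seq_a` and `seq_b` stand for the decoder output sequences of the two image views, and the negative-cosine matching cost is an illustrative assumption rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def sequence_consistency_loss(seq_a: torch.Tensor, seq_b: torch.Tensor) -> torch.Tensor:
    """Bipartite-match the two output sequences, then minimize the
    discrepancy between the matched pairs.

    seq_a, seq_b: (num_queries, dim) decoder output sequences produced
    from two augmented views of the same image.
    """
    a = F.normalize(seq_a, dim=-1)
    b = F.normalize(seq_b, dim=-1)
    # Pairwise matching cost: negative cosine similarity (an
    # illustrative choice, not necessarily the paper's exact cost).
    cost = -(a @ b.t())
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row = torch.as_tensor(row, device=a.device)
    col = torch.as_tensor(col, device=a.device)
    # Consistency term: push matched sequence pairs together.
    return (1.0 - (a[row] * b[col]).sum(dim=-1)).mean()
```

With the paper's mask-based augmentation, one of the two views would be masked, so matching its sequence to the clean view's sequence would encourage the model to recover contextual information about the object.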
Related papers
- MASSFormer: Mobility-Aware Spectrum Sensing using Transformer-Driven
Tiered Structure [3.6194127685460553]
We develop a cooperative sensing method based on a mobility-aware, transformer-driven tiered structure (MASSFormer).
Our method considers a dynamic scenario involving mobile primary users (PUs) and secondary users (SUs).
The proposed method is tested under imperfect reporting channel scenarios to show robustness.
arXiv Detail & Related papers (2024-09-26T05:25:25Z)
- Unifying Feature and Cost Aggregation with Transformers for Semantic and Visual Correspondence [51.54175067684008]
This paper introduces a Transformer-based integrative feature and cost aggregation network designed for dense matching tasks.
We first show that feature aggregation and cost aggregation exhibit distinct characteristics and reveal the potential for substantial benefits stemming from the judicious use of both aggregation processes.
Our framework is evaluated on standard benchmarks for semantic matching, and also applied to geometric matching, where we show that our approach achieves significant improvements compared to existing methods.
arXiv Detail & Related papers (2024-03-17T07:02:55Z)
- Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection [11.928724924319138]
We propose cross-path consistency learning (CPC) to improve HOI detection for transformers.
Our experiments demonstrate the effectiveness of our method, which achieves significant improvements on V-COCO and HICO-DET.
arXiv Detail & Related papers (2022-04-11T02:45:00Z)
- Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z)
- Miti-DETR: Object Detection based on Transformers with Mitigatory Self-Attention Convergence [17.854940064699985]
We propose a transformer architecture with a mitigatory self-attention mechanism.
Miti-DETR carries the input of each attention layer over to that layer's output, so that the "non-attention" information participates in attention propagation.
Miti-DETR significantly enhances average detection precision and convergence speed relative to existing DETR-based models.
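Taken literally, carrying the input over to the output is a residual connection around each attention layer; a minimal sketch of that reading in PyTorch (the layer sizes and LayerNorm placement are illustrative assumptions, not Miti-DETR's actual code):

```python
import torch
import torch.nn as nn

class MitigatoryAttentionBlock(nn.Module):
    """Attention block whose input is carried over to its output, so the
    "non-attention" signal participates in attention propagation."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)  # self-attention over tokens
        return self.norm(x + attn_out)    # input "reserved" to the output
```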
arXiv Detail & Related papers (2021-12-26T03:23:59Z)
- Aligning Pretraining for Detection via Object-Level Contrastive Learning [57.845286545603415]
Image-level contrastive representation learning has proven to be highly effective as a generic model for transfer learning.
We argue that this could be sub-optimal and thus advocate a design principle which encourages alignment between the self-supervised pretext task and the downstream task.
Our method, called Selective Object COntrastive learning (SoCo), achieves state-of-the-art results for transfer performance on COCO detection.
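As a generic illustration of object-level (rather than image-level) contrastive learning — not necessarily SoCo's exact objective, which this summary does not spell out — here is an InfoNCE-style loss over object embeddings pooled from two augmented views:

```python
import torch
import torch.nn.functional as F

def object_level_contrastive_loss(obj_a: torch.Tensor, obj_b: torch.Tensor,
                                  temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE over object embeddings: obj_a[i] and obj_b[i] are features
    of the same object proposal pooled from two augmented views.

    obj_a, obj_b: (num_objects, dim)
    """
    a = F.normalize(obj_a, dim=-1)
    b = F.normalize(obj_b, dim=-1)
    logits = a @ b.t() / temperature       # (num_objects, num_objects)
    targets = torch.arange(a.size(0), device=a.device)
    # Diagonal entries are the positive pairs; all others are negatives.
    return F.cross_entropy(logits, targets)
```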
arXiv Detail & Related papers (2021-06-04T17:59:52Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
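A sketch of how such a set-based global loss composes its four terms, assuming the predictions have already been matched one-to-one to ground-truth poses; the tensor layouts and weights are illustrative assumptions, not POET's actual values:

```python
import torch
import torch.nn.functional as F

def set_based_pose_loss(pred: dict, gt: dict,
                        w_kpt=5.0, w_vis=1.0, w_ctr=1.0, w_cls=1.0):
    """Weighted sum of keypoint, visibility, center, and class terms.

    pred/gt: matched predictions and targets with
      'keypoints' (N, K, 2), 'visibility' (N, K) logits / 0-1 labels,
      'center' (N, 2), 'class' (N, C) logits / (N,) labels.
    """
    l_kpt = F.l1_loss(pred["keypoints"], gt["keypoints"])
    l_vis = F.binary_cross_entropy_with_logits(pred["visibility"],
                                               gt["visibility"].float())
    l_ctr = F.l1_loss(pred["center"], gt["center"])
    l_cls = F.cross_entropy(pred["class"], gt["class"])
    return w_kpt * l_kpt + w_vis * l_vis + w_ctr * l_ctr + w_cls * l_cls
```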
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
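The bipartite matching at the heart of that loss can be sketched as follows; the cost here is simplified to class-probability and L1 box terms, whereas DETR's actual matcher also includes a generalized-IoU term:

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes,
                    w_class: float = 1.0, w_bbox: float = 5.0):
    """One-to-one assignment of predictions to ground-truth objects.

    pred_logits: (num_queries, num_classes), pred_boxes: (num_queries, 4)
    gt_labels: (num_gt,), gt_boxes: (num_gt, 4)
    Returns matched (prediction, ground-truth) index arrays.
    """
    prob = pred_logits.softmax(-1)
    cost_class = -prob[:, gt_labels]                    # (Q, G)
    cost_bbox = torch.cdist(pred_boxes, gt_boxes, p=1)  # (Q, G)
    cost = w_class * cost_class + w_bbox * cost_bbox
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return q_idx, g_idx
```

Unmatched queries are supervised toward a "no object" class, which is what makes the prediction a set rather than a ranked list.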
arXiv Detail & Related papers (2020-05-26T17:06:38Z)