Focusing on what to decode and what to train: Efficient Training with
HOI Split Decoders and Specific Target Guided DeNoising
- URL: http://arxiv.org/abs/2307.02291v2
- Date: Mon, 4 Sep 2023 15:03:11 GMT
- Title: Focusing on what to decode and what to train: Efficient Training with
HOI Split Decoders and Specific Target Guided DeNoising
- Authors: Junwen Chen, Yingcheng Wang, Keiji Yanai
- Abstract summary: Recent one-stage transformer-based methods achieve notable gains in the Human-Object Interaction (HOI) detection task by building on the DETR detection framework.
We propose a novel one-stage framework (SOV), which consists of a subject decoder, an object decoder, and a verb decoder.
We propose a novel Specific Target Guided (STG) DeNoising training strategy, which leverages learnable object and verb label embeddings to guide the training and accelerate training convergence.
- Score: 17.268302302974607
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent one-stage transformer-based methods achieve notable gains in the
Human-Object Interaction (HOI) detection task by building on the DETR detection
framework. However, current methods redirect the detection target of the object
decoder, and the box target is not explicitly separated from the query
embeddings, which leads to long and difficult training. Furthermore, matching
predicted HOI instances with the ground truth is more challenging than in
object detection, so simply adapting training strategies from object detection
makes training even harder. To resolve the ambiguity between human and object
detection and to share the prediction burden, we propose a novel one-stage
framework (SOV), which consists of a subject decoder, an object decoder, and a
verb decoder. Moreover, we propose a novel Specific Target Guided (STG)
DeNoising training strategy, which leverages learnable object and verb label
embeddings to guide the training and accelerate training convergence. In
addition, at inference time, label-specific information is fed directly into
the decoders by initializing the query embeddings from the learnable label
embeddings. Without additional features or prior language knowledge, our
method (SOV-STG) achieves higher accuracy than the state-of-the-art method in
one-third of the training epochs. The code is available at
https://github.com/cjw2021/SOV-STG.
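To make the decoder split and the label-embedding-initialized queries more concrete, below is a minimal PyTorch-style sketch of the ideas described in the abstract. It is an illustration under stated assumptions, not the paper's implementation: the class name `SOVSketch`, the use of plain `nn.TransformerDecoder` layers, the layer counts and dimensions, and the way the label embeddings are pooled into initial queries are all assumptions for readability (SOV-STG itself builds on DETR-style deformable decoders; see the official repository for details).

```python
# Minimal sketch of the SOV-STG ideas from the abstract. All names, sizes,
# and the query-initialization scheme below are illustrative assumptions.
import torch
import torch.nn as nn


class SOVSketch(nn.Module):
    def __init__(self, d_model=256, num_queries=64,
                 num_obj_classes=80, num_verb_classes=117):
        super().__init__()

        def make_decoder():
            layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            return nn.TransformerDecoder(layer, num_layers=3)

        # Split decoders: subject (human) boxes, object boxes, and verb labels
        # are each predicted by a dedicated decoder instead of one shared one.
        self.subject_decoder = make_decoder()
        self.object_decoder = make_decoder()
        self.verb_decoder = make_decoder()

        # Learnable object and verb label embeddings: used by STG DeNoising
        # during training and to initialize the queries at inference.
        self.obj_label_embed = nn.Embedding(num_obj_classes, d_model)
        self.verb_label_embed = nn.Embedding(num_verb_classes, d_model)
        self.query_proj = nn.Linear(2 * d_model, d_model)
        self.num_queries = num_queries

        # Prediction heads.
        self.subject_box_head = nn.Linear(d_model, 4)
        self.object_box_head = nn.Linear(d_model, 4)
        self.object_cls_head = nn.Linear(d_model, num_obj_classes)
        self.verb_cls_head = nn.Linear(d_model, num_verb_classes)

    def init_queries(self, batch_size):
        # Feed label-specific information into the decoders by building the
        # initial queries from the learnable label embeddings. Mean-pooling
        # the two embedding tables is an assumption made for this sketch.
        obj = self.obj_label_embed.weight.mean(dim=0)
        verb = self.verb_label_embed.weight.mean(dim=0)
        q = self.query_proj(torch.cat([obj, verb], dim=-1))
        return q.expand(batch_size, self.num_queries, -1)

    def forward(self, memory):
        # memory: encoder features of shape (batch, tokens, d_model).
        queries = self.init_queries(memory.size(0))
        sub_feat = self.subject_decoder(queries, memory)
        obj_feat = self.object_decoder(queries, memory)
        # Fusing subject and object features by addition before the verb
        # decoder is an illustrative simplification.
        verb_feat = self.verb_decoder(sub_feat + obj_feat, memory)
        return {
            "subject_boxes": self.subject_box_head(sub_feat).sigmoid(),
            "object_boxes": self.object_box_head(obj_feat).sigmoid(),
            "object_logits": self.object_cls_head(obj_feat),
            "verb_logits": self.verb_cls_head(verb_feat),
        }


if __name__ == "__main__":
    model = SOVSketch()
    feats = torch.randn(2, 100, 256)  # dummy encoder output
    out = model(feats)
    print({k: tuple(v.shape) for k, v in out.items()})
```

The intent of the split is that each decoder carries only one prediction burden (subject boxes, object boxes, or verbs), while the learnable label embeddings supply the label-specific prior that the abstract describes for both STG DeNoising training and query initialization at inference.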
Related papers
- A Fresh Take on Stale Embeddings: Improving Dense Retriever Training with Corrector Networks [81.2624272756733]
In dense retrieval, deep encoders provide embeddings for both inputs and targets.
We train a small parametric corrector network that adjusts stale cached target embeddings.
Our approach matches state-of-the-art results even when no target embedding updates are made during training.
arXiv Detail & Related papers (2024-09-03T13:29:13Z)
- Few-Shot Object Detection with Sparse Context Transformers [37.106378859592965]
Few-shot detection is a major task in pattern recognition which seeks to localize objects using models trained with only a few labeled examples.
We propose a novel sparse context transformer (SCT) that effectively leverages object knowledge in the source domain, and automatically learns a sparse context from only a few training images in the target domain.
We evaluate the proposed method on two challenging few-shot object detection benchmarks, and empirical results show that the proposed method obtains competitive performance compared to the related state-of-the-art.
arXiv Detail & Related papers (2024-02-14T17:10:01Z)
- Aligned Unsupervised Pretraining of Object Detectors with Self-training [41.03780087924593]
Unsupervised pretraining of object detectors has recently become a key component of object detector training.
We propose a framework that mitigates this issue and consists of three simple yet key ingredients.
We show that our strategy is also capable of pretraining from scratch (including the backbone) and works on complex images like COCO.
arXiv Detail & Related papers (2023-07-28T17:46:00Z)
- Label-Efficient Object Detection via Region Proposal Network Pre-Training [58.50615557874024]
We propose a simple pretext task that provides effective pre-training for the region proposal network (RPN).
In comparison with multi-stage detectors without RPN pre-training, our approach is able to consistently improve downstream task performance.
arXiv Detail & Related papers (2022-11-16T16:28:18Z)
- Label, Verify, Correct: A Simple Few Shot Object Detection Method [93.84801062680786]
We introduce a simple pseudo-labelling method to source high-quality pseudo-annotations from a training set.
We present two novel methods to improve the precision of the pseudo-labelling process.
Our method achieves state-of-the-art or second-best performance compared to existing approaches.
arXiv Detail & Related papers (2021-12-10T18:59:06Z)
- Activation to Saliency: Forming High-Quality Labels for Unsupervised Salient Object Detection [54.92703325989853]
We propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues.
No human annotations are involved in our framework during the whole training process.
Our framework achieves significant performance gains over existing USOD methods.
arXiv Detail & Related papers (2021-12-07T11:54:06Z)
- Aligning Pretraining for Detection via Object-Level Contrastive Learning [57.845286545603415]
Image-level contrastive representation learning has proven to be highly effective as a generic model for transfer learning.
We argue that this could be sub-optimal and thus advocate a design principle which encourages alignment between the self-supervised pretext task and the downstream task.
Our method, called Selective Object COntrastive learning (SoCo), achieves state-of-the-art results for transfer performance on COCO detection.
arXiv Detail & Related papers (2021-06-04T17:59:52Z)
- LabelEnc: A New Intermediate Supervision Method for Object Detection [78.74368141062797]
We propose a new intermediate supervision method, named LabelEnc, to boost the training of object detection systems.
The key idea is to introduce a novel label encoding function, mapping the ground-truth labels into a latent embedding space.
Experiments show our method improves a variety of detection systems by around 2% on the COCO dataset.
arXiv Detail & Related papers (2020-07-07T08:55:05Z)
- Context-Transformer: Tackling Object Confusion for Few-Shot Detection [0.0]
We propose a novel Context-Transformer within a concise deep transfer framework.
Context-Transformer can effectively leverage source-domain object knowledge as guidance.
It can adaptively integrate these relational clues to enhance the discriminative power of the detector.
arXiv Detail & Related papers (2020-03-16T16:17:11Z)