Contrastive Proposal Extension with LSTM Network for Weakly Supervised
Object Detection
- URL: http://arxiv.org/abs/2110.07511v2
- Date: Sat, 16 Oct 2021 12:17:18 GMT
- Title: Contrastive Proposal Extension with LSTM Network for Weakly Supervised
Object Detection
- Authors: Pei Lv, Suqi Hu, Tianran Hao, Haohan Ji, Lisha Cui, Haoyi Fan,
Mingliang Xu and Changsheng Xu
- Abstract summary: Weakly supervised object detection (WSOD) has attracted more and more attention since it only uses image-level labels and can save huge annotation costs.
We propose a new method that compares initial proposals with their extended counterparts to optimize those initial proposals.
Experiments on the PASCAL VOC 2007, VOC 2012 and MS-COCO datasets show that our method achieves state-of-the-art results.
- Score: 52.86681130880647
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Weakly supervised object detection (WSOD) has attracted growing
attention since it uses only image-level labels and can save enormous annotation
costs. Most WSOD methods use Multiple Instance Learning (MIL) as their basic
framework, which treats detection as an instance classification problem.
However, these MIL-based methods tend to converge only on the most
discriminative regions of different instances, rather than their corresponding
complete regions; that is, they suffer from insufficient integrity. Inspired by
the way humans observe objects, we propose a new method that compares initial
proposals with their extended counterparts to optimize those initial proposals.
Specifically, we propose a new strategy for WSOD involving contrastive
proposal extension (CPE), which consists of multiple directional contrastive
proposal extensions (D-CPE), where each D-CPE contains encoders based on an LSTM
network and corresponding decoders. First, the boundary of each initial proposal
in MIL is extended to different positions according to a well-designed sequential
order. Then, CPE compares the extended proposal with the initial proposal by
extracting their semantic features with the encoders, and computes the
integrity of the initial proposal to optimize its score. These contrastive
contextual semantics guide the basic WSOD network to suppress bad proposals and
raise the scores of good ones. In addition, a simple two-stream network is
designed as the decoder to constrain the temporal coding of the LSTM and
further improve WSOD performance. Experiments on the PASCAL VOC 2007, VOC 2012
and MS-COCO datasets show that our method achieves state-of-the-art results.
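As a rough illustration of the pipeline described above (not the authors' implementation: the paper's learned LSTM encoders and two-stream decoders are replaced here by a hypothetical `feature_fn` stand-in), the directional extension of a proposal box and a contrastive integrity score can be sketched in plain Python:

```python
# Illustrative sketch of directional contrastive proposal extension (D-CPE).
# All names are hypothetical; `feature_fn` stands in for the learned encoder.

def extend_proposal(box, direction, step):
    """Extend one boundary of an (x1, y1, x2, y2) box outward by `step` pixels."""
    x1, y1, x2, y2 = box
    if direction == "left":
        return (x1 - step, y1, x2, y2)
    if direction == "right":
        return (x1, y1, x2 + step, y2)
    if direction == "up":
        return (x1, y1 - step, x2, y2)
    if direction == "down":
        return (x1, y1, x2, y2 + step)
    raise ValueError(f"unknown direction: {direction}")

def integrity_score(box, feature_fn,
                    directions=("left", "right", "up", "down"), step=8):
    """Compare the initial proposal's features against its directional
    extensions via cosine similarity. If the box already covers the complete
    object, extending it mostly adds background, so the contrast between the
    base features and each extension serves as an integrity cue that can
    re-weight the proposal's MIL score."""
    base = feature_fn(box)
    sims = []
    for d in directions:
        ext = feature_fn(extend_proposal(box, d, step))
        dot = sum(a * b for a, b in zip(base, ext))
        na = sum(a * a for a in base) ** 0.5
        nb = sum(b * b for b in ext) ** 0.5
        sims.append(dot / (na * nb + 1e-8))
    return sum(sims) / len(sims)
```

For example, with a toy `feature_fn` that returns the box's width and height, `integrity_score((0, 0, 10, 10), lambda b: [b[2] - b[0], b[3] - b[1]])` yields a similarity in (0, 1]; in the paper this contrastive signal is produced by LSTM encoders over the sequentially extended boundaries rather than by raw geometry.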
Related papers
- Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z) - CPN: Complementary Proposal Network for Unconstrained Text Detection [7.524080426954018]
We propose a Complementary Proposal Network that seamlessly integrates semantic and geometric information for superior performance.
By leveraging both complementary proposals and features, CPN outperforms state-of-the-art approaches with significant margins under comparable cost.
arXiv Detail & Related papers (2024-02-18T10:43:53Z) - PETDet: Proposal Enhancement for Two-Stage Fine-Grained Object Detection [26.843891792018447]
We present PETDet (Proposal Enhancement for Two-stage fine-grained object detection) to better handle the sub-tasks in two-stage FGOD methods.
An anchor-free Quality Oriented Proposal Network (QOPN) is proposed with dynamic label assignment and attention-based decomposition.
A novel Adaptive Recognition Loss (ARL) offers guidance for the R-CNN head to focus on high-quality proposals.
arXiv Detail & Related papers (2023-12-16T18:04:56Z) - Refine, Discriminate and Align: Stealing Encoders via Sample-Wise Prototypes and Multi-Relational Extraction [57.16121098944589]
RDA is a pioneering approach designed to address two primary deficiencies prevalent in previous endeavors aiming at stealing pre-trained encoders.
It is accomplished via a sample-wise prototype, which consolidates the target encoder's representations for a given sample's various perspectives.
To further improve efficacy, we develop a multi-relational extraction loss that trains the surrogate encoder to discriminate mismatched embedding-prototype pairs.
arXiv Detail & Related papers (2023-12-01T15:03:29Z) - UniInst: Unique Representation for End-to-End Instance Segmentation [29.974973664317485]
We propose a box-free and NMS-free end-to-end instance segmentation framework, termed UniInst.
Specifically, we design an instance-aware one-to-one assignment scheme, which dynamically assigns one unique representation to each instance.
With these techniques, our UniInst, the first FCN-based end-to-end instance segmentation framework, achieves competitive performance.
arXiv Detail & Related papers (2022-05-25T10:40:26Z) - ProposalCLIP: Unsupervised Open-Category Object Proposal Generation via
Exploiting CLIP Cues [49.88590455664064]
ProposalCLIP is able to predict proposals for a large variety of object categories without annotations.
ProposalCLIP also shows benefits for downstream tasks, such as unsupervised object detection.
arXiv Detail & Related papers (2022-01-18T01:51:35Z) - Adaptive Proposal Generation Network for Temporal Sentence Localization
in Videos [58.83440885457272]
We address the problem of temporal sentence localization in videos (TSLV).
Traditional methods follow a top-down framework which localizes the target segment with pre-defined segment proposals.
We propose an Adaptive Proposal Generation Network (APGN) to maintain the segment-level interaction while speeding up the efficiency.
arXiv Detail & Related papers (2021-09-14T02:02:36Z) - VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language
Matching [75.71523183166799]
The prevailing framework for matching multimodal inputs is based on a two-stage process.
We argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages.
We propose VL-NMS, which is the first method to yield query-aware proposals at the first stage.
arXiv Detail & Related papers (2021-05-12T13:05:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.