Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
- URL: http://arxiv.org/abs/2009.01449v3
- Date: Wed, 10 Mar 2021 01:25:59 GMT
- Title: Ref-NMS: Breaking Proposal Bottlenecks in Two-Stage Referring Expression Grounding
- Authors: Long Chen, Wenbo Ma, Jun Xiao, Hanwang Zhang, Shih-Fu Chang
- Abstract summary: Ref-NMS is the first method to yield expression-aware proposals at the first stage.
Ref-NMS regards all nouns in the expression as critical objects, and introduces a lightweight module to predict a score for aligning each box with a critical object.
Since Ref-NMS is agnostic to the grounding step, it can be easily integrated into any state-of-the-art two-stage method.
- Score: 80.46288064284084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The prevailing framework for solving referring expression grounding is based
on a two-stage process: 1) detecting proposals with an object detector and 2)
grounding the referent to one of the proposals. Existing two-stage solutions
mostly focus on the grounding step, which aims to align the expressions with
the proposals. In this paper, we argue that these methods overlook an obvious
mismatch between the roles of proposals in the two stages: they generate
proposals solely based on the detection confidence (i.e., expression-agnostic),
hoping that the proposals contain all the right instances mentioned in the expression (i.e.,
expression-aware). Due to this mismatch, current two-stage methods suffer from
a severe performance drop between detected and ground-truth proposals. To this
end, we propose Ref-NMS, which is the first method to yield expression-aware
proposals at the first stage. Ref-NMS regards all nouns in the expression as
critical objects, and introduces a lightweight module to predict a score for
aligning each box with a critical object. These scores can guide the NMS
operation to filter out the boxes irrelevant to the expression, increasing the
recall of critical objects, resulting in a significantly improved grounding
performance. Since Ref-NMS is agnostic to the grounding step, it can be easily
integrated into any state-of-the-art two-stage method. Extensive ablation
studies on several backbones, benchmarks, and tasks consistently demonstrate
the superiority of Ref-NMS. Codes are available at:
https://github.com/ChopinSharp/ref-nms.
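To make the first-stage idea concrete, below is a minimal sketch of expression-aware NMS in NumPy. It is not the authors' implementation (see the repository above for that): the function name `expression_aware_nms`, the placeholder `rel_scores` standing in for the output of the paper's lightweight alignment module, and the simple product fusion of detector confidence and relatedness are all illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of expression-aware NMS (illustrative only; not the code from
# https://github.com/ChopinSharp/ref-nms). The idea: instead of ranking
# proposals by detection confidence alone, combine the confidence with a
# per-box "relatedness to the expression" score, so boxes matching nouns in
# the expression (the "critical objects") survive suppression.

import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def expression_aware_nms(boxes, det_scores, rel_scores, iou_thresh=0.5):
    """Rank boxes by a fused score and greedily suppress overlaps.

    det_scores: expression-agnostic detector confidences.
    rel_scores: per-box relatedness to the referring expression; in Ref-NMS
        these would come from the lightweight module that aligns each box
        with the nouns in the expression. Here they are just an input array,
        and the product fusion is one simple (assumed) choice.
    """
    fused = det_scores * rel_scores
    order = np.argsort(-fused)
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        overlaps = iou(boxes[i], boxes[order[1:]])
        order = order[1:][overlaps <= iou_thresh]
    return keep

# Toy usage: boxes 0 and 1 overlap heavily; the detector prefers box 0, but
# the relatedness score flips the ranking so box 1 is the one kept.
boxes = np.array([[10, 10, 60, 60], [12, 12, 62, 62], [100, 100, 150, 150]], dtype=float)
det_scores = np.array([0.95, 0.90, 0.80])
rel_scores = np.array([0.20, 0.90, 0.85])   # placeholder module outputs
print(expression_aware_nms(boxes, det_scores, rel_scores))  # [1, 2]
```

Because only the ranking score fed to NMS changes, this kind of first-stage filtering leaves the grounding step untouched, which is what makes the approach easy to plug into existing two-stage methods.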
Related papers
- Joint Top-Down and Bottom-Up Frameworks for 3D Visual Grounding [28.55989894411032]
This paper tackles the challenging task of 3D visual grounding: locating a specific object in a 3D point cloud scene based on text descriptions.
Existing methods fall into two categories: top-down and bottom-up methods.
We propose a joint top-down and bottom-up framework, aiming to enhance the performance while improving the efficiency.
arXiv Detail & Related papers (2024-10-21T03:33:13Z)
- Detection-based Intermediate Supervision for Visual Question Answering [13.96848991623376]
We propose a generative detection framework to facilitate multiple grounding supervisions via sequence generation.
Our proposed DIS offers more comprehensive and accurate intermediate supervisions, thereby boosting answer prediction performance.
Extensive experiments demonstrate the superiority of our proposed DIS, showcasing both improved accuracy and state-of-the-art reasoning consistency.
arXiv Detail & Related papers (2023-12-26T11:45:22Z)
- Revisiting Proposal-based Object Detection [59.97295544455179]
We revisit the pipeline for detecting objects in images with proposals.
We solve a simple problem: regressing the area of intersection between each proposal and the ground truth.
Our revisited approach comes with minimal changes to the detection pipeline and can be plugged into any existing method.
arXiv Detail & Related papers (2023-11-30T12:40:23Z)
- ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object Detection [114.54835359657707]
ProposalContrast is an unsupervised point cloud pre-training framework.
It learns robust 3D representations by contrasting region proposals.
ProposalContrast is verified on various 3D detectors.
arXiv Detail & Related papers (2022-07-26T04:45:49Z)
- Plug-and-Play Few-shot Object Detection with Meta Strategy and Explicit Localization Inference [78.41932738265345]
This paper proposes a plug detector that can accurately detect objects of novel categories without a fine-tuning process.
We introduce two explicit inferences into the localization process to reduce its dependence on annotated data.
It shows a significant lead in efficiency, precision, and recall under varied evaluation protocols.
arXiv Detail & Related papers (2021-10-26T03:09:57Z)
- Contrastive Proposal Extension with LSTM Network for Weakly Supervised Object Detection [52.86681130880647]
Weakly supervised object detection (WSOD) has attracted more and more attention since it only uses image-level labels and can save huge annotation costs.
We propose a new method that optimizes the initial proposals by comparing them with their extended counterparts.
Experiments on the PASCAL VOC 2007, VOC 2012, and MS-COCO datasets show that our method achieves state-of-the-art results.
arXiv Detail & Related papers (2021-10-14T16:31:57Z)
- Natural Language Video Localization with Learnable Moment Proposals [40.91060659795612]
We propose a novel model termed LPNet (Learnable Proposal Network for NLVL) with a fixed set of learnable moment proposals.
In this paper, we demonstrate the effectiveness of LPNet over existing state-of-the-art methods.
arXiv Detail & Related papers (2021-09-22T12:18:58Z)
- VL-NMS: Breaking Proposal Bottlenecks in Two-Stage Visual-Language Matching [75.71523183166799]
The prevailing framework for matching multimodal inputs is based on a two-stage process.
We argue that these methods overlook an obvious mismatch between the roles of proposals in the two stages.
We propose VL-NMS, which is the first method to yield query-aware proposals at the first stage.
arXiv Detail & Related papers (2021-05-12T13:05:25Z)