Iterative Shrinking for Referring Expression Grounding Using Deep
Reinforcement Learning
- URL: http://arxiv.org/abs/2103.05187v1
- Date: Tue, 9 Mar 2021 02:36:45 GMT
- Title: Iterative Shrinking for Referring Expression Grounding Using Deep
Reinforcement Learning
- Authors: Mingjie Sun, Jimin Xiao, Eng Gee Lim
- Abstract summary: We are tackling the proposal-free referring expression grounding task, aiming at localizing the target object according to a query sentence.
Existing proposal-free methods employ a query-image matching branch to select the highest-score point in the image feature map as the target box center.
We propose an iterative shrinking mechanism to localize the target, where the shrinking direction is decided by a reinforcement learning agent.
- Score: 20.23920009396818
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we are tackling the proposal-free referring expression
grounding task, aiming at localizing the target object according to a query
sentence, without relying on off-the-shelf object proposals. Existing
proposal-free methods employ a query-image matching branch to select the
highest-score point in the image feature map as the target box center, with its
width and height predicted by another branch. Such methods, however, fail to
utilize the contextual relation between the target and reference objects, and
lack interpretability in their reasoning procedure. To solve these problems, we
propose an iterative shrinking mechanism to localize the target, where the
shrinking direction is decided by a reinforcement learning agent, with all
contents within the current image patch comprehensively considered. Besides, the
sequential shrinking process makes it possible to demonstrate the reasoning behind
how the target is iteratively found. Experiments show that the proposed method boosts
the accuracy by 4.32% against the previous state-of-the-art (SOTA) method on
the RefCOCOg dataset, where query sentences are long and complex, with many
targets referred to through other reference objects.
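To make the mechanism concrete, below is a minimal Python sketch of the inference loop described in the abstract. It assumes, purely for illustration, a five-way action space (shrink one of the four sides, or stop), a fixed shrink ratio, and placeholder policy and feature-extraction functions; the actual agent, state representation, and reward design are defined in the paper and not reproduced here.

import numpy as np

# Hypothetical action space for illustration: shrink one side of the
# current patch by a fixed ratio, or stop and return the patch as the box.
ACTIONS = ["shrink_left", "shrink_right", "shrink_top", "shrink_bottom", "stop"]
SHRINK_RATIO = 0.1   # fraction of the current patch removed per step (assumed)
MAX_STEPS = 30       # safety cap on the number of shrinking steps (assumed)

def iterative_shrink(image, query_feat, policy, extract_patch_feat):
    """Localize a target by iteratively shrinking an image patch.

    `policy(patch_feat, query_feat)` stands in for the trained RL agent:
    it scores the candidate actions given features of the current patch
    and of the query sentence. `extract_patch_feat` crops and encodes
    the patch. Both are placeholders, not the paper's actual networks.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = 0.0, 0.0, float(w), float(h)  # start from the full image

    for _ in range(MAX_STEPS):
        patch_feat = extract_patch_feat(image, (x1, y1, x2, y2))
        action = ACTIONS[int(np.argmax(policy(patch_feat, query_feat)))]

        if action == "stop":
            break
        dx, dy = SHRINK_RATIO * (x2 - x1), SHRINK_RATIO * (y2 - y1)
        if action == "shrink_left":
            x1 += dx
        elif action == "shrink_right":
            x2 -= dx
        elif action == "shrink_top":
            y1 += dy
        elif action == "shrink_bottom":
            y2 -= dy

    return x1, y1, x2, y2  # the remaining patch serves as the predicted box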
Related papers
- Boosting Weakly-Supervised Referring Image Segmentation via Progressive Comprehension [40.21084218601082]
This paper focuses on a challenging setup where target localization is learned directly from image-text pairs.
We propose a novel Progressive Comprehension Network (PCNet) to leverage target-related textual cues for progressively localizing the target object.
Our method outperforms SOTA methods on three common benchmarks.
arXiv Detail & Related papers (2024-10-02T13:30:32Z)
- Mutually-Aware Feature Learning for Few-Shot Object Counting [20.623402944601775]
Few-shot object counting has garnered significant attention for its practicality as it aims to count target objects in a query image based on given exemplars without the need for additional training.
We propose a novel framework, Mutually-Aware FEAture learning (MAFEA), which encodes query and exemplar features so that they are mutually aware of each other from the outset.
Our model reaches new state-of-the-art performance on two challenging benchmarks, FSCD-LVIS and FSC-147, while markedly reducing the target confusion problem.
arXiv Detail & Related papers (2024-08-19T06:46:24Z)
- Revisiting Proposal-based Object Detection [59.97295544455179]
We revisit the pipeline for detecting objects in images with proposals.
We solve a simple problem where we regress to the area of intersection between proposal and ground truth.
Our revisited approach comes with minimal changes to the detection pipeline and can be plugged into any existing method.
arXiv Detail & Related papers (2023-11-30T12:40:23Z)
- Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z)
- Guiding Computational Stance Detection with Expanded Stance Triangle Framework [25.2980607215715]
Stance detection determines whether the author of a piece of text is in favor of, against, or neutral towards a specified target.
We decompose the stance detection task from a linguistic perspective, and investigate key components and inference paths in this task.
arXiv Detail & Related papers (2023-05-31T13:33:29Z)
- Fusing Local Similarities for Retrieval-based 3D Orientation Estimation of Unseen Objects [70.49392581592089]
We tackle the task of estimating the 3D orientation of previously-unseen objects from monocular images.
We follow a retrieval-based strategy and prevent the network from learning object-specific features.
Our experiments on the LineMOD, LineMOD-Occluded, and T-LESS datasets show that our method yields a significantly better generalization to unseen objects than previous works.
arXiv Detail & Related papers (2022-03-16T08:53:00Z)
- Mixup-CAM: Weakly-supervised Semantic Segmentation via Uncertainty Regularization [73.03956876752868]
We propose a principled and end-to-end trainable framework that allows the network to pay attention to other parts of the object.
Specifically, we introduce the mixup data augmentation scheme into the classification network and design two uncertainty regularization terms to better interact with the mixup strategy.
arXiv Detail & Related papers (2020-08-03T21:19:08Z)
- Weakly-Supervised Semantic Segmentation via Sub-category Exploration [73.03956876752868]
We propose a simple yet effective approach to force the network to pay attention to other parts of an object.
Specifically, we perform clustering on image features to generate pseudo sub-category labels within each annotated parent class.
We conduct extensive analysis to validate the proposed method and show that our approach performs favorably against the state-of-the-art approaches.
arXiv Detail & Related papers (2020-08-03T20:48:31Z)
- Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation [85.96410825961966]
We argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries.
To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions.
We apply this proposed framework, called Point-Set Anchors, to object detection, instance segmentation, and human pose estimation.
arXiv Detail & Related papers (2020-07-06T15:59:56Z)
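As a rough illustration of the point-set idea in the Point-Set Anchors entry above, the following sketch places regression points along the borders of an anchor box instead of using only its center; the number and placement of points, and the helper name, are assumptions made for illustration rather than the paper's exact anchor templates.

import numpy as np

def point_set_anchor(cx, cy, w, h, points_per_side=3):
    """Illustrative point-set anchor: rather than a single central point,
    place a set of points along the borders of an anchor box (cx, cy, w, h).
    The point count and placement are assumed for this sketch only.
    """
    xs = np.linspace(cx - w / 2, cx + w / 2, points_per_side)
    ys = np.linspace(cy - h / 2, cy + h / 2, points_per_side)
    top    = [(x, cy - h / 2) for x in xs]
    bottom = [(x, cy + h / 2) for x in xs]
    left   = [(cx - w / 2, y) for y in ys[1:-1]]   # skip corners already covered
    right  = [(cx + w / 2, y) for y in ys[1:-1]]
    return np.array(top + bottom + left + right)

# A detection or pose head would then regress, for each anchor point, an
# offset to the nearest ground-truth boundary point or keypoint, so that
# predictions start from positions closer to the targets than the box center.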
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.