Suspected Object Matters: Rethinking Model's Prediction for One-stage
Visual Grounding
- URL: http://arxiv.org/abs/2203.05186v2
- Date: Mon, 21 Aug 2023 10:31:12 GMT
- Title: Suspected Object Matters: Rethinking Model's Prediction for One-stage
Visual Grounding
- Authors: Yang Jiao, Zequn Jie, Jingjing Chen, Lin Ma, Yu-Gang Jiang
- Abstract summary: We propose a Suspected Object Transformation mechanism (SOT) to encourage the target object selection among the suspected ones.
SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders.
Extensive experiments demonstrate the effectiveness of our proposed method.
- Score: 93.82542533426766
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, one-stage visual grounders have attracted considerable attention due to their
comparable accuracy but significantly higher efficiency than two-stage
grounders. However, inter-object relation modeling has not been well studied
for one-stage grounders. Inter-object relationship modeling, though important,
is not necessarily performed among all objects, as only part of them are
related to the text query and may confuse the model. We call these objects
suspected objects. However, exploring their relationships in the one-stage
paradigm is non-trivial because: First, no object proposals are available as
the basis on which to select suspected objects and perform relationship
modeling. Second, suspected objects are more confusing than others, as they may
share similar semantics, be entangled with certain relationships, etc., and
thereby more easily mislead the model prediction. Toward this end, we propose a
Suspected Object Transformation mechanism (SOT), which can be seamlessly
integrated into existing CNN and Transformer-based one-stage visual grounders
to encourage the target object selection among the suspected ones. Suspected
objects are dynamically discovered from a learned activation map adapted to the
model's current discrimination ability during training. Afterward, on top of
suspected objects, a Keyword-Aware Discrimination module (KAD) and an
Exploration by Random Connection strategy (ERC) are concurrently proposed to
help the model rethink its initial prediction. On the one hand, KAD leverages
keywords that contribute strongly to suspected object discrimination. On the other
hand, ERC allows the model to seek the correct object instead of being trapped
in a situation that always exploits the current false prediction. Extensive
experiments demonstrate the effectiveness of our proposed method.
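The abstract says suspected objects are discovered from a learned activation map but does not give the selection rule. The following is only an illustrative sketch under a stated assumption: that suspected locations are taken as the top-k local maxima of a 2D activation map above a threshold. The function name `suspected_peaks` and the parameters `k` and `threshold` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple


def suspected_peaks(
    act: List[List[float]], k: int = 3, threshold: float = 0.5
) -> List[Tuple[int, int, float]]:
    """Return up to k local maxima as (row, col, score), highest score first.

    Assumption (not from the paper): a cell is a candidate "suspected"
    location if its score is >= threshold and strictly greater than all
    of its 8-connected neighbours.
    """
    h, w = len(act), len(act[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = act[r][c]
            if v < threshold:
                continue
            # Collect the values of the neighbouring cells inside the map.
            neighbours = [
                act[rr][cc]
                for rr in range(max(0, r - 1), min(h, r + 2))
                for cc in range(max(0, c - 1), min(w, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(v > n for n in neighbours):
                peaks.append((r, c, v))
    # Keep only the k strongest peaks.
    peaks.sort(key=lambda p: p[2], reverse=True)
    return peaks[:k]


# Toy activation map with two clear peaks and one weaker neighbour.
activation = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.7],
    [0.0, 0.1, 0.6, 0.2],
]
peaks = suspected_peaks(activation, k=2)
```

In this toy map the cell with score 0.6 is not selected because its neighbour at 0.7 is higher; a real grounder would presumably operate on learned feature maps rather than hand-written lists.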
Related papers
- DeTra: A Unified Model for Object Detection and Trajectory Forecasting [68.85128937305697]
Our approach formulates the union of the two tasks as a trajectory refinement problem.
To tackle this unified task, we design a refinement transformer that infers the presence, pose, and multi-modal future behaviors of objects.
In our experiments, we observe that our model outperforms the state-of-the-art on Argoverse 2 Sensor and Open dataset.
arXiv Detail & Related papers (2024-06-06T18:12:04Z)
- H-SAUR: Hypothesize, Simulate, Act, Update, and Repeat for Understanding
Object Articulations from Interactions [62.510951695174604]
"Hypothesize, Simulate, Act, Update, and Repeat" (H-SAUR) is a probabilistic generative framework that generates hypotheses about how objects articulate given input observations.
We show that the proposed model significantly outperforms the current state-of-the-art articulated object manipulation framework.
We further improve the test-time efficiency of H-SAUR by integrating a learned prior from learning-based vision models.
arXiv Detail & Related papers (2022-10-22T18:39:33Z)
- VReBERT: A Simple and Flexible Transformer for Visual Relationship
Detection [0.30458514384586394]
We propose a BERT-like transformer model for Visual Relationship Detection with a multi-stage training strategy.
We show that our simple BERT-like model is able to outperform the state-of-the-art VRD models in predicate prediction.
arXiv Detail & Related papers (2022-06-18T04:08:19Z)
- SOS! Self-supervised Learning Over Sets Of Handled Objects In Egocentric
Action Recognition [35.4163266882568]
We introduce Self-Supervised Learning Over Sets (SOS) to pre-train a generic Objects In Contact (OIC) representation model.
Our OIC significantly boosts the performance of multiple state-of-the-art video classification models.
arXiv Detail & Related papers (2022-04-10T23:27:19Z)
- KINet: Unsupervised Forward Models for Robotic Pushing Manipulation [8.572983995175909]
We introduce KINet -- an unsupervised framework to reason about object interactions based on a keypoint representation.
Our model learns to associate objects with keypoint coordinates and discovers a graph representation of the system.
By learning to perform physical reasoning in the keypoint space, our model automatically generalizes to scenarios with a different number of objects.
arXiv Detail & Related papers (2022-02-18T03:32:08Z)
- Detecting Human-Object Interactions with Object-Guided Cross-Modal
Calibrated Semantics [6.678312249123534]
We aim to boost end-to-end models with object-guided statistical priors.
We propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy.
Combined, the above modules compose the Object-guided Cross-modal Network (OCN).
arXiv Detail & Related papers (2022-02-01T07:39:04Z)
- Weakly-Supervised Video Object Grounding via Causal Intervention [82.68192973503119]
We target the task of weakly-supervised video object grounding (WSVOG), where only video-sentence annotations are available during model learning.
It aims to localize objects described in the sentence to visual regions in the video, which is a fundamental capability needed in pattern analysis and machine learning.
arXiv Detail & Related papers (2021-12-01T13:13:03Z)
- Instance-Level Relative Saliency Ranking with Graph Reasoning [126.09138829920627]
We present a novel unified model to segment salient instances and infer relative saliency rank order.
A novel loss function is also proposed to effectively train the saliency ranking branch.
Experimental results demonstrate that our proposed model is more effective than previous methods.
arXiv Detail & Related papers (2021-07-08T13:10:42Z)
- Synthesizing the Unseen for Zero-shot Object Detection [72.38031440014463]
We propose to synthesize visual features for unseen classes, so that the model learns both seen and unseen objects in the visual domain.
We use a novel generative model that uses class-semantics to not only generate the features but also to discriminatively separate them.
arXiv Detail & Related papers (2020-10-19T12:36:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.