SIRI: Spatial Relation Induced Network For Spatial Description
Resolution
- URL: http://arxiv.org/abs/2010.14301v1
- Date: Tue, 27 Oct 2020 14:04:05 GMT
- Title: SIRI: Spatial Relation Induced Network For Spatial Description
Resolution
- Authors: Peiyao Wang, Weixin Luo, Yanyu Xu, Haojie Li, Shugong Xu, Jianyu Yang,
Shenghua Gao
- Abstract summary: We propose a novel spatial relationship induced (SIRI) network for language-guided localization.
We show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured within an 80-pixel radius.
Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
- Score: 64.38872296406211
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spatial Description Resolution, as a language-guided localization
task, is proposed for locating a target in a panoramic street view, given
corresponding language descriptions. Explicitly characterizing object-level
relationships while distilling spatial relationships is currently absent from
existing methods, yet crucial to this task. Mimicking humans, who sequentially
traverse spatial relationship words and objects with a first-person view to
locate their target, we propose a novel spatial relationship induced (SIRI)
network. Specifically, visual features are first correlated at an implicit
object level in a projected latent space; they are then distilled by each
spatial relationship word, yielding a differently activated feature for each
spatial relationship. Further, we introduce global position priors to
compensate for the absence of positional information, which would otherwise
cause ambiguities in global positional reasoning. Both the linguistic and
visual features are concatenated to finalize the target localization.
Experimental results on the Touchdown dataset show that our method is around
24% better than the state-of-the-art method in terms of accuracy, measured
within an 80-pixel radius. Our method also generalizes well on our proposed
extended dataset, collected under the same settings as Touchdown.
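To make the pipeline above concrete, here is a minimal sketch of a SIRI-style forward pass, together with the 80-pixel accuracy metric used on Touchdown. This is an illustration under stated assumptions, not the authors' code: the module names, feature dimensions, spatial-word vocabulary, and the use of self-similarity attention for the implicit object-level correlation are all hypothetical choices.

```python
import torch
import torch.nn as nn

# Hypothetical spatial-word vocabulary; the real set comes from the dataset.
SPATIAL_WORDS = ["left", "right", "front", "behind", "above", "below"]

class SIRISketch(nn.Module):
    """Toy SIRI-style network: correlate, distill per spatial word,
    add global position priors, fuse with language, predict a heatmap."""

    def __init__(self, vis_dim=256, txt_dim=256, latent_dim=128):
        super().__init__()
        # Project visual features into a latent space for object-level correlation.
        self.project = nn.Conv2d(vis_dim, latent_dim, kernel_size=1)
        # One distillation filter per spatial relationship word, so each word
        # produces a differently activated feature map.
        self.distill = nn.ModuleDict({
            w: nn.Conv2d(latent_dim, latent_dim, kernel_size=3, padding=1)
            for w in SPATIAL_WORDS
        })
        # Fuse per-word features, the sentence embedding, and 2 coordinate
        # channels (the global position priors) into a localization heatmap.
        fused_dim = latent_dim * len(SPATIAL_WORDS) + txt_dim + 2
        self.head = nn.Conv2d(fused_dim, 1, kernel_size=1)

    def forward(self, vis_feat, txt_feat):
        # vis_feat: (B, vis_dim, H, W); txt_feat: (B, txt_dim)
        b, _, h, w = vis_feat.shape
        z = self.project(vis_feat)

        # Implicit object-level correlation, sketched here as spatial
        # self-similarity attention over the latent features.
        flat = z.flatten(2)                                    # (B, C, HW)
        attn = torch.softmax(flat.transpose(1, 2) @ flat, -1)  # (B, HW, HW)
        z = (flat @ attn.transpose(1, 2)).view(b, -1, h, w)

        # Distill the correlated features by each spatial relationship word.
        per_word = [self.distill[wd](z) for wd in SPATIAL_WORDS]

        # Global position priors: normalized coordinate maps restore the
        # absolute positional information that convolutions alone discard.
        ys = torch.linspace(-1, 1, h, device=vis_feat.device)
        xs = torch.linspace(-1, 1, w, device=vis_feat.device)
        ys = ys.view(1, 1, h, 1).expand(b, 1, h, w)
        xs = xs.view(1, 1, 1, w).expand(b, 1, h, w)

        # Concatenate linguistic and visual features to finalize localization.
        txt = txt_feat[:, :, None, None].expand(-1, -1, h, w)
        fused = torch.cat(per_word + [txt, ys, xs], dim=1)
        return self.head(fused)                                # (B, 1, H, W)

def accuracy_at_radius(pred_xy, gt_xy, radius=80.0):
    """Fraction of predicted pixels within `radius` pixels of the target,
    i.e. the 80-pixel accuracy reported on Touchdown SDR."""
    dists = torch.linalg.norm(pred_xy.float() - gt_xy.float(), dim=-1)
    return (dists <= radius).float().mean().item()
```

For example, `SIRISketch()(torch.randn(1, 256, 32, 64), torch.randn(1, 256))` returns a (1, 1, 32, 64) heatmap whose argmax is the predicted target pixel. The coordinate channels matter because convolutions are translation-equivariant and cannot otherwise tell the left side of the panorama from the right, which is the ambiguity the abstract's global position priors address.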
Related papers
- CurriculumLoc: Enhancing Cross-Domain Geolocalization through Multi-Stage Refinement [11.108860387261508]
Visual geolocalization is a cost-effective and scalable task that involves matching one or more query images, taken at some unknown location, to a set of geo-tagged reference images.
We develop CurriculumLoc, a novel keypoint detection and description method with global semantic awareness and local geometric verification.
We achieve new best recall@1 scores of 62.6% and 94.5% on ALTO under two different distance metrics, respectively.
arXiv Detail & Related papers (2023-11-20T08:40:01Z)
- LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts [107.11267074981905]
We propose a semantically controllable layout-AWare diffusion model, termed LAW-Diffusion.
We show that LAW-Diffusion yields state-of-the-art generative performance, especially with coherent object relations.
arXiv Detail & Related papers (2023-08-13T08:06:18Z)
- Semantic-guided modeling of spatial relation and object co-occurrence for indoor scene recognition [5.083140094792973]
SpaCoNet simultaneously models Spatial relation and Co-occurrence of objects guided by semantic segmentation.
Experimental results on three widely used scene datasets demonstrate the effectiveness and generality of the proposed method.
arXiv Detail & Related papers (2023-05-22T03:04:22Z)
- Position-Aware Contrastive Alignment for Referring Image Segmentation [65.16214741785633]
We present a position-aware contrastive alignment network (PCAN) to enhance the alignment of multi-modal features.
Our PCAN consists of two modules: 1) Position Aware Module (PAM), which provides position information of all objects related to natural language descriptions, and 2) Contrastive Language Understanding Module (CLUM), which enhances multi-modal alignment.
arXiv Detail & Related papers (2022-12-27T09:13:19Z)
- DenseGAP: Graph-Structured Dense Correspondence Learning with Anchor Points [15.953570826460869]
Establishing dense correspondence between two images is a fundamental computer vision problem.
We introduce DenseGAP, a new solution for efficient Dense correspondence learning with a Graph-structured neural network conditioned on Anchor Points.
Our method advances the state-of-the-art of correspondence learning on most benchmarks.
arXiv Detail & Related papers (2021-12-13T18:59:30Z)
- Spatial Language Understanding for Object Search in Partially Observed Cityscale Environments [21.528770932332474]
We introduce the spatial language observation space and formulate a model under the framework of a Partially Observable Markov Decision Process (POMDP).
We propose a convolutional neural network model that learns to predict the language provider's relative frame of reference (FoR) given environment context.
We demonstrate the generalizability of our FoR prediction model and object search system through cross-validation over areas of five cities, each with a 40,000 m² footprint.
arXiv Detail & Related papers (2020-12-04T16:27:59Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- Understanding Spatial Relations through Multiple Modalities [78.07328342973611]
Spatial relations between objects can be either explicit, expressed as spatial prepositions, or implicit, expressed by spatial verbs such as moving, walking, or shifting.
We introduce the task of inferring implicit and explicit spatial relations between two entities in an image.
We design a model that uses both textual and visual information to predict the spatial relations, making use of both positional and size information of objects and image embeddings.
arXiv Detail & Related papers (2020-07-19T01:35:08Z)
- Local-Global Video-Text Interactions for Temporal Grounding [77.5114709695216]
This paper addresses the problem of text-to-video temporal grounding, which aims to identify the time interval in a video semantically relevant to a text query.
We tackle this problem using a novel regression-based model that learns to extract a collection of mid-level features for semantic phrases in a text query.
The proposed method effectively predicts the target time interval by exploiting contextual information from local to global.
arXiv Detail & Related papers (2020-04-16T08:10:41Z)
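To ground the regression-based formulation in this last entry, the sketch below maps a fused video-text feature to a normalized (start, end) interval and scores it with temporal IoU, the standard evaluation for this task. The pooled-feature input, dimensions, and head design are assumptions for illustration; the actual paper aggregates phrase-level mid-level features from local to global rather than regressing from a single vector.

```python
import torch
import torch.nn as nn

class IntervalRegressor(nn.Module):
    """Toy regression head for text-to-video temporal grounding."""

    def __init__(self, feat_dim=512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, 2), nn.Sigmoid(),  # (start, end) in [0, 1]
        )

    def forward(self, fused_feat):
        # fused_feat: (B, feat_dim) pooled video-text representation.
        s, e = self.head(fused_feat).unbind(-1)
        # Order the endpoints so the interval is always valid.
        return torch.stack([torch.minimum(s, e), torch.maximum(s, e)], -1)

def temporal_iou(pred, gt):
    """Intersection over union of (start, end) intervals, shape (..., 2)."""
    inter = (torch.minimum(pred[..., 1], gt[..., 1])
             - torch.maximum(pred[..., 0], gt[..., 0])).clamp(min=0)
    union = (pred[..., 1] - pred[..., 0]) + (gt[..., 1] - gt[..., 0]) - inter
    return inter / union.clamp(min=1e-6)
```

A training loop would regress these outputs against annotated intervals, typically with an L1 term plus an IoU-based term, and evaluation would report recall at IoU thresholds such as 0.5 and 0.7.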