Where Do We Go from Here? Multi-scale Allocentric Relational Inference
from Natural Spatial Descriptions
- URL: http://arxiv.org/abs/2402.16364v1
- Date: Mon, 26 Feb 2024 07:33:28 GMT
- Title: Where Do We Go from Here? Multi-scale Allocentric Relational Inference
from Natural Spatial Descriptions
- Authors: Tzuf Paz-Argaman, Sayali Kulkarni, John Palowitch, Jason Baldridge,
and Reut Tsarfaty
- Abstract summary: This paper introduces the Rendezvous (RVS) task and dataset, which includes 10,404 examples of English geospatial instructions for reaching a target location using map-knowledge.
Our analysis reveals that RVS exhibits a richer use of spatial allocentric relations, and requires resolving more spatial relations simultaneously compared to previous text-based navigation benchmarks.
- Score: 18.736071151303726
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: When communicating routes in natural language, the concept of acquired
spatial knowledge is crucial for geographic information retrieval (GIR) and in
spatial cognitive research. However, NLP navigation studies often overlook the
impact of such acquired knowledge on textual descriptions. Current navigation
studies concentrate on egocentric local descriptions (e.g., 'it will be on your
right') that require reasoning over the agent's local perception. These
instructions are typically given as a sequence of steps, with each action-step
explicitly mentioning and being followed by a landmark that the agent can use
to verify they are on the right path (e.g., 'turn right and then you will
see...'). In contrast, descriptions based on knowledge acquired through a map
provide a complete view of the environment and capture its overall structure.
These instructions (e.g., 'it is south of Central Park and a block north of a
police station') are typically non-sequential and contain allocentric relations,
multiple spatial relations, and implicit actions, without any explicit
verification. This paper introduces the Rendezvous (RVS) task and dataset,
which includes 10,404 examples of English geospatial instructions for reaching
a target location using map-knowledge. Our analysis reveals that RVS exhibits a
richer use of spatial allocentric relations, and requires resolving more
spatial relations simultaneously compared to previous text-based navigation
benchmarks.
Related papers
- Into the Unknown: Generating Geospatial Descriptions for New Environments [18.736071151303726]
Rendezvous task requires reasoning over allocentric spatial relationships.
Using open-source descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text.
We propose a large-scale augmentation method for generating high-quality synthetic data for new environments.
arXiv Detail & Related papers (2024-06-28T14:56:21Z)
- Hierarchical Spatial Proximity Reasoning for Vision-and-Language Navigation [1.2473780585666772]
Most Vision-and-Language Navigation (VLN) algorithms tend to make decision errors, primarily due to a lack of visual common sense and insufficient reasoning capabilities.
This paper proposes a Hierarchical Spatial Proximity Reasoning (HSPR) model to address this issue.
We conduct experiments on publicly available datasets including REVERIE, SOON, R2R, and R4R to validate the effectiveness of the proposed approach.
arXiv Detail & Related papers (2024-03-18T07:51:22Z)
- GeoGLUE: A GeoGraphic Language Understanding Evaluation Benchmark [56.08664336835741]
We propose a GeoGraphic Language Understanding Evaluation benchmark, named GeoGLUE.
We collect data from open-released geographic resources and introduce six natural language understanding tasks.
We provide evaluation experiments and analysis of general baselines, indicating the effectiveness and significance of the GeoGLUE benchmark.
arXiv Detail & Related papers (2023-05-11T03:21:56Z)
- KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation [61.08389704326803]
Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location by following natural language instructions in real scenes.
Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates.
We propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability.
arXiv Detail & Related papers (2023-03-28T08:00:46Z)
- Weakly-Supervised Multi-Granularity Map Learning for Vision-and-Language Navigation [87.52136927091712]
We address a practical yet challenging problem of training robot agents to navigate in an environment following a path described by some language instructions.
To achieve accurate and efficient navigation, it is critical to build a map that accurately represents both spatial location and the semantic information of the environment objects.
We propose a multi-granularity map, which contains both object fine-grained details (e.g., color, texture) and semantic classes, to represent objects more comprehensively.
arXiv Detail & Related papers (2022-10-14T04:23:27Z)
- What is Right for Me is Not Yet Right for You: A Dataset for Grounding Relative Directions via Multi-Task Learning [16.538887534958555]
We investigate the problem of grounding relative directions with end-to-end neural networks.
GRiD-3D is a novel dataset that features relative directions and complements existing visual question answering (VQA) datasets.
We discover that those subtasks are learned in an order that reflects the steps of an intuitive pipeline for processing relative directions.
arXiv Detail & Related papers (2022-05-05T14:25:46Z)
- SOON: Scenario Oriented Object Navigation with Graph-based Exploration [102.74649829684617]
The ability to navigate like a human towards a language-guided target from anywhere in a 3D embodied environment is one of the 'holy grail' goals of intelligent robots.
Most visual navigation benchmarks focus on navigating toward a target from a fixed starting point, guided by an elaborate set of instructions that depicts step-by-step.
This approach deviates from real-world problems in which a human only describes what the object and its surroundings look like and asks the robot to start navigation from anywhere.
arXiv Detail & Related papers (2021-03-31T15:01:04Z)
- SIRI: Spatial Relation Induced Network For Spatial Description Resolution [64.38872296406211]
We propose a novel spatial relationship induced (SIRI) network for language-guided localization.
We show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius.
Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
arXiv Detail & Related papers (2020-10-27T14:04:05Z)
- From Topic Networks to Distributed Cognitive Maps: Zipfian Topic Universes in the Area of Volunteered Geographic Information [59.0235296929395]
We investigate how language encodes and networks geographic information on the aboutness level of texts.
Our study shows a Zipfian organization of the thematic universe in which geographical places are located in online communication.
Places, whether close to each other or not, are located in neighboring places that span similar subnetworks in the topic universe.
arXiv Detail & Related papers (2020-02-04T18:31:25Z)