Spatial Language Understanding for Object Search in Partially Observed
Cityscale Environments
- URL: http://arxiv.org/abs/2012.02705v1
- Date: Fri, 4 Dec 2020 16:27:59 GMT
- Title: Spatial Language Understanding for Object Search in Partially Observed
Cityscale Environments
- Authors: Kaiyu Zheng, Deniz Bayazit, Rebecca Mathew, Ellie Pavlick, Stefanie
Tellex
- Abstract summary: We introduce the spatial language observation space and formulate a model under the framework of Partially Observable Markov Decision Process (POMDP)
We propose a convolutional neural network model that learns to predict the language provider's relative frame of reference (FoR) given environment context.
We demonstrate the generalizability of our FoR prediction model and object search system through cross-validation over areas of five cities, each with a 40,000m$^2$ footprint.
- Score: 21.528770932332474
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a system that enables robots to interpret spatial language as a
distribution over object locations for effective search in partially observable
cityscale environments. We introduce the spatial language observation space and
formulate a stochastic observation model under the framework of Partially
Observable Markov Decision Process (POMDP) which incorporates information
extracted from the spatial language into the robot's belief. To interpret
ambiguous, context-dependent prepositions (e.g., "front"), we propose a
convolutional neural network model that learns to predict the language
provider's relative frame of reference (FoR) given environment context. We
demonstrate the generalizability of our FoR prediction model and object search
system through cross-validation over areas of five cities, each with a
40,000m$^2$ footprint. End-to-end experiments in simulation show that our
system achieves faster search and higher success rate compared to a
keyword-based baseline without spatial preposition understanding.
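To make the belief-update idea concrete, below is a minimal illustrative sketch (not the authors' implementation) of how a spatial-language observation such as "the car is in front of the bookstore" could be converted into a likelihood over map cells under a predicted frame of reference, then folded into the robot's belief with a Bayes update. The grid size, landmark coordinates, FoR angle, spread parameter, and the alignment-with-distance-falloff form of the likelihood are all hypothetical placeholders, not the paper's learned observation model.

```python
# Illustrative sketch only: a hand-crafted "front of landmark" likelihood and a
# Bayes belief update over a 2D grid. All names and parameters are hypothetical.
import numpy as np

def front_likelihood(grid_shape, landmark_xy, for_angle_rad, spread=5.0):
    """Likelihood of each grid cell being 'in front of' a landmark.

    The predicted frame of reference (for_angle_rad) defines which direction
    counts as 'front'; cells are weighted by how well the vector from the
    landmark to the cell aligns with that direction, with a distance falloff.
    """
    h, w = grid_shape
    ys, xs = np.mgrid[0:h, 0:w]               # cell coordinates (row = y, col = x)
    dx = xs - landmark_xy[0]
    dy = ys - landmark_xy[1]
    dist = np.hypot(dx, dy) + 1e-9
    fx, fy = np.cos(for_angle_rad), np.sin(for_angle_rad)  # unit 'front' vector
    alignment = (dx * fx + dy * fy) / dist    # cosine of angle to 'front', in [-1, 1]
    lik = np.clip(alignment, 0.0, None) * np.exp(-dist / spread)
    return lik / lik.sum()

def belief_update(prior, likelihood):
    """Bayes update of a grid belief with a spatial-language likelihood."""
    posterior = prior * likelihood
    return posterior / posterior.sum()

# Usage: uniform prior over a 20x20 grid; landmark at (x=5, y=10); the FoR angle
# is fixed to 90 degrees here, standing in for a learned prediction.
prior = np.full((20, 20), 1.0 / 400)
lik = front_likelihood((20, 20), landmark_xy=(5, 10), for_angle_rad=np.pi / 2)
posterior = belief_update(prior, lik)
```

In the system described in the abstract, the frame of reference would instead come from the convolutional network's prediction given environment context, and the updated belief would guide the POMDP-based object search.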
Related papers
- Structured Spatial Reasoning with Open Vocabulary Object Detectors [2.089191490381739]
Reasoning about spatial relationships between objects is essential for many real-world robotic tasks.
We introduce a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors.
The approach is evaluated and compared against the zero-shot performance of state-of-the-art Vision and Language Models (VLMs) on spatial reasoning tasks.
arXiv Detail & Related papers (2024-10-09T19:37:01Z)
- Reframing Spatial Reasoning Evaluation in Language Models: A Real-World Simulation Benchmark for Qualitative Reasoning [4.422649561583363]
We present a novel benchmark for assessing spatial reasoning in language models (LMs).
It is grounded in realistic 3D simulation data, offering a series of diverse room layouts with various objects and their spatial relationships.
A key contribution is our logic-based consistency-checking tool, which enables the assessment of multiple plausible solutions.
arXiv Detail & Related papers (2024-05-23T21:22:00Z)
- Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning [73.0990339667978]
Navigation in unfamiliar environments presents a major challenge for robots.
We use language models to bias exploration of novel real-world environments.
We evaluate LFG in challenging real-world environments and simulated benchmarks.
arXiv Detail & Related papers (2023-10-16T06:21:06Z)
- Language-EXtended Indoor SLAM (LEXIS): A Versatile System for Real-time Visual Scene Understanding [0.0]
LEXIS is a real-time indoor Simultaneous Localization and Mapping system.
It harnesses the open-vocabulary nature of Large Language Models to create a unified approach to scene understanding and place recognition.
It successfully categorizes rooms with varying layouts and dimensions and outperforms the state-of-the-art (SOTA).
arXiv Detail & Related papers (2023-09-26T16:50:20Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- SIRI: Spatial Relation Induced Network For Spatial Description Resolution [64.38872296406211]
We propose a novel Spatial Relation Induced (SIRI) network for language-guided localization.
We show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius.
Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
arXiv Detail & Related papers (2020-10-27T14:04:05Z)
- Learning Universal Representations from Word to Sentence [89.82415322763475]
This work introduces and explores universal representation learning, i.e., embeddings of different levels of linguistic units in a uniform vector space.
We present our approach of constructing analogy datasets in terms of words, phrases and sentences.
We empirically verify that well pre-trained Transformer models, combined with appropriate training settings, can effectively yield universal representations.
arXiv Detail & Related papers (2020-09-10T03:53:18Z)
- From Spatial Relations to Spatial Configurations [64.21025426604274]
The spatial relation language is able to represent a large, comprehensive set of spatial concepts crucial for reasoning.
We show how we extend the capabilities of existing spatial representation languages with the fine-grained decomposition of semantics.
arXiv Detail & Related papers (2020-07-19T02:11:53Z)
- Benchmarking Unsupervised Object Representations for Video Sequences [111.81492107649889]
We compare the perceptual abilities of four object-centric approaches: ViMON, OP3, TBA and SCALOR.
Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking.
Our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
arXiv Detail & Related papers (2020-06-12T09:37:24Z)
- Robust and Interpretable Grounding of Spatial References with Relation Networks [40.42540299023808]
Learning representations of spatial references in natural language is a key challenge in tasks like autonomous navigation and robotic manipulation.
Recent work has investigated various neural architectures for learning multi-modal representations for spatial concepts.
We develop effective models for understanding spatial references in text that are robust and interpretable.
arXiv Detail & Related papers (2020-05-02T04:11:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.