Are We There Yet? Learning to Localize in Embodied Instruction Following
- URL: http://arxiv.org/abs/2101.03431v1
- Date: Sat, 9 Jan 2021 21:49:41 GMT
- Title: Are We There Yet? Learning to Localize in Embodied Instruction Following
- Authors: Shane Storks, Qiaozi Gao, Govind Thattai, Gokhan Tur
- Abstract summary: Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem.
Key challenges for this task include localizing target locations and navigating to them through visual inputs.
We augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep.
- Score: 1.7300690315775575
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Embodied instruction following is a challenging problem requiring an agent to
infer a sequence of primitive actions to achieve a goal environment state from
complex language and visual inputs. Action Learning From Realistic Environments
and Directives (ALFRED) is a recently proposed benchmark for this problem
consisting of step-by-step natural language instructions to achieve subgoals
which compose to an ultimate high-level goal. Key challenges for this task
include localizing target locations and navigating to them through visual
inputs, and grounding language instructions to the visual appearance of objects. To
address these challenges, in this study, we augment the agent's field of view
during navigation subgoals with multiple viewing angles, and train the agent to
predict its relative spatial relation to the target location at each timestep.
We also improve language grounding by introducing a pre-trained object
detection module to the model pipeline. Empirical studies show that our
approach exceeds the baseline model performance.
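The approach described above lends itself to a compact sketch: encode several viewing angles at each navigation timestep, fuse them with features from a pre-trained object detector, and train an auxiliary head to predict the agent's relative spatial relation to the target alongside the action. The class names, feature sizes, relation classes, and loss weighting below are illustrative assumptions, not the paper's exact architecture.

```python
# Hypothetical sketch: panoramic observation encoding with an auxiliary
# spatial-relation head. Names and dimensions are assumptions for
# illustration, not the paper's exact architecture.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PanoramicNavigator(nn.Module):
    """Encodes several viewing angles per timestep plus detector features,
    and predicts both the next action and the relative spatial relation
    to the target (e.g. left / right / ahead / behind / arrived)."""

    def __init__(self, num_views=4, feat_dim=512, det_dim=256,
                 num_actions=12, num_relations=5):
        super().__init__()
        # Shared visual encoder applied to each viewing angle.
        self.view_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # Fuse the concatenated per-view features with features from a
        # (frozen) pre-trained object detector used for language grounding.
        self.fusion = nn.Linear(num_views * feat_dim + det_dim, feat_dim)
        self.action_head = nn.Linear(feat_dim, num_actions)
        # Auxiliary head: relative spatial relation to the target location.
        self.relation_head = nn.Linear(feat_dim, num_relations)

    def forward(self, views, det_feats):
        # views: (batch, num_views, 3, H, W); det_feats: (batch, det_dim)
        b, v, c, h, w = views.shape
        per_view = self.view_encoder(views.reshape(b * v, c, h, w)).reshape(b, -1)
        fused = torch.relu(self.fusion(torch.cat([per_view, det_feats], dim=-1)))
        return self.action_head(fused), self.relation_head(fused)


def navigation_loss(action_logits, relation_logits,
                    expert_action, relation_label, aux_weight=0.5):
    # Imitation loss on the expert action plus the auxiliary
    # spatial-relation loss computed at every navigation timestep.
    return (F.cross_entropy(action_logits, expert_action)
            + aux_weight * F.cross_entropy(relation_logits, relation_label))
```

In an ALFRED-style setup, the relation labels for the auxiliary head could be derived from the ground-truth agent and target poses available at training time.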
Related papers
- ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
A key issue is the difficulty in smoothly connecting individual entities in low-level observations with abstract concepts required for planning.
We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- Visual Grounding for Object-Level Generalization in Reinforcement Learning [35.39214541324909]
Generalization is a pivotal challenge for agents following natural language instructions.
We leverage a vision-language model (VLM) for visual grounding and transfer its vision-language knowledge into reinforcement learning.
We show that our intrinsic reward significantly improves performance on challenging skill learning.
arXiv Detail & Related papers (2024-08-04T06:34:24Z)
- Embodied Instruction Following in Unknown Environments [66.60163202450954]
We propose an embodied instruction following (EIF) method for complex tasks in unknown environments.
We build a hierarchical embodied instruction following framework including the high-level task planner and the low-level exploration controller.
For the task planner, we generate feasible step-by-step plans for accomplishing the human's goal according to the task completion process and the known visual clues.
arXiv Detail & Related papers (2024-06-17T17:55:40Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize the active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- How To Not Train Your Dragon: Training-free Embodied Object Goal Navigation with Semantic Frontiers [94.46825166907831]
We present a training-free solution to tackle the object goal navigation problem in Embodied AI.
Our method builds a structured scene representation based on the classic visual simultaneous localization and mapping (V-SLAM) framework.
Our method propagates semantics on the scene graphs based on language priors and scene statistics to introduce semantic knowledge to the geometric frontiers.
arXiv Detail & Related papers (2023-05-26T13:38:33Z)
- ARNOLD: A Benchmark for Language-Grounded Task Learning With Continuous States in Realistic 3D Scenes [72.83187997344406]
ARNOLD is a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes.
ARNOLD comprises 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals.
arXiv Detail & Related papers (2023-04-09T21:42:57Z)
- Structured Exploration Through Instruction Enhancement for Object Navigation [0.0]
We propose a hierarchical learning-based method for object navigation.
The top level is capable of high-level planning and of building a memory at the floorplan level.
We demonstrate the effectiveness of our method in a dynamic domestic environment.
arXiv Detail & Related papers (2022-11-15T19:39:22Z)
- Compositional Generalization in Grounded Language Learning via Induced Model Sparsity [81.38804205212425]
We consider simple language-conditioned navigation problems in a grid world environment with disentangled observations.
We design an agent that encourages sparse correlations between words in the instruction and attributes of objects, composing them together to find the goal.
Our agent maintains a high level of performance on goals containing novel combinations of properties even when learning from a handful of demonstrations.
arXiv Detail & Related papers (2022-07-06T08:46:27Z)
- Learning to Map for Active Semantic Goal Navigation [40.193928212509356]
We propose a novel framework that actively learns to generate semantic maps outside the field of view of the agent.
We show how different objectives can be defined by balancing exploration with exploitation.
Our method is validated in the visually realistic environments offered by the Matterport3D dataset.
arXiv Detail & Related papers (2021-06-29T18:01:30Z)
- Object Goal Navigation using Goal-Oriented Semantic Exploration [98.14078233526476]
This work studies the problem of object goal navigation which involves navigating to an instance of the given object category in unseen environments.
We propose a modular system called 'Goal-Oriented Semantic Exploration', which builds an episodic semantic map and uses it to explore the environment efficiently (a generic map-update sketch follows this list).
arXiv Detail & Related papers (2020-07-01T17:52:32Z)
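Since several entries above revolve around building and querying semantic maps, the sketch below shows one generic way an episodic top-down semantic map can be accumulated from depth and per-pixel semantic labels. The camera intrinsics, grid resolution, and height filtering are illustrative assumptions, not the implementation of any paper listed here.

```python
# Hypothetical sketch: accumulating a top-down episodic semantic map from
# depth and per-pixel semantic labels. Intrinsics, grid resolution, and
# coordinate conventions are illustrative assumptions.
import numpy as np


def update_semantic_map(sem_map, depth, labels, fx, fy, cx, cy, cell_size=0.05):
    """Project per-pixel semantic labels into an agent-centric top-down grid.

    sem_map: (num_classes, map_h, map_w) array of per-class hit counts.
    depth:   (h, w) metric depth in meters.
    labels:  (h, w) integer semantic class per pixel.
    fx, fy, cx, cy: pinhole camera intrinsics.
    """
    h, w = depth.shape
    us, vs = np.meshgrid(np.arange(w), np.arange(h))
    # Back-project pixels to camera coordinates (x right, y down, z forward).
    z = depth
    x = (us - cx) * z / fx
    y = (vs - cy) * z / fy
    # Keep valid points roughly within the agent's vertical extent.
    valid = (z > 0) & (np.abs(y) < 1.0)
    map_h, map_w = sem_map.shape[1:]
    gx = (x[valid] / cell_size + map_w // 2).astype(int)
    gz = (z[valid] / cell_size).astype(int)
    cls = labels[valid]
    inside = (gx >= 0) & (gx < map_w) & (gz >= 0) & (gz < map_h)
    # Accumulate class evidence; a downstream policy can threshold these counts.
    np.add.at(sem_map, (cls[inside], gz[inside], gx[inside]), 1)
    return sem_map
```

A goal-oriented exploration policy in the spirit of the last entry would then pick frontiers or cells in this map associated with the goal object category.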