Scene-Intuitive Agent for Remote Embodied Visual Grounding
- URL: http://arxiv.org/abs/2103.12944v1
- Date: Wed, 24 Mar 2021 02:37:48 GMT
- Title: Scene-Intuitive Agent for Remote Embodied Visual Grounding
- Authors: Xiangru Lin, Guanbin Li, Yizhou Yu
- Abstract summary: Humans learn from life events to form intuitions towards the understanding of visual environments and languages.
We present an agent that mimics such human behaviors.
- Score: 89.73786309180139
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans learn from life events to form intuitions towards the understanding of
visual environments and languages. Envision that you are instructed by a
high-level instruction, "Go to the bathroom in the master bedroom and replace
the blue towel on the left wall", what would you possibly do to carry out the
task? Intuitively, we comprehend the semantics of the instruction to form an
overview of where a bathroom is and what a blue towel is in mind; then, we
navigate to the target location by consistently matching the bathroom
appearance in mind with the current scene. In this paper, we present an agent
that mimics such human behaviors. Specifically, we focus on the Remote Embodied
Visual Referring Expression in Real Indoor Environments task, called REVERIE,
where an agent is asked to correctly localize a remote target object specified
by a concise high-level natural language instruction, and propose a two-stage
training pipeline. In the first stage, we pretrain the agent with two
cross-modal alignment sub-tasks, namely the Scene Grounding task and the Object
Grounding task. The agent learns where to stop in the Scene Grounding task and
what to attend to in the Object Grounding task respectively. Then, to generate
action sequences, we propose a memory-augmented attentive action decoder to
smoothly fuse the pre-trained vision and language representations with the
agent's past memory experiences. Without bells and whistles, experimental
results show that our method significantly outperforms the previous
state-of-the-art (SOTA), demonstrating its effectiveness.
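The paper does not include an implementation here, but a minimal sketch may make the decoder concrete. Assuming a PyTorch setting, the module below attends over a bank of past-step embeddings and fuses the result with the current vision-language state before scoring candidate actions; the names and hyperparameters (`MemoryAugmentedDecoder`, `d_model`, the single attention layer) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class MemoryAugmentedDecoder(nn.Module):
    """Illustrative sketch of a memory-augmented attentive action decoder:
    the current fused vision-language state queries an episodic memory of
    past-step embeddings, and candidate actions are scored against the
    result. Structure and sizes are assumptions, not the paper's code."""

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Cross-attention: the current state attends over past memory.
        self.mem_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fuse = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, state, memory, action_feats):
        # state: (B, d) fused vision-language embedding for the current step
        # memory: (B, T, d) embeddings of the T past steps
        # action_feats: (B, A, d) features of the A candidate actions
        q = state.unsqueeze(1)                          # (B, 1, d)
        mem_ctx, _ = self.mem_attn(q, memory, memory)   # (B, 1, d)
        h = self.fuse(torch.cat([q, mem_ctx], dim=-1))  # (B, 1, d)
        # Score each candidate action by dot product with the fused state.
        return torch.bmm(action_feats, h.transpose(1, 2)).squeeze(-1)  # (B, A)
```

At each step the newly chosen action's embedding would be appended to `memory`, which is what lets the decoder condition on the agent's past experiences.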
Related papers
- Visual Grounding for Object-Level Generalization in Reinforcement Learning [35.39214541324909]
Generalization is a pivotal challenge for agents following natural language instructions.
We leverage a vision-language model (VLM) for visual grounding and transfer its vision-language knowledge into reinforcement learning.
We show that our intrinsic reward significantly improves performance on challenging skill learning.
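As a concrete (but hypothetical) rendering of that idea, the sketch below scores each observation frame against the instruction with an off-the-shelf CLIP checkpoint and uses the similarity as an intrinsic reward; the specific model and the clipping of negative values are our assumptions, not necessarily the paper's recipe.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: any VLM with a joint image-text embedding space would do;
# an off-the-shelf CLIP checkpoint is used here purely for illustration.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def intrinsic_reward(frame: Image.Image, instruction: str) -> float:
    """Cosine similarity between the current frame and the instruction,
    used as a dense reward bonus for language-conditioned RL."""
    inputs = processor(text=[instruction], images=frame, return_tensors="pt")
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    sim = torch.cosine_similarity(img, txt).item()
    return max(sim, 0.0)  # clip negatives so the bonus never penalizes exploration
```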
arXiv Detail & Related papers (2024-08-04T06:34:24Z)
- DISCO: Embodied Navigation and Interaction via Differentiable Scene Semantics and Dual-level Control [53.80518003412016]
Building a general-purpose intelligent home-assistant agent skilled in diverse tasks specified by human commands is a long-term blueprint of embodied AI research.
We study primitive mobile manipulations for embodied agents, i.e. how to navigate and interact based on an instructed verb-noun pair.
We propose DISCO, which features non-trivial advancements in contextualized scene modeling and efficient controls.
arXiv Detail & Related papers (2024-07-20T05:39:28Z)
- Situated Instruction Following [87.37244711380411]
We propose situated instruction following, which embraces the inherent underspecification and ambiguity of real-world communication.
The meaning of situated instructions naturally unfolds through the past actions and the expected future behaviors of the human involved.
Our experiments indicate that state-of-the-art Embodied Instruction Following (EIF) models lack holistic understanding of situated human intention.
arXiv Detail & Related papers (2024-07-15T19:32:30Z)
- Embodied Instruction Following in Unknown Environments [66.60163202450954]
We propose an embodied instruction following (EIF) method for complex tasks in unknown environments.
We build a hierarchical embodied instruction following framework including the high-level task planner and the low-level exploration controller.
The task planner generates feasible step-by-step plans for accomplishing the human's goal, conditioned on the task completion progress and the visual clues observed so far.
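A minimal sketch of such a two-level loop follows; `Planner` and `Explorer` are hypothetical stand-ins for the paper's high-level task planner and low-level exploration controller, and replanning logic is omitted.

```python
from dataclasses import dataclass

@dataclass
class Step:
    action: str  # e.g. "navigate", "pick", "place"
    target: str  # the object or location the action refers to

class Planner:
    """High-level planner (hypothetical interface): proposes the next
    feasible step from the goal, the steps completed so far, and the
    visual clues (objects) observed so far."""
    def next_step(self, goal: str, done: list, seen: set):
        ...

class Explorer:
    """Low-level controller (hypothetical interface): executes one step,
    exploring the unknown environment when the target is not yet observed."""
    def execute(self, step: Step, seen: set) -> bool:
        ...

def follow_instruction(goal: str, planner: Planner, explorer: Explorer) -> bool:
    done, seen = [], set()
    while (step := planner.next_step(goal, done, seen)) is not None:
        if not explorer.execute(step, seen):
            return False  # failure handling / replanning omitted in this sketch
        done.append(step)
    return True
```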
arXiv Detail & Related papers (2024-06-17T17:55:40Z)
- ThinkBot: Embodied Instruction Following with Thought Chain Reasoning [66.09880459084901]
Embodied Instruction Following (EIF) requires agents to complete human instructions by interacting with objects in complex surrounding environments.
We propose ThinkBot that reasons the thought chain in human instruction to recover the missing action descriptions.
Our ThinkBot outperforms the state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency.
arXiv Detail & Related papers (2023-12-12T08:30:09Z)
- Distilling Internet-Scale Vision-Language Models into Embodied Agents [24.71298634838615]
We propose using pretrained vision-language models (VLMs) to supervise embodied agents.
We combine ideas from model distillation and hindsight experience replay (HER) to retroactively generate language describing the agent's behavior.
Our work outlines a new and effective way to use internet-scale VLMs, repurposing the generic language grounding acquired by such models to teach task-relevant groundings to embodied agents.
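In miniature, the relabeling step could look like the sketch below, where `caption` stands in for the pretrained VLM describing what the agent actually did:

```python
from typing import Callable, List, Tuple

# A trajectory is a list of (observation, action) pairs; observations are
# whatever the VLM captioner accepts (e.g. rendered frames).
Trajectory = List[Tuple[object, str]]

def hindsight_relabel(trajectory: Trajectory,
                      caption: Callable[[Trajectory], str]) -> Tuple[str, Trajectory]:
    """HER-style relabeling: whatever instruction the episode was collected
    under, pair the trajectory with a VLM-generated description of the
    behavior it actually exhibits, yielding a valid (goal, demonstration)
    training pair. `caption` is a hypothetical VLM interface."""
    return caption(trajectory), trajectory

# Usage sketch: the relabeled pairs supervise a language-conditioned policy,
# e.g. goal, demo = hindsight_relabel(episode, caption=vlm_describe)
```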
arXiv Detail & Related papers (2023-01-29T18:21:05Z)
- Layout-aware Dreamer for Embodied Referring Expression Grounding [49.33508853581283]
We study the problem of Embodied Referring Expression Grounding, where an agent needs to navigate in a previously unseen environment.
We have designed an autonomous agent called Layout-aware Dreamer (LAD).
LAD learns to infer the room category distribution of neighboring unexplored areas along the path for coarse layout estimation.
To learn an effective exploration of the environment, the Goal Dreamer imagines the destination beforehand.
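As a toy rendering of the coarse-layout idea, the head below maps the feature of each unexplored neighboring view to a distribution over room categories; the category set, feature size, and supervision are our assumptions, not LAD's actual design.

```python
import torch.nn as nn

# Toy room vocabulary; the real category set is an assumption here.
ROOM_CATEGORIES = ["bedroom", "bathroom", "kitchen", "living room", "hallway"]

class RoomCategoryHead(nn.Module):
    """Illustrative sketch: predict a room-category distribution for each
    unexplored neighboring view, supervised (e.g. with cross-entropy) by
    the true category of the room behind it."""

    def __init__(self, d_feat: int = 512):
        super().__init__()
        self.proj = nn.Linear(d_feat, len(ROOM_CATEGORIES))

    def forward(self, neighbor_feats):
        # neighbor_feats: (B, N, d), one feature per unexplored neighbor
        return self.proj(neighbor_feats).softmax(dim=-1)  # (B, N, rooms)
```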
arXiv Detail & Related papers (2022-11-30T23:36:17Z)
- Structured Exploration Through Instruction Enhancement for Object Navigation [0.0]
We propose a hierarchical learning-based method for object navigation.
The top level is capable of high-level planning and of building a memory at the floorplan level.
We demonstrate the effectiveness of our method on a dynamic domestic environment.
arXiv Detail & Related papers (2022-11-15T19:39:22Z)
- TIDEE: Tidying Up Novel Rooms using Visuo-Semantic Commonsense Priors [29.255373211228548]
TIDEE tidies up a disordered scene based on learned commonsense object placement and room arrangement priors.
TIDEE explores a home environment, detects objects that are out of their natural place, infers plausible object contexts for them, localizes such contexts in the current scene, and repositions the objects.
We test TIDEE on tidying up disorganized scenes in the AI2THOR simulation environment.
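A minimal sketch of the out-of-place test, assuming a learned object-room co-occurrence prior; the probability table and threshold below are toy values for illustration only.

```python
# Toy co-occurrence prior P(room | object); TIDEE learns such priors from
# data, whereas this table is hard-coded purely for illustration.
PRIOR = {
    "towel": {"bathroom": 0.80, "bedroom": 0.15, "kitchen": 0.05},
    "plate": {"kitchen": 0.90, "bedroom": 0.08, "bathroom": 0.02},
}

def out_of_place(obj: str, room: str, threshold: float = 0.1) -> bool:
    """Flag an object whose current room is unlikely under the prior."""
    return PRIOR.get(obj, {}).get(room, 0.0) < threshold

def best_context(obj: str) -> str:
    """Propose the most plausible room to reposition a misplaced object to."""
    return max(PRIOR[obj], key=PRIOR[obj].get)

assert out_of_place("towel", "kitchen")     # a towel in the kitchen is misplaced
assert best_context("towel") == "bathroom"  # reposition it to the bathroom
```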
arXiv Detail & Related papers (2022-07-21T21:19:18Z)
- Are We There Yet? Learning to Localize in Embodied Instruction Following [1.7300690315775575]
Action Learning From Realistic Environments and Directives (ALFRED) is a recently proposed benchmark for this problem.
Key challenges for this task include localizing target locations and navigating to them through visual inputs.
We augment the agent's field of view during navigation subgoals with multiple viewing angles, and train the agent to predict its relative spatial relation to the target location at each timestep.
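A hedged sketch of such an auxiliary objective: features from the widened field of view are pooled across viewing angles and a small head regresses the agent's offset to the target at each timestep (pooling choice and layer sizes are our assumptions).

```python
import torch.nn as nn

class RelativeLocationHead(nn.Module):
    """Illustrative auxiliary head: pool visual features over V viewing
    angles and predict the agent's relative (dx, dy) offset to the target
    location. Architecture details are assumptions, not the paper's code."""

    def __init__(self, d_feat: int = 512):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(d_feat, 256), nn.ReLU(), nn.Linear(256, 2)
        )

    def forward(self, view_feats):
        # view_feats: (B, V, d), one feature per viewing angle
        pooled = view_feats.mean(dim=1)  # average over the V views
        return self.head(pooled)         # (B, 2) predicted (dx, dy) offset

# Trained with e.g. nn.MSELoss() against ground-truth offsets at each
# timestep, alongside the main instruction-following objective.
```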
arXiv Detail & Related papers (2021-01-09T21:49:41Z)