ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
- URL: http://arxiv.org/abs/2407.18550v1
- Date: Fri, 26 Jul 2024 07:00:27 GMT
- Title: ReALFRED: An Embodied Instruction Following Benchmark in Photo-Realistic Environments
- Authors: Taewoong Kim, Cheolhong Min, Byeonghwi Kim, Jinyeon Kim, Wonje Jeung, Jonghyun Choi
- Abstract summary: We propose the ReALFRED benchmark, which employs real-world scenes, objects, and room layouts to train agents to complete household tasks.
Specifically, we extend the ALFRED benchmark with updates for larger environmental spaces with smaller visual domain gaps.
With ReALFRED, we analyze methods previously developed for the ALFRED benchmark and observe that they consistently yield lower performance on all metrics.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Simulated virtual environments have been widely used to train robotic agents that perform daily household tasks. These environments have driven research progress, but often provide limited object interactability, visual appearance that differs from real-world environments, or relatively small environment sizes. This prevents models learned in virtual scenes from being readily deployable. To bridge the gap between these learning environments and deployment (i.e., real) environments, we propose the ReALFRED benchmark, which employs real-world scenes, objects, and room layouts to train agents to complete household tasks by understanding free-form language instructions and interacting with objects in large, multi-room, 3D-captured scenes. Specifically, we extend the ALFRED benchmark with larger environmental spaces and smaller visual domain gaps. With ReALFRED, we analyze methods previously developed for the ALFRED benchmark and observe that they consistently yield lower performance on all metrics, encouraging the community to develop methods for more realistic environments. Our code and data are publicly available.
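ALFRED-style benchmarks typically report a task success rate together with a path-length-weighted variant that discounts each success by how efficiently the agent moved relative to an expert demonstration. A minimal sketch of both metrics; the episode fields below are assumptions for illustration, not ReALFRED's actual data format:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    success: bool         # did the agent satisfy all goal conditions?
    agent_path_len: int   # steps the agent actually took
    expert_path_len: int  # steps in the ground-truth demonstration

def success_rate(episodes: list[Episode]) -> float:
    return sum(e.success for e in episodes) / len(episodes)

def path_length_weighted_sr(episodes: list[Episode]) -> float:
    # Each success is discounted by expert_len / max(agent_len, expert_len),
    # so taking longer than the expert pushes the score toward zero.
    total = 0.0
    for e in episodes:
        if e.success:
            total += e.expert_path_len / max(e.agent_path_len, e.expert_path_len)
    return total / len(episodes)

episodes = [Episode(True, 120, 80), Episode(False, 200, 90), Episode(True, 60, 60)]
print(f"SR: {success_rate(episodes):.2f}, PLW-SR: {path_length_weighted_sr(episodes):.2f}")
```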
Related papers
- Scaling Instructable Agents Across Many Simulated Worlds
Our goal is to develop an agent that can accomplish anything a human can do in any simulated 3D environment.
Our approach focuses on language-driven generality while imposing minimal assumptions.
Our agents interact with environments in real-time using a generic, human-like interface.
arXiv Detail & Related papers (2024-03-13T17:50:32Z)
- Enhancing Graph Representation of the Environment through Local and Cloud Computation
We propose a graph-based representation that provides a semantic representation of robot environments from multiple sources.
To acquire information from the environment, the framework combines classical computer vision tools with modern computer vision cloud services.
The proposed approach also allows us to handle small objects and integrate them into the semantic representation of the environment.
arXiv Detail & Related papers (2023-09-22T08:05:32Z)
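The graph construction details are not given in this summary; as a generic illustration of the idea, a semantic scene graph can be stored as typed nodes (rooms, objects) connected by spatial-relation edges, with each node tagged by the source that produced it (local detector or cloud service). A minimal sketch under those assumptions:

```python
from collections import defaultdict

class SceneGraph:
    """Toy semantic map: nodes are rooms/objects, edges are spatial relations."""

    def __init__(self):
        self.nodes = {}                 # node_id -> attributes
        self.edges = defaultdict(list)  # node_id -> [(relation, node_id)]

    def add_node(self, node_id: str, **attrs):
        self.nodes[node_id] = attrs

    def relate(self, src: str, relation: str, dst: str):
        self.edges[src].append((relation, dst))

    def neighbors(self, node_id: str, relation: str | None = None):
        return [dst for rel, dst in self.edges[node_id]
                if relation is None or rel == relation]

g = SceneGraph()
g.add_node("kitchen", kind="room")
g.add_node("table", kind="object", detector="local")   # classical CV pipeline
g.add_node("mug", kind="object", detector="cloud")     # cloud vision service
g.relate("kitchen", "contains", "table")
g.relate("mug", "on", "table")
print(g.neighbors("kitchen", "contains"))  # ['table']
```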
- ArK: Augmented Reality with Knowledge Interactive Emergent Ability
We develop an infinite agent that learns to transfer knowledge memory from general foundation models to novel domains.
The heart of our approach is an emerging mechanism, dubbed Augmented Reality with Knowledge Inference Interaction (ArK).
We show that our ArK approach, combined with large foundation models, significantly improves the quality of generated 2D/3D scenes.
arXiv Detail & Related papers (2023-05-01T17:57:01Z)
- CoDEPS: Online Continual Learning for Depth Estimation and Panoptic Segmentation
We introduce online continual learning for deep-learning-based monocular depth estimation and panoptic segmentation in new environments.
We propose a novel domain-mixing strategy to generate pseudo-labels to adapt panoptic segmentation.
We explicitly address the limited storage capacity of robotic systems by leveraging sampling strategies for constructing a fixed-size replay buffer.
arXiv Detail & Related papers (2023-03-17T17:31:55Z)
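The summary above does not specify CoDEPS's sampling policy; one standard way to keep a fixed-size replay buffer over an unbounded online stream is reservoir sampling, sketched below as a generic stand-in rather than the paper's actual strategy:

```python
import random

class ReservoirBuffer:
    """Keep a uniform random sample of a stream in O(capacity) memory."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.items = []
        self.n_seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.n_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace an existing element with probability capacity / n_seen,
            # which keeps every stream element equally likely to be retained.
            j = self.rng.randrange(self.n_seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k: int):
        return self.rng.sample(self.items, min(k, len(self.items)))

buf = ReservoirBuffer(capacity=100)
for frame_id in range(10_000):
    buf.add(frame_id)  # in practice: (image, pseudo_label) pairs
print(len(buf.items), buf.sample(3))
```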
- RREx-BoT: Remote Referring Expressions with a Bag of Tricks
We show how a vision-language scoring model can be used to locate objects in unobserved environments.
We demonstrate our model on a real-world TurtleBot platform, highlighting the simplicity and usefulness of the approach.
Our analysis outlines a "bag of tricks" essential for accomplishing this task, from utilizing 3D coordinates and context to generalizing vision-language models to large 3D search spaces.
arXiv Detail & Related papers (2023-01-30T02:19:19Z)
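Stripped to its core, locating a remote referring expression with a vision-language scorer amounts to ranking candidate objects in the map against the expression and returning the best match. A minimal sketch with a stubbed scoring function; the candidate fields and the trivial `score` are placeholder assumptions, not RREx-BoT's actual pipeline:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    object_id: str
    xyz: tuple[float, float, float]  # 3D coordinates, usable as extra context
    crop_embedding: list[float]      # stand-in for precomputed image features

def score(expression: str, cand: Candidate) -> float:
    # Placeholder: a real system would embed the expression and the object
    # crop with a vision-language model and return their similarity.
    return float(len(set(expression.split()) & {cand.object_id}))

def locate(expression: str, candidates: list[Candidate]) -> Candidate:
    # Exhaustively score every candidate in the large 3D search space
    # and return the argmax.
    return max(candidates, key=lambda c: score(expression, c))

cands = [Candidate("mug", (1.0, 0.2, 0.8), []), Candidate("sofa", (4.1, 2.0, 0.3), [])]
print(locate("the red mug on the table", cands).object_id)  # mug
```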
- Phone2Proc: Bringing Robust Robots Into Our Chaotic World
Phone2Proc is a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes.
The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan.
Phone2Proc improves sim-to-real ObjectNav success rate from 34.7% to 70.7%.
arXiv Detail & Related papers (2022-12-08T18:52:27Z)
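Conditioning procedural generation on a scan means holding the scanned structure (walls, large objects) fixed while randomizing everything else to obtain a distribution of training scenes. A hypothetical sketch of that split; the scene fields and value ranges are illustrative assumptions:

```python
import random

def generate_scene(scan: dict, rng: random.Random) -> dict:
    """Keep the scanned wall layout and large-object placement fixed;
    randomize the rest to yield a distribution of training scenes."""
    return {
        "walls": scan["walls"],                  # conditioned: copied verbatim
        "large_objects": scan["large_objects"],  # conditioned: copied verbatim
        "lighting": rng.uniform(0.3, 1.0),       # randomized per scene
        "wall_color": rng.choice(["white", "beige", "gray"]),
        "clutter": [rng.choice(["book", "cup", "plant"])
                    for _ in range(rng.randrange(3, 10))],
    }

scan = {"walls": [((0, 0), (5, 0)), ((5, 0), (5, 4))],
        "large_objects": [{"type": "sofa", "pose": (2.0, 1.0, 0.0)}]}
rng = random.Random(42)
training_scenes = [generate_scene(scan, rng) for _ in range(1000)]
print(training_scenes[0]["clutter"][:3])
```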
- Robot Active Neural Sensing and Planning in Unknown Cluttered Environments
Active sensing and planning in unknown, cluttered environments is an open challenge for robots intended for home service, search and rescue, narrow-passage inspection, and medical assistance.
We present an active neural sensing approach that generates kinematically feasible viewpoint sequences for a robot manipulator with an in-hand camera, gathering the minimum number of observations needed to reconstruct the underlying environment.
Our framework actively collects visual RGB-D observations, aggregates them into a scene representation, and performs object shape inference to avoid unnecessary robot interactions with the environment.
arXiv Detail & Related papers (2022-08-23T16:56:54Z)
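A common skeleton for this kind of active sensing is a greedy next-best-view loop: among kinematically feasible viewpoints, repeatedly pick the one expected to reveal the most unobserved space, and stop once nothing new is gained. A generic sketch in which precomputed visibility sets stand in for the paper's learned scene inference:

```python
def next_best_view(candidates, observed: set) -> tuple:
    # Greedy choice: the viewpoint whose visible cells add the most coverage.
    return max(candidates, key=lambda v: len(v[1] - observed))

def reconstruct(candidates, coverage_goal: int):
    """candidates: list of (viewpoint_name, set_of_visible_cells),
    assumed to be pre-filtered for kinematic feasibility."""
    observed: set = set()
    plan = []
    while len(observed) < coverage_goal and candidates:
        view = next_best_view(candidates, observed)
        if not (view[1] - observed):  # no new information left to gain
            break
        plan.append(view[0])
        observed |= view[1]
        candidates.remove(view)
    return plan, observed

views = [("front", {1, 2, 3}), ("left", {3, 4}), ("top", {1, 4, 5, 6})]
plan, seen = reconstruct(views, coverage_goal=6)
print(plan, sorted(seen))  # ['top', 'front'] [1, 2, 3, 4, 5, 6]
```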
- SILG: The Multi-environment Symbolic Interactive Language Grounding Benchmark
We propose the multi-environment Symbolic Interactive Language Grounding benchmark (SILG).
SILG consists of grid-world environments that require generalization to new dynamics, entities, and partially observed worlds (RTFM, Messenger, NetHack).
We evaluate recent advances such as egocentric local convolution, recurrent state-tracking, entity-centric attention, and pretrained language models using SILG.
arXiv Detail & Related papers (2021-10-20T17:02:06Z)
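A multi-environment benchmark of this kind typically hides heterogeneous environments behind one shared symbolic observation/action interface so that a single agent architecture can run on all of them. A minimal sketch of such a wrapper; this is a hypothetical interface, not SILG's actual API:

```python
from abc import ABC, abstractmethod

class SymbolicEnv(ABC):
    """Shared interface: every environment yields a symbolic grid plus text."""

    @abstractmethod
    def reset(self) -> dict: ...  # {"grid": ..., "text": ...}

    @abstractmethod
    def step(self, action: int) -> tuple: ...  # (obs, reward, done)

class ToyGridEnv(SymbolicEnv):
    """Stand-in for one backend (RTFM, Messenger, NetHack, ...)."""

    def reset(self) -> dict:
        self.pos = 0
        return {"grid": [["agent", ".", "goal"]], "text": "reach the goal"}

    def step(self, action: int) -> tuple:
        self.pos += 1 if action == 1 else 0
        done = self.pos >= 2
        obs = {"grid": [["agent" if i == self.pos else "." for i in range(3)]],
               "text": "reach the goal"}
        return obs, float(done), done

env: SymbolicEnv = ToyGridEnv()  # agent code only ever sees SymbolicEnv
obs = env.reset()
obs, reward, done = env.step(1)
```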
- Evaluating Continual Learning Algorithms by Generating 3D Virtual Environments
Continual learning refers to the ability of humans and animals to incrementally learn over time in a given environment.
We propose to leverage recent advances in 3D virtual environments in order to approach the automatic generation of potentially life-long dynamic scenes with photo-realistic appearance.
A novel element of this paper is that scenes are described in a parametric way, thus allowing the user to fully control the visual complexity of the input stream the agent perceives.
arXiv Detail & Related papers (2021-09-16T10:37:21Z)
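Describing scenes parametrically means the generator exposes an explicit knob for every factor of variation, so the user can dial the visual complexity of the life-long stream. A sketch of what such a parameterization might look like; all fields are illustrative assumptions, not the paper's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class SceneParams:
    """Knobs controlling the visual complexity of the generated stream."""
    n_objects: int = 10
    object_classes: list[str] = field(default_factory=lambda: ["chair", "lamp"])
    texture_variety: float = 0.5       # 0 = uniform textures, 1 = maximal variety
    lighting_changes_per_min: float = 0.1
    object_motion_speed: float = 0.0   # > 0 makes the scene dynamic

def curriculum(n_stages: int) -> list[SceneParams]:
    # Gradually ramp up complexity across a life-long stream.
    return [SceneParams(n_objects=5 * (s + 1),
                        texture_variety=s / max(n_stages - 1, 1),
                        object_motion_speed=0.1 * s)
            for s in range(n_stages)]

for params in curriculum(3):
    print(params.n_objects, params.texture_variety, params.object_motion_speed)
```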
- iGibson, a Simulation Environment for Interactive Tasks in Large Realistic Scenes
iGibson is a novel simulation environment to develop robotic solutions for interactive tasks in large-scale realistic scenes.
Our environment contains fifteen fully interactive home-sized scenes populated with rigid and articulated objects.
We show that iGibson features enable the generalization of navigation agents, and that the human-iGibson interface and integrated motion planners facilitate efficient imitation learning of simple human-demonstrated behaviors.
arXiv Detail & Related papers (2020-12-05T02:14:17Z)
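Imitation learning from human demonstrations, as mentioned for the human-iGibson interface, reduces in its simplest form to behavior cloning: store (observation, action) pairs from a person and act like the most similar one. A deliberately simple non-parametric sketch, not iGibson's actual API:

```python
import math

class NearestNeighborPolicy:
    """Non-parametric behavior cloning: act like the closest demonstration."""

    def __init__(self):
        self.demos: list[tuple[tuple, str]] = []  # (observation, action)

    def record(self, observation: tuple, action: str):
        self.demos.append((observation, action))

    def act(self, observation: tuple) -> str:
        # Return the action taken in the most similar demonstrated state.
        return min(self.demos, key=lambda d: math.dist(d[0], observation))[1]

policy = NearestNeighborPolicy()
# Demonstrations: (x, y) position -> action taken by the human demonstrator.
policy.record((0.0, 0.0), "forward")
policy.record((1.0, 0.0), "turn_left")
print(policy.act((0.9, 0.1)))  # turn_left
```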