Improving the Robustness to Variations of Objects and Instructions with
a Neuro-Symbolic Approach for Interactive Instruction Following
- URL: http://arxiv.org/abs/2110.07031v1
- Date: Wed, 13 Oct 2021 21:00:00 GMT
- Title: Improving the Robustness to Variations of Objects and Instructions with
a Neuro-Symbolic Approach for Interactive Instruction Following
- Authors: Kazutoshi Shinoda and Yuki Takezawa and Masahiro Suzuki and Yusuke
Iwasawa and Yutaka Matsuo
- Abstract summary: An interactive instruction following task has been proposed as a benchmark for learning to map natural language instructions and first-person vision into sequences of actions.
We find that an existing end-to-end neural model for this task is not robust to variations of objects and language instructions.
We propose a neuro-symbolic approach that performs reasoning over high-level symbolic representations that are robust to small changes in raw inputs.
- Score: 23.197640949226756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: An interactive instruction following task has been proposed as a benchmark
for learning to map natural language instructions and first-person vision into
sequences of actions to interact with objects in a 3D simulated environment. We
find that an existing end-to-end neural model for this task is not robust to
variations of objects and language instructions. We assume that this problem is
due to the high sensitivity of neural feature extraction to small changes in
vision and language inputs. To mitigate this problem, we propose a
neuro-symbolic approach that performs reasoning over high-level symbolic
representations that are robust to small changes in raw inputs. Our experiments
on the ALFRED dataset show that our approach significantly outperforms the
existing model by 18, 52, and 73 points in the success rate on the
ToggleObject, PickupObject, and SliceObject subtasks in unseen environments,
respectively.
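To make the idea concrete, here is a minimal, hypothetical sketch of a neuro-symbolic pipeline of this kind: neural modules (not shown) would map raw vision and language into discrete symbols, and a rule-based planner reasons only over those symbols, so small perturbations of the raw inputs that do not change the symbols cannot change the plan. The class names, rules, and action strings below are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a hypothetical symbolic layer between neural
# perception/parsing and action prediction, in the spirit of a neuro-symbolic
# approach. Names, rules, and actions are invented for illustration.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SymbolicObservation:
    """Discrete symbols produced by a (neural) object detector."""
    object_class: str              # e.g. "Apple", "Knife", "DeskLamp"
    is_visible: bool
    is_toggled: Optional[bool] = None


@dataclass
class SymbolicSubgoal:
    """Discrete symbols produced by a (neural) instruction parser."""
    subtask: str                   # e.g. "PickupObject", "SliceObject", "ToggleObject"
    target_class: str


def plan_actions(subgoal: SymbolicSubgoal,
                 observations: List[SymbolicObservation]) -> List[str]:
    """Map symbols to low-level actions with simple rules.

    Because the rules only see discrete symbols, small changes in the raw image
    or in the instruction wording (absorbed upstream by the neural modules)
    cannot change the resulting plan.
    """
    target = next((o for o in observations
                   if o.object_class == subgoal.target_class and o.is_visible), None)
    if target is None:
        return ["RotateRight"]     # keep searching for the target object
    if subgoal.subtask == "PickupObject":
        return ["MoveAhead", "PickupObject"]
    if subgoal.subtask == "SliceObject":
        return ["MoveAhead", "SliceObject"]
    if subgoal.subtask == "ToggleObject":
        return ["MoveAhead", "ToggleObjectOff" if target.is_toggled else "ToggleObjectOn"]
    return []


# Example: "cut the apple" and "slice the apple" should parse to the same symbols,
# so the symbolic planner produces the same actions for both phrasings.
obs = [SymbolicObservation("Apple", is_visible=True)]
print(plan_actions(SymbolicSubgoal("SliceObject", "Apple"), obs))
```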
Related papers
- IAAO: Interactive Affordance Learning for Articulated Objects in 3D Environments [56.85804719947]
We present IAAO, a framework that builds an explicit 3D model for intelligent agents to gain understanding of articulated objects in their environment through interaction.
We first build hierarchical features and label fields for each object state using 3D Gaussian Splatting (3DGS) by distilling mask features and view-consistent labels from multi-view images.
We then perform object- and part-level queries on the 3D Gaussian primitives to identify static and articulated elements, estimating global transformations and local articulation parameters along with affordances.
arXiv Detail & Related papers (2025-04-09T12:36:48Z)
- Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange [50.45953583802282]
We introduce a novel self-supervised learning (SSL) strategy for point cloud scene understanding.
Our approach leverages both object patterns and contextual cues to produce robust features.
Our experiments demonstrate the superiority of our method over existing SSL techniques.
arXiv Detail & Related papers (2024-04-11T06:39:53Z)
- OSCaR: Object State Captioning and State Change Representation [52.13461424520107]
This paper introduces the Object State Captioning and State Change Representation (OSCaR) dataset and benchmark.
OSCaR consists of 14,084 annotated video segments with nearly 1,000 unique objects from various egocentric video collections.
It sets a new testbed for evaluating multimodal large language models (MLLMs).
arXiv Detail & Related papers (2024-02-27T01:48:19Z)
- Controllable Human-Object Interaction Synthesis [77.56877961681462]
We propose Controllable Human-Object Interaction Synthesis (CHOIS) to generate synchronized object motion and human motion in 3D scenes.
Here, language descriptions inform style and intent, and waypoints, which can be effectively extracted from high-level planning, ground the motion in the scene.
Our module seamlessly integrates with a path planning module, enabling the generation of long-term interactions in 3D environments.
arXiv Detail & Related papers (2023-12-06T21:14:20Z)
- Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge [62.981429762309226]
The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans virtually.
We propose to improve phrase grounding models' ability to localize the active objects by learning the role of objects undergoing change and extracting them accurately from the instructions.
We evaluate our framework on Ego4D and Epic-Kitchens datasets.
arXiv Detail & Related papers (2023-10-23T16:14:05Z)
- Learning Neuro-symbolic Programs for Language Guided Robot Manipulation [10.287265801542999]
Given a natural language instruction, and an input and an output scene, our goal is to train a neuro-symbolic model which can output a manipulation program.
Prior approaches for this task either rely on hand-coded symbols for concepts, limiting generalization beyond those seen during training, or require dense sub-goal supervision.
Our approach is neuro-symbolic and can handle linguistic as well as perceptual variations, is end-to-end differentiable requiring no intermediate supervision, and makes use of symbolic reasoning constructs which operate on a latent neural object-centric representation.
arXiv Detail & Related papers (2022-11-12T12:31:17Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Episodic Transformer for Vision-and-Language Navigation [142.6236659368177]
This paper focuses on addressing two challenges: handling long sequences of subtasks, and understanding complex human instructions.
We propose Episodic Transformer (E.T.), a multimodal transformer that encodes language inputs and the full episode history of visual observations and actions.
Our approach sets a new state of the art on the challenging ALFRED benchmark, achieving 38.4% and 8.5% task success rates on seen and unseen test splits.
arXiv Detail & Related papers (2021-05-13T17:51:46Z)
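As a rough, hypothetical illustration of the episodic encoding described in the entry above (not the released E.T. code), the PyTorch sketch below embeds language tokens, per-frame visual features, and past actions, tags each with a modality embedding, and encodes the whole episode jointly with a transformer before predicting the next action. All dimensions, vocabulary sizes, and layer counts are made-up assumptions; positional embeddings are omitted for brevity.

```python
# Minimal, hypothetical sketch of an E.T.-style multimodal episode encoder.
import torch
import torch.nn as nn


class EpisodicEncoder(nn.Module):
    def __init__(self, vocab_size=1000, num_actions=12, visual_dim=512, d_model=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, d_model)
        self.action_emb = nn.Embedding(num_actions, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Modality tags: 0 = language, 1 = vision, 2 = action
        self.modality_emb = nn.Embedding(3, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.action_head = nn.Linear(d_model, num_actions)

    def forward(self, words, frames, actions):
        # words: (B, Lw) token ids; frames: (B, T, visual_dim); actions: (B, T) action ids
        lang = self.word_emb(words) + self.modality_emb(torch.zeros_like(words))
        vis = self.visual_proj(frames) + self.modality_emb(
            torch.ones(frames.shape[:2], dtype=torch.long, device=frames.device))
        act = self.action_emb(actions) + self.modality_emb(torch.full_like(actions, 2))
        # Encode the instruction and the full episode history as one sequence.
        seq = torch.cat([lang, vis, act], dim=1)
        enc = self.encoder(seq)
        # Predict the next action from the encoding of the latest visual frame.
        return self.action_head(enc[:, words.shape[1] + frames.shape[1] - 1])


model = EpisodicEncoder()
logits = model(torch.randint(0, 1000, (2, 8)),   # language tokens
               torch.randn(2, 5, 512),            # 5 past visual frames
               torch.randint(0, 12, (2, 5)))      # 5 past actions
print(logits.shape)  # torch.Size([2, 12])
```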