Switching Head-Tail Funnel UNITER for Dual Referring Expression
Comprehension with Fetch-and-Carry Tasks
- URL: http://arxiv.org/abs/2307.07166v1
- Date: Fri, 14 Jul 2023 05:27:56 GMT
- Title: Switching Head-Tail Funnel UNITER for Dual Referring Expression
Comprehension with Fetch-and-Carry Tasks
- Authors: Ryosuke Korekata, Motonari Kambara, Yu Yoshida, Shintaro Ishikawa,
Yosuke Kawasaki, Masaki Takahashi, Komei Sugiura
- Abstract summary: This paper describes a domestic service robot (DSR) that fetches everyday objects and carries them to specified destinations according to free-form natural language instructions.
Most of the existing multimodal language understanding methods are impractical in terms of computational complexity.
We propose Switching Head-Tail Funnel UNITER, which solves the task by predicting the target object and the destination individually using a single model.
- Score: 3.248019437833647
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes a domestic service robot (DSR) that fetches everyday
objects and carries them to specified destinations according to free-form
natural language instructions. Given an instruction such as "Move the bottle on
the left side of the plate to the empty chair," the DSR is expected to identify
the bottle and the chair from multiple candidates in the environment and carry
the target object to the destination. Most of the existing multimodal language
understanding methods are impractical in terms of computational complexity
because they require inferences for all combinations of target object
candidates and destination candidates. We propose Switching Head-Tail Funnel
UNITER, which solves the task by predicting the target object and the
destination individually using a single model. Our method is validated on a
newly-built dataset consisting of object manipulation instructions and semi
photo-realistic images captured in a standard Embodied AI simulator. The
results show that our method outperforms the baseline method in terms of
language comprehension accuracy. Furthermore, we conduct physical experiments
in which a DSR delivers standardized everyday objects in a standardized
domestic environment as requested by instructions with referring expressions.
The experimental results show that the object grasping and placing actions are
achieved with success rates of more than 90%.
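To make the abstract's efficiency argument concrete: instead of running an inference for every (target object, destination) pair, a single model can be "switched" between the two prediction modes so each candidate set is scored separately. The sketch below is an illustrative PyTorch toy, not the authors' Switching Head-Tail Funnel UNITER architecture; the module names, switch-embedding design, and dimensions are assumptions made for this example.

```python
# Illustrative sketch only (PyTorch), not the authors' released code: a single
# scorer that is "switched" between predicting the target object and the
# destination, so candidates are ranked separately instead of enumerating
# every (target, destination) pair. Names and sizes are assumptions.
import torch
import torch.nn as nn

class SwitchingScorer(nn.Module):
    def __init__(self, d_model: int = 256):
        super().__init__()
        self.switch = nn.Embedding(2, d_model)  # 0: target object, 1: destination
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.score_head = nn.Linear(d_model, 1)

    def forward(self, text_emb, region_emb, mode: int):
        # text_emb:   (B, L_t, d) encoded instruction tokens
        # region_emb: (B, L_r, d) features of candidate regions
        batch = text_emb.size(0)
        switch = self.switch.weight[mode].expand(batch, 1, -1)
        hidden = self.encoder(torch.cat([switch, text_emb, region_emb], dim=1))
        region_hidden = hidden[:, 1 + text_emb.size(1):]   # keep region positions
        return self.score_head(region_hidden).squeeze(-1)  # (B, L_r) scores

# Toy usage: the same weights answer both questions about one instruction.
model = SwitchingScorer()
text = torch.randn(1, 12, 256)     # e.g. "Move the bottle ... to the empty chair"
objects = torch.randn(1, 5, 256)   # 5 target-object candidates
places = torch.randn(1, 3, 256)    # 3 destination candidates
target_idx = model(text, objects, mode=0).argmax(dim=-1)
destination_idx = model(text, places, mode=1).argmax(dim=-1)
print(target_idx.item(), destination_idx.item())
```

Because each candidate set is scored independently under one shared encoder, inference cost grows with the number of candidates rather than with the number of candidate pairs, which is the practical point the abstract makes.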
Related papers
- NaturalVLM: Leveraging Fine-grained Natural Language for
Affordance-Guided Visual Manipulation [21.02437461550044]
Many real-world tasks demand intricate multi-step reasoning.
We introduce a benchmark, NrVLM, comprising 15 distinct manipulation tasks.
We propose a novel learning framework that completes the manipulation task step-by-step according to the fine-grained instructions.
arXiv Detail & Related papers (2024-03-13T09:12:16Z)
- OpenFMNav: Towards Open-Set Zero-Shot Object Navigation via
Vision-Language Foundation Models [16.50443396055173]
We propose OpenFMNav, an Open-set Foundation Model based framework for zero-shot object navigation.
We first unleash the reasoning abilities of large language models to extract proposed objects from natural language instructions.
We then leverage the generalizability of large vision language models to actively discover and detect candidate objects from the scene.
arXiv Detail & Related papers (2024-02-16T13:21:33Z)
- Learning-To-Rank Approach for Identifying Everyday Objects Using a
Physical-World Search Engine [0.8749675983608172]
We focus on the task of retrieving target objects from open-vocabulary user instructions in a human-in-the-loop setting.
We propose MultiRankIt, which is a novel approach for the learning-to-rank physical objects task.
arXiv Detail & Related papers (2023-12-26T01:40:31Z)
- Goal Representations for Instruction Following: A Semi-Supervised
Language Interface to Control [58.06223121654735]
We show a method that taps into joint image- and goal- conditioned policies with language using only a small amount of language data.
Our method achieves robust performance in the real world by learning an embedding from the labeled data that aligns language not to the goal image itself, but rather to the desired change between the start and goal images that the instruction corresponds to.
We show instruction following across a variety of manipulation tasks in different scenes, with generalization to language instructions outside of the labeled data.
arXiv Detail & Related papers (2023-06-30T20:09:39Z)
- Learning to Solve Voxel Building Embodied Tasks from Pixels and Natural
Language Instructions [53.21504989297547]
We propose a new method that combines a language model and reinforcement learning for the task of building objects in a Minecraft-like environment.
Our method first generates a set of consistently achievable sub-goals from the instructions and then completes associated sub-tasks with a pre-trained RL policy.
arXiv Detail & Related papers (2022-11-01T18:30:42Z)
- Grounding Language with Visual Affordances over Unstructured Data [26.92329260907805]
We propose a novel approach to efficiently learn language-conditioned robot skills from unstructured, offline and reset-free data.
We exploit a self-supervised visuo-lingual affordance model, which requires as little as 1% of the total data with language.
We find that our method is capable of completing long-horizon, multi-tier tasks in the real world, while requiring an order of magnitude less data than previous approaches.
arXiv Detail & Related papers (2022-10-04T21:16:48Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and
Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)
- INVIGORATE: Interactive Visual Grounding and Grasping in Clutter [56.00554240240515]
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, visual grounding, question generation, and object blocking relationship (OBR) detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z)
- Target-dependent UNITER: A Transformer-Based Multimodal Language
Comprehension Model for Domestic Service Robots [0.0]
We propose Target-dependent UNITER, which learns the relationship between the target object and other objects directly by focusing on the relevant regions within an image.
Our method is an extension of the UNITER-based Transformer that can be pretrained on general-purpose datasets.
Our model is validated on two standard datasets, and the results show that Target-dependent UNITER outperforms the baseline method in terms of classification accuracy.
arXiv Detail & Related papers (2021-07-02T03:11:02Z)
- Object-and-Action Aware Model for Visual Language Navigation [70.33142095637515]
Vision-and-Language Navigation (VLN) is unique in that it requires turning relatively general natural-language instructions into robot agent actions.
We propose an Object-and-Action Aware Model (OAAM) that processes these two different forms of natural language based instruction separately.
This enables each process to match object-centered/action-centered instruction to their own counterpart visual perception/action orientation flexibly.
arXiv Detail & Related papers (2020-07-29T06:32:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.