Learning Flexible Translation between Robot Actions and Language Descriptions
- URL: http://arxiv.org/abs/2207.07437v1
- Date: Fri, 15 Jul 2022 12:37:05 GMT
- Title: Learning Flexible Translation between Robot Actions and Language Descriptions
- Authors: Ozan Özdemir, Matthias Kerzel, Cornelius Weber, Jae Hee Lee, Stefan Wermter
- Abstract summary: We propose paired gated autoencoders (PGAE) for flexible translation between robot actions and language descriptions.
We train our model in an end-to-end fashion by pairing each action with appropriate descriptions that contain a signal informing about the translation direction.
With the option to use a pretrained language model as the language encoder, our model has the potential to recognise unseen natural language input.
- Score: 16.538887534958555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Handling various robot action-language translation tasks flexibly is an
essential requirement for natural interaction between a robot and a human.
Previous approaches require changes in the configuration of the model
architecture per task during inference, which undermines the premise of
multi-task learning. In this work, we propose the paired gated autoencoders
(PGAE) for flexible translation between robot actions and language descriptions
in a tabletop object manipulation scenario. We train our model in an end-to-end
fashion by pairing each action with appropriate descriptions that contain a
signal informing about the translation direction. During inference, our model
can flexibly translate from action to language and vice versa according to the
given language signal. Moreover, with the option to use a pretrained language
model as the language encoder, our model has the potential to recognise unseen
natural language input. Another capability of our model is that it can
recognise and imitate actions of another agent by utilising robot
demonstrations. The experimental results highlight the flexible bidirectional
translation capabilities of our approach, alongside the ability to
generalise to the actions of the opposite-sitting agent.
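To make the described mechanism concrete, here is a minimal PyTorch sketch of the core idea: paired language and action encoders, a gated multimodal bottleneck, and paired decoders, with the translation direction assumed to be signalled inside the description itself. All module names, dimensions, and the exact gating scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a paired gated autoencoder for action<->language translation.
# Assumptions: GRU encoders/decoders, a sigmoid gate over the two modality codes,
# and teacher forcing during decoding. Not the published PGAE architecture.
import torch
import torch.nn as nn


class PairedGatedAutoencoder(nn.Module):
    def __init__(self, vocab_size: int, action_dim: int, hidden: int = 128):
        super().__init__()
        # Language branch: token embedding + GRU encoder, GRU decoder over the vocabulary.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lang_enc = nn.GRU(hidden, hidden, batch_first=True)
        self.lang_dec = nn.GRU(hidden, hidden, batch_first=True)
        self.lang_out = nn.Linear(hidden, vocab_size)
        # Action branch: GRU encoder/decoder over joint-angle (or end-effector) vectors.
        self.act_enc = nn.GRU(action_dim, hidden, batch_first=True)
        self.act_dec = nn.GRU(action_dim, hidden, batch_first=True)
        self.act_out = nn.Linear(hidden, action_dim)
        # Gated bottleneck: a learned gate mixes the two modality codes into one
        # shared representation, so a single forward pass serves both directions.
        self.gate = nn.Linear(2 * hidden, hidden)
        self.fuse = nn.Linear(2 * hidden, hidden)

    def forward(self, tokens, actions):
        # Encode each modality into a fixed-size code (last GRU hidden state).
        _, h_lang = self.lang_enc(self.embed(tokens))
        _, h_act = self.act_enc(actions)
        both = torch.cat([h_lang[-1], h_act[-1]], dim=-1)
        g = torch.sigmoid(self.gate(both))  # per-dimension gate in [0, 1]
        shared = g * h_lang[-1] + (1 - g) * torch.tanh(self.fuse(both))
        # Decode both modalities from the shared code. The description is assumed
        # to carry the translation-direction signal (e.g. a leading "describe" /
        # "execute" token), so no architectural switch is needed at inference.
        h0 = shared.unsqueeze(0)
        lang_feat, _ = self.lang_dec(self.embed(tokens), h0)
        act_feat, _ = self.act_dec(actions, h0)
        return self.lang_out(lang_feat), self.act_out(act_feat)


# Toy usage: batch of 2 samples, 5 description tokens, 10 action steps of 6 joints.
model = PairedGatedAutoencoder(vocab_size=50, action_dim=6)
tokens = torch.randint(0, 50, (2, 5))
actions = torch.randn(2, 10, 6)
lang_logits, act_pred = model(tokens, actions)
print(lang_logits.shape, act_pred.shape)  # torch.Size([2, 5, 50]) torch.Size([2, 10, 6])
```

Training such a model end-to-end on paired (action, description) sequences, with reconstruction losses on both decoder outputs, is one plausible reading of the end-to-end pairing strategy described in the abstract.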
Related papers
- Context-Aware Command Understanding for Tabletop Scenarios [1.7082212774297747]
This paper presents a novel hybrid algorithm designed to interpret natural human commands in tabletop scenarios.
By integrating multiple sources of information, including speech, gestures, and scene context, the system extracts actionable instructions for a robot.
We discuss the strengths and limitations of the system, with particular focus on how it handles multimodal command interpretation.
arXiv Detail & Related papers (2024-10-08T20:46:39Z)
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153]
We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
arXiv Detail & Related papers (2023-07-28T21:18:02Z)
- Contrastive Language, Action, and State Pre-training for Robot Learning [1.1000499414131326]
We introduce a method for unifying language, action, and state information in a shared embedding space to facilitate a range of downstream tasks in robot learning.
Our method, Contrastive Language, Action, and State Pre-training (CLASP), extends the CLIP formulation by incorporating distributional learning, capturing the inherent complexities and one-to-many relationships in behaviour-text alignment.
We demonstrate the utility of our method for the following downstream tasks: zero-shot text-behaviour retrieval, captioning unseen robot behaviours, and learning a behaviour prior to language-conditioned reinforcement learning.
arXiv Detail & Related papers (2023-04-21T07:19:33Z)
- PaLM-E: An Embodied Multimodal Language Model [101.29116156731762]
We propose embodied language models to incorporate real-world continuous sensor modalities into language models.
We train these encodings end-to-end, in conjunction with a pre-trained large language model, for multiple embodied tasks.
Our largest model, PaLM-E-562B with 562B parameters, is a visual-language generalist with state-of-the-art performance on OK-VQA.
arXiv Detail & Related papers (2023-03-06T18:58:06Z)
- Language-Driven Representation Learning for Robotics [115.93273609767145]
Recent work in visual representation learning for robotics demonstrates the viability of learning from large video datasets of humans performing everyday tasks.
We introduce a framework for language-driven representation learning from human videos and captions.
We find that Voltron's language-driven learning outperforms the prior state-of-the-art, especially on targeted problems requiring higher-level control.
arXiv Detail & Related papers (2023-02-24T17:29:31Z)
- "No, to the Right" -- Online Language Corrections for Robotic Manipulation via Shared Autonomy [70.45420918526926]
We present LILAC, a framework for incorporating and adapting to natural language corrections online during execution.
Instead of discrete turn-taking between a human and robot, LILAC splits agency between the human and robot.
We show that our corrections-aware approach obtains higher task completion rates, and is subjectively preferred by users.
arXiv Detail & Related papers (2023-01-06T15:03:27Z)
- Do As I Can, Not As I Say: Grounding Language in Robotic Affordances [119.29555551279155]
Large language models can encode a wealth of semantic knowledge about the world.
Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language.
We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions.
arXiv Detail & Related papers (2022-04-04T17:57:11Z)
- Reshaping Robot Trajectories Using Natural Language Commands: A Study of Multi-Modal Data Alignment Using Transformers [33.7939079214046]
We provide a flexible language-based interface for human-robot collaboration.
We take advantage of recent advancements in the field of large language models to encode the user command.
We train the model using imitation learning over a dataset containing robot trajectories modified by language commands.
arXiv Detail & Related papers (2022-03-25T01:36:56Z)
- Language Model-Based Paired Variational Autoencoders for Robotic Language Learning [18.851256771007748]
Similar to human infants, artificial agents can learn language while interacting with their environment.
We present a neural model that bidirectionally binds robot actions and their language descriptions in a simple object manipulation scenario.
Next, we introduce PVAE-BERT, which equips the model with a pretrained large-scale language model.
arXiv Detail & Related papers (2022-01-17T10:05:26Z)
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912]
We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
We propose to leverage offline robot datasets with crowd-sourced natural language labels.
We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
arXiv Detail & Related papers (2021-09-02T17:42:13Z)