Caption Generation of Robot Behaviors based on Unsupervised Learning of
Action Segments
- URL: http://arxiv.org/abs/2003.10066v1
- Date: Mon, 23 Mar 2020 03:44:56 GMT
- Title: Caption Generation of Robot Behaviors based on Unsupervised Learning of
Action Segments
- Authors: Koichiro Yoshino, Kohei Wakimoto, Yuta Nishimura, Satoshi Nakamura
- Abstract summary: Bridging robot action sequences and their natural language captions is an important task for increasing the explainability of human-assisting robots.
In this paper, we propose a system for generating natural language captions that describe the behaviors of human-assisting robots.
- Score: 10.356412004005767
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bridging robot action sequences and their natural language captions is an
important task for increasing the explainability of human-assisting robots in this
rapidly evolving field. In this paper, we propose a system for generating natural
language captions that describe the behaviors of human-assisting robots. The system
describes robot actions using robot observations, i.e., histories from actuator
systems and cameras, toward end-to-end bridging between robot actions and natural
language captions. Two issues make it challenging to apply existing
sequence-to-sequence models to this mapping: 1) it is hard to prepare a large-scale
dataset for every kind of robot and environment, and 2) there is a gap between the
number of samples obtained from robot action observations and the number of words in
the generated captions. We introduce unsupervised segmentation based on K-means
clustering to unify typical robot observation patterns into classes, which makes it
possible for the network to learn the relationship from a small amount of data.
Moreover, we use a chunking method based on byte-pair encoding (BPE) to fill the gap
between the number of robot action observation samples and the number of words in a
caption. We also apply an attention mechanism to the segmentation task. Experimental
results show that the proposed model based on unsupervised learning can generate
better descriptions than other methods. We also show that the attention mechanism
did not work well in our low-resource setting.
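As a rough illustration of the two ideas described in the abstract, the sketch below (a minimal Python example, not the authors' implementation; the window size, feature dimensionality, number of clusters, and merge count are all illustrative assumptions) clusters windowed robot observations with K-means to obtain a discrete class per segment, then applies a BPE-style merging of frequent adjacent class pairs to shorten the symbol sequence toward caption length.

```python
# Minimal sketch (not the paper's code): K-means segmentation of robot
# observations into discrete classes, followed by BPE-style chunking of the
# resulting class sequence. Window size, feature dimension, k, and the number
# of merges are illustrative assumptions.
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans


def segment_observations(obs, window=10, k=8, seed=0):
    """Cluster fixed-size windows of observation vectors into k classes.

    obs: array of shape (T, D) holding actuator/camera features per time step.
    Returns one class id per window, i.e. a much shorter discrete sequence.
    """
    T = (len(obs) // window) * window
    windows = obs[:T].reshape(-1, window * obs.shape[1])  # flatten each window
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(windows)
    return labels.tolist()


def bpe_chunk(seqs, num_merges=20):
    """Greedily merge the most frequent adjacent pair of classes (BPE-style),
    shrinking sequence length toward the length of the target captions."""
    seqs = [list(map(str, s)) for s in seqs]
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged = a + "+" + b
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return seqs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    obs = rng.normal(size=(600, 12))      # 600 steps of 12-dim observations (dummy data)
    classes = segment_observations(obs)   # e.g. [3, 3, 1, 7, ...]
    chunks = bpe_chunk([classes])         # shorter chunked class sequence
    print(len(classes), len(chunks[0]))
```

In the paper, the chunked class sequence would then be fed to a sequence-to-sequence captioning model; the merge count here is just a placeholder for whatever ratio brings the observation sequence length closer to typical caption length.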
Related papers
- Adapt2Reward: Adapting Video-Language Models to Generalizable Robotic Rewards via Failure Prompts [21.249837293326497] (2024-07-20)
  Generalizable reward function is central to reinforcement learning and planning for robots.
  This paper transfers video-language models with robust generalization into a language-conditioned reward function.
  Our model shows outstanding generalization to new environments and new instructions for robot planning and reinforcement learning.
- Interactive Planning Using Large Language Models for Partially Observable Robotics Tasks [54.60571399091711] (2023-12-11)
  Large Language Models (LLMs) have achieved impressive results in creating robotic agents for performing open vocabulary tasks.
  We present an interactive planning technique for partially observable tasks using LLMs.
- RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [140.48218261864153] (2023-07-28)
  We study how vision-language models trained on Internet-scale data can be incorporated directly into end-to-end robotic control.
  Our approach leads to performant robotic policies and enables RT-2 to obtain a range of emergent capabilities from Internet-scale training.
- Robot Learning with Sensorimotor Pre-training [98.7755895548928] (2023-06-16)
  We present a self-supervised sensorimotor pre-training approach for robotics.
  Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
  We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527] (2022-11-16)
  We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
  The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
- Summarizing a virtual robot's past actions in natural language [0.3553493344868413] (2022-03-13)
  We show how a popular dataset that matches robot actions with natural language descriptions designed for an instruction following task can be repurposed to serve as a training ground for robot action summarization work.
  We propose and test several methods of learning to generate such summaries, starting from either egocentric video frames of the robot taking actions or intermediate text representations of the actions used by an automatic planner.
- Learning Language-Conditioned Robot Behavior from Offline Data and Crowd-Sourced Annotation [80.29069988090912] (2021-09-02)
  We study the problem of learning a range of vision-based manipulation tasks from a large offline dataset of robot interaction.
  We propose to leverage offline robot datasets with crowd-sourced natural language labels.
  We find that our approach outperforms both goal-image specifications and language conditioned imitation techniques by more than 25%.
- Learning a generative model for robot control using visual feedback [7.171234436165255] (2020-03-10)
  We introduce a novel formulation for incorporating visual feedback in controlling robots.
  Inference in the model allows us to infer the robot state corresponding to target locations of the features.
  We demonstrate the effectiveness of our method by executing grasping and tight-fit insertions on robots with inaccurate controllers.
- Learning Predictive Models From Observation and Interaction [137.77887825854768] (2019-12-30)
  Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
  However, learning a model that captures the dynamics of complex skills represents a major challenge.
  We propose a method to augment the training set with observational data of other agents, such as humans.
This list is automatically generated from the titles and abstracts of the papers on this site.