Robot Sound Interpretation: Learning Visual-Audio Representations for
Voice-Controlled Robots
- URL: http://arxiv.org/abs/2109.02823v1
- Date: Tue, 7 Sep 2021 02:26:54 GMT
- Authors: Peixin Chang, Shuijing Liu, Katherine Driggs-Campbell
- Abstract summary: We learn a representation that associates images and sound commands with minimal supervision.
Using this representation, we generate an intrinsic reward function to learn robotic tasks with reinforcement learning.
We show empirically that our method outperforms previous work across various sound types and robotic tasks.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Inspired by sensorimotor theory, we propose a novel pipeline for
voice-controlled robots. Previous work relies on explicit labels of sounds and
images as well as extrinsic reward functions. Not only do such approaches bear
little resemblance to human sensorimotor development, but they also require
hand-tuned rewards and extensive human labor. To address these problems, we
learn a representation that associates images and sound commands with minimal
supervision. Using this representation, we generate an intrinsic reward
function to learn robotic tasks with reinforcement learning. We demonstrate our
approach on three robot platforms, a TurtleBot3, a Kuka-IIWA arm, and a Kinova
Gen3 robot, which hear a command word, identify the associated target object,
and perform precise control to approach the target. We show empirically that
our method outperforms previous work across various sound types and robotic
tasks. We successfully deploy the policy learned in simulation on a real-world
Kinova Gen3.
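The core idea of the pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the encoders are assumed to map images and sound commands into a shared embedding space, and the intrinsic reward is taken to be the cosine similarity between the current image embedding and the command embedding, so the reward rises as the commanded target comes into view.

```python
import numpy as np

def l2_normalize(v, eps=1e-8):
    """Project a vector onto the unit sphere."""
    return v / (np.linalg.norm(v) + eps)

def intrinsic_reward(image_embedding, sound_embedding):
    """Cosine similarity between the current image embedding and the
    embedding of the heard sound command. High similarity means the
    robot's view matches the commanded target object."""
    a = l2_normalize(image_embedding)
    b = l2_normalize(sound_embedding)
    return float(np.dot(a, b))

# Toy rollout: the reward rises as the image embedding drifts toward the
# sound command's embedding (i.e., the target object comes into view).
rng = np.random.default_rng(0)
sound = rng.normal(size=64)   # embedding of the command word
start = rng.normal(size=64)   # embedding of the initial camera view
for alpha in (0.0, 0.5, 1.0):
    img = (1 - alpha) * start + alpha * sound
    print(f"alpha={alpha:.1f}  reward={intrinsic_reward(img, sound):+.3f}")
```

Because the reward is computed entirely from the learned representation, no hand-tuned extrinsic reward or per-task labeling is needed during reinforcement learning.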
Related papers
- Know your limits! Optimize the robot's behavior through self-awareness [11.021217430606042]
Recent human-robot imitation algorithms focus on following a reference human motion with high precision.
We introduce a deep-learning model that anticipates the robot's performance when imitating a given reference.
Our Self-AWare model (SAW) ranks potential robot behaviors based on various criteria, such as fall likelihood, adherence to the reference motion, and smoothness.
arXiv Detail & Related papers (2024-09-16T14:14:58Z)
- Giving Robots a Hand: Learning Generalizable Manipulation with Eye-in-Hand Human Video Demonstrations [66.47064743686953]
Eye-in-hand cameras have shown promise in enabling greater sample efficiency and generalization in vision-based robotic manipulation.
Videos of humans performing tasks, on the other hand, are much cheaper to collect since they eliminate the need for expertise in robotic teleoperation.
In this work, we augment narrow robotic imitation datasets with broad unlabeled human video demonstrations to greatly enhance the generalization of eye-in-hand visuomotor policies.
arXiv Detail & Related papers (2023-07-12T07:04:53Z)
- Robot Learning with Sensorimotor Pre-training [98.7755895548928]
We present a self-supervised sensorimotor pre-training approach for robotics.
Our model, called RPT, is a Transformer that operates on sequences of sensorimotor tokens.
We find that sensorimotor pre-training consistently outperforms training from scratch, has favorable scaling properties, and enables transfer across different tasks, environments, and robots.
arXiv Detail & Related papers (2023-06-16T17:58:10Z)
- Knowledge-Driven Robot Program Synthesis from Human VR Demonstrations [16.321053835017942]
We present a system for automatically generating executable robot control programs from human task demonstrations in virtual reality (VR).
We leverage common-sense knowledge and game engine-based physics to semantically interpret human VR demonstrations.
We demonstrate our approach in the context of force-sensitive fetch-and-place for a robotic shopping assistant.
arXiv Detail & Related papers (2023-06-05T09:37:53Z)
- Learning Video-Conditioned Policies for Unseen Manipulation Tasks [83.2240629060453]
Video-conditioned Policy learning maps human demonstrations of previously unseen tasks to robot manipulation skills.
We learn our policy to generate appropriate actions given current scene observations and a video of the target task.
We validate our approach on a set of challenging multi-task robot manipulation environments and outperform the state of the art.
arXiv Detail & Related papers (2023-05-10T16:25:42Z)
- Affordances from Human Videos as a Versatile Representation for Robotics [31.248842798600606]
We train a visual affordance model that estimates where and how in the scene a human is likely to interact.
The structure of these behavioral affordances directly enables the robot to perform many complex tasks.
We show the efficacy of our approach, which we call VRB, across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild.
arXiv Detail & Related papers (2023-04-17T17:59:34Z)
- Self-Improving Robots: End-to-End Autonomous Visuomotor Reinforcement Learning [54.636562516974884]
In imitation and reinforcement learning, the cost of human supervision limits the amount of data that robots can be trained on.
In this work, we propose MEDAL++, a novel design for self-improving robotic systems.
The robot autonomously practices the task by learning to both do and undo the task, simultaneously inferring the reward function from the demonstrations.
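The do/undo practice loop described above can be illustrated schematically. The environment, policies, and reward model below are hypothetical placeholders, not the MEDAL++ implementation: the sketch only shows how alternating a forward policy (do the task) with a backward policy (undo it) lets a robot collect experience without human resets, while a learned reward model scores states in place of a hand-written reward.

```python
def autonomous_practice(env, forward_policy, backward_policy,
                        reward_model, episodes=10, horizon=50):
    """Alternate between a forward policy that attempts the task and a
    backward policy that undoes it, so the robot keeps practicing
    without human resets. `reward_model` scores states (e.g. inferred
    from demonstrations) instead of a hand-tuned reward function."""
    transitions = []
    obs = env.reset()
    for episode in range(episodes):
        # Even episodes: do the task. Odd episodes: undo it.
        policy = forward_policy if episode % 2 == 0 else backward_policy
        for _ in range(horizon):
            action = policy(obs)
            next_obs = env.step(action)
            reward = reward_model(next_obs)  # inferred, not hand-written
            transitions.append((obs, action, reward, next_obs))
            obs = next_obs
    return transitions
```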
arXiv Detail & Related papers (2023-03-02T18:51:38Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to ground the rich semantic information in human vocabulary in their own perception and actions.
We develop a simple approach, MOO, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- Learning Reward Functions for Robotic Manipulation by Observing Humans [92.30657414416527]
We use unlabeled videos of humans solving a wide range of manipulation tasks to learn a task-agnostic reward function for robotic manipulation policies.
The learned rewards are based on distances to a goal in an embedding space learned using a time-contrastive objective.
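An embedding-distance reward of this kind can be written down in a few lines. The encoder below is a hypothetical stand-in (frozen random weights, not a trained time-contrastive model); the sketch only shows the reward's shape: the negative distance between the current observation and the goal frame, measured in embedding space rather than pixel space.

```python
import numpy as np

def embedding_distance_reward(phi, observation, goal_observation):
    """Task-agnostic reward: negative Euclidean distance between the
    current observation and the goal frame, measured in the learned
    embedding space rather than in pixel space."""
    return -float(np.linalg.norm(phi(observation) - phi(goal_observation)))

# Stand-in for a trained time-contrastive encoder: any mapping from raw
# observations to embedding vectors would slot in here.
rng = np.random.default_rng(1)
W = rng.normal(size=(32, 8))              # frozen random "encoder" weights
phi = lambda obs: np.tanh(obs @ W)

goal = rng.normal(size=32)                # goal frame
near = goal + 0.01 * rng.normal(size=32)  # almost at the goal
far = rng.normal(size=32)                 # unrelated frame
print(embedding_distance_reward(phi, near, goal))  # small negative value
print(embedding_distance_reward(phi, far, goal))   # larger negative value
```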
arXiv Detail & Related papers (2022-11-16T16:26:48Z)
- Signs of Language: Embodied Sign Language Fingerspelling Acquisition from Demonstrations for Human-Robot Interaction [1.0166477175169308]
We propose an approach for learning dexterous motor imitation from video examples without additional information.
We first build a URDF model of a robotic hand with a single actuator for each joint.
We then leverage pre-trained deep vision models to extract the 3D pose of the hand from RGB videos.
arXiv Detail & Related papers (2022-09-12T10:42:26Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.