Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning
- URL: http://arxiv.org/abs/2205.14850v1
- Date: Mon, 30 May 2022 04:52:58 GMT
- Title: Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning
- Authors: Maximilian Du, Olivia Y. Lee, Suraj Nair, Chelsea Finn
- Abstract summary: We learn a set of challenging partially-observed manipulation tasks from visual and audio inputs.
Our proposed system learns these tasks by combining offline imitation learning from tele-operated demonstrations and online finetuning.
In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by 20%.
- Score: 62.83590925557013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans are capable of completing a range of challenging manipulation tasks
that require reasoning jointly over modalities such as vision, touch, and
sound. Moreover, many such tasks are partially-observed; for example, taking a
notebook out of a backpack will lead to visual occlusion and require reasoning
over the history of audio or tactile information. While robust tactile sensing
can be costly to capture on robots, microphones near or on a robot's gripper
are a cheap and easy way to acquire audio feedback of contact events, which can
be a surprisingly valuable data source for perception in the absence of vision.
Motivated by the potential for sound to mitigate visual occlusion, we aim to
learn a set of challenging partially-observed manipulation tasks from visual
and audio inputs. Our proposed system learns these tasks by combining offline
imitation learning from a modest number of tele-operated demonstrations and
online finetuning using human-provided interventions. In a set of simulated
tasks, we find that our system benefits from using audio, and that by using
online interventions we are able to improve the success rate of offline
imitation learning by ~20%. Finally, we find that our system can complete a set
of challenging, partially-observed tasks on a Franka Emika Panda robot, like
extracting keys from a bag, with a 70% success rate, 50% higher than a policy
that does not use audio.
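
The abstract describes two ingredients: a policy that fuses image and audio observations, and online finetuning of an offline-learned policy from human interventions. The sketch below shows one way such a pipeline could look in PyTorch. It is a minimal illustration under assumed interfaces, not the authors' system: the names AudioVisualPolicy, bc_step, and finetune_with_interventions, the encoder architectures, the input sizes, and the env/buffer API are all hypothetical.

```python
# Minimal sketch (PyTorch) of the two ingredients described in the abstract:
# (1) a policy that fuses image and audio observations, and (2) DAgger-style
# finetuning from human interventions. This is NOT the authors' implementation;
# all module/function names, architectures, input sizes, and the env/buffer
# interfaces below are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualPolicy(nn.Module):
    """Maps an RGB frame plus a mel-spectrogram of recent audio to an action."""

    def __init__(self, action_dim: int = 7):
        super().__init__()
        # Image encoder: (B, 3, 84, 84) RGB frame -> 128-d feature (assumed input size).
        self.img_enc = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(64 * 4 * 4, 128), nn.ReLU(),
        )
        # Audio encoder: (B, 1, n_mels, n_frames) spectrogram of the recent audio
        # history -> 64-d feature; the history is what carries contact information
        # when vision is occluded.
        self.aud_enc = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            nn.Linear(32 * 4 * 4, 64), nn.ReLU(),
        )
        # Fused features -> continuous action (e.g. end-effector delta + gripper).
        self.head = nn.Sequential(
            nn.Linear(128 + 64, 128), nn.ReLU(),
            nn.Linear(128, action_dim),
        )

    def forward(self, image: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.img_enc(image), self.aud_enc(audio)], dim=-1)
        return self.head(fused)


def bc_step(policy, optimizer, batch):
    """One behavior-cloning update on (image, audio, expert action) tuples."""
    pred = policy(batch["image"], batch["audio"])
    loss = F.mse_loss(pred, batch["action"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


def finetune_with_interventions(policy, optimizer, env, buffer, episodes=10):
    """Roll out the policy; whenever the human operator overrides an action,
    store the corrected transition and keep training on the growing dataset.
    `env.get_human_override()` and `buffer` are hypothetical interfaces."""
    for _ in range(episodes):
        obs, done = env.reset(), False
        while not done:
            with torch.no_grad():
                action = policy(obs["image"], obs["audio"])
            correction = env.get_human_override()  # None if no intervention
            if correction is not None:
                buffer.add(obs, correction)        # relabel with the human action
                action = correction
            obs, done = env.step(action)
        for batch in buffer.sample_batches():
            bc_step(policy, optimizer, batch)


# Smoke test of the offline (behavior-cloning) piece with dummy tensors:
policy = AudioVisualPolicy()
opt = torch.optim.Adam(policy.parameters(), lr=1e-4)
dummy = {
    "image": torch.randn(8, 3, 84, 84),
    "audio": torch.randn(8, 1, 64, 100),   # 64 mel bins x 100 frames (assumed)
    "action": torch.randn(8, 7),
}
print(bc_step(policy, opt, dummy))
```

Feeding a spectrogram of the recent audio history, rather than a single instant, is what would let such a policy keep acting while the gripper's contents are visually occluded; the intervention loop then concentrates new supervision on the states where the offline policy fails.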
Related papers
- VITAL: Visual Teleoperation to Enhance Robot Learning through Human-in-the-Loop Corrections [10.49712834719005]
We propose a low-cost visual teleoperation system for bimanual manipulation tasks, called VITAL.
Our approach leverages affordable hardware and visual processing techniques to collect demonstrations.
We enhance the generalizability and robustness of the learned policies by utilizing both real and simulated environments.
arXiv Detail & Related papers (2024-07-30T23:29:47Z)
- ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data [28.36623343236893]
We introduce ManiWAV: an 'ear-in-hand' data collection device to collect in-the-wild human demonstrations with synchronous audio and visual feedback.
We show that our system can generalize to unseen in-the-wild environments by learning from diverse in-the-wild human demonstrations.
arXiv Detail & Related papers (2024-06-27T18:06:38Z)
- Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation [13.026061233933435]
Current paradigms only perform large-scale pretraining for visual representations.
It is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing.
In this paper, we address this gap by using contact microphones as an alternative tactile sensor.
arXiv Detail & Related papers (2024-05-14T13:16:46Z)
- Self-Explainable Affordance Learning with Embodied Caption [63.88435741872204]
We introduce Self-Explainable Affordance learning (SEA) with embodied caption.
SEA enables robots to articulate their intentions and bridge the gap between explainable vision-language caption and visual affordance learning.
We propose a novel model to effectively combine affordance grounding with self-explanation in a simple but efficient manner.
arXiv Detail & Related papers (2024-04-08T15:22:38Z)
- Egocentric Deep Multi-Channel Audio-Visual Active Speaker Localization [13.144367063836597]
We propose a novel end-to-end deep learning approach that is able to give robust voice activity detection and localization results.
Our experiments show that the proposed method gives superior results, can run in real time, and is robust against noise and clutter.
arXiv Detail & Related papers (2022-01-06T05:40:16Z)
- Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds [118.54908665440826]
Humans can robustly recognize and localize objects by using visual and/or auditory cues.
This work develops an approach for scene understanding purely based on sounds.
The co-existence of visual and audio cues is leveraged for supervision transfer.
arXiv Detail & Related papers (2021-09-06T22:24:00Z)
- Visual Imitation Made Easy [102.36509665008732]
We present an alternate interface for imitation that simplifies the data collection process while allowing for easy transfer to robots.
We use commercially available reacher-grabber assistive tools both as a data collection device and as the robot's end-effector.
We experimentally evaluate on two challenging tasks: non-prehensile pushing and prehensile stacking, with 1000 diverse demonstrations for each task.
arXiv Detail & Related papers (2020-08-11T17:58:50Z)
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [106.87299276189458]
Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
arXiv Detail & Related papers (2020-03-09T15:49:01Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)