That Sounds Right: Auditory Self-Supervision for Dynamic Robot
Manipulation
- URL: http://arxiv.org/abs/2210.01116v1
- Date: Mon, 3 Oct 2022 17:57:09 GMT
- Title: That Sounds Right: Auditory Self-Supervision for Dynamic Robot
Manipulation
- Authors: Abitha Thankaraj and Lerrel Pinto
- Abstract summary: We propose a data-centric approach to dynamic manipulation that uses an often ignored source of information: sound.
We first collect a dataset of 25k interaction-sound pairs across five dynamic tasks using commodity contact microphones.
We then leverage self-supervised learning to accelerate behavior prediction from sound.
- Score: 19.051800747558794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to produce contact-rich, dynamic behaviors from raw sensory data has
been a longstanding challenge in robotics. Prominent approaches primarily focus
on using visual or tactile sensing, where unfortunately one fails to capture
high-frequency interaction, while the other can be too delicate for large-scale
data collection. In this work, we propose a data-centric approach to dynamic
manipulation that uses an often ignored source of information: sound. We first
collect a dataset of 25k interaction-sound pairs across five dynamic tasks
using commodity contact microphones. Then, given this data, we leverage
self-supervised learning to accelerate behavior prediction from sound. Our
experiments indicate that this self-supervised 'pretraining' is crucial to
achieving high performance, with a 34.5% lower MSE than plain supervised
learning and a 54.3% lower MSE over visual training. Importantly, we find that
when asked to generate desired sound profiles, online rollouts of our models on
a UR10 robot can produce dynamic behavior that achieves an average of 11.5%
improvement over supervised learning on audio similarity metrics.
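To make the two-stage recipe in the abstract concrete, below is a minimal, hypothetical sketch: self-supervised pretraining of an audio encoder on contact-microphone spectrograms, followed by supervised fine-tuning that regresses behavior parameters from sound. The architecture, the SimCLR-style contrastive objective, the augmentations, and all shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioEncoder(nn.Module):
    """Tiny CNN over log-mel spectrograms shaped (B, 1, n_mels, T)."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, x):
        return self.net(x)

def nt_xent(z1, z2, temperature=0.1):
    """SimCLR-style contrastive loss between two augmented views of a batch."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                      # (2B, D)
    sim = z @ z.t() / temperature                       # (2B, 2B)
    mask = torch.eye(z.shape[0], dtype=torch.bool)
    sim = sim.masked_fill(mask, float("-inf"))          # drop self-pairs
    b = z1.shape[0]
    targets = torch.cat([torch.arange(b, 2 * b), torch.arange(0, b)])
    return F.cross_entropy(sim, targets)

encoder = AudioEncoder()
action_head = nn.Linear(128, 4)          # 4 behavior parameters: an assumption

# Stage 1: self-supervised pretraining on unlabeled contact-microphone clips.
opt = torch.optim.Adam(encoder.parameters(), lr=1e-4)
spec = torch.randn(16, 1, 64, 100)              # placeholder spectrogram batch
view_a = spec + 0.05 * torch.randn_like(spec)   # noise augmentation
view_b = spec.roll(shifts=5, dims=-1)           # time-shift augmentation
opt.zero_grad()
nt_xent(encoder(view_a), encoder(view_b)).backward()
opt.step()

# Stage 2: supervised fine-tuning on (sound, action) pairs with an MSE loss.
params = list(encoder.parameters()) + list(action_head.parameters())
opt = torch.optim.Adam(params, lr=1e-4)
actions = torch.randn(16, 4)                    # placeholder action labels
opt.zero_grad()
F.mse_loss(action_head(encoder(spec)), actions).backward()
opt.step()
```

The reported MSE gains compare exactly this kind of pretrain-then-finetune pipeline against training the same regression head from scratch.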
Related papers
- Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Datasets [24.77850617214567]
We propose a foundation representation learning framework that captures both the visual features and the dynamics information, such as actions and proprioceptive states, of manipulation tasks.
Specifically, we pre-train a visual encoder on the DROID robotic dataset and leverage motion-relevant data such as robot proprioceptive states and actions.
We introduce a novel contrastive loss that aligns visual observations with the robot's proprioceptive state-action dynamics, combined with a behavior cloning (BC)-like actor loss that predicts actions during pre-training and a time contrastive loss.
arXiv Detail & Related papers (2024-10-29T17:58:13Z)
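The entry above pairs a contrastive alignment loss with a BC-like actor loss. The snippet below is a rough sketch, under assumed module sizes and input dimensions, of how such an alignment between visual features and proprioceptive state-action dynamics might be written; it is not the paper's actual architecture or training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed input sizes: 64x64 RGB observations, 7-D proprioception, 7-D actions.
visual_encoder = nn.Sequential(
    nn.Flatten(), nn.Linear(3 * 64 * 64, 256), nn.ReLU(), nn.Linear(256, 128)
)
dynamics_encoder = nn.Sequential(
    nn.Linear(7 + 7, 128), nn.ReLU(), nn.Linear(128, 128)
)

def info_nce(query, key, temperature=0.1):
    """Each visual embedding should best match its own state-action embedding."""
    q, k = F.normalize(query, dim=1), F.normalize(key, dim=1)
    logits = q @ k.t() / temperature          # (B, B); positives on the diagonal
    labels = torch.arange(q.shape[0])
    return F.cross_entropy(logits, labels)

images = torch.randn(32, 3, 64, 64)           # placeholder camera observations
proprio = torch.randn(32, 7)                  # placeholder joint states
actions = torch.randn(32, 7)                  # placeholder action commands

z_img = visual_encoder(images)
z_dyn = dynamics_encoder(torch.cat([proprio, actions], dim=1))
alignment_loss = info_nce(z_img, z_dyn)

# A BC-like actor head that regresses actions from the visual embedding; the
# paper additionally uses a time contrastive term, omitted here for brevity.
bc_head = nn.Linear(128, 7)
bc_loss = F.mse_loss(bc_head(z_img), actions)
total_loss = alignment_loss + bc_loss
```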
- Learning Manipulation by Predicting Interaction [85.57297574510507]
We propose a general pre-training pipeline that learns Manipulation by Predicting the Interaction (MPI).
Experimental results demonstrate that MPI improves over the previous state of the art by 10% to 64% on real-world robot platforms.
arXiv Detail & Related papers (2024-06-01T13:28:31Z)
- Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation [13.026061233933435]
Current paradigms only perform large-scale pretraining for visual representations.
It is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing.
In this paper, we address this gap by using contact microphones as an alternative tactile sensor.
arXiv Detail & Related papers (2024-05-14T13:16:46Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- Self-Supervised Learning for Audio-Based Emotion Recognition [1.7598252755538808]
Self-supervised learning is a family of methods which can learn despite a scarcity of supervised labels.
We apply self-supervised pre-training to the classification of emotions from the acoustic modality of CMU-MOSEI.
We find that self-supervised learning consistently improves the performance of the model across all metrics.
arXiv Detail & Related papers (2023-07-23T14:40:50Z)
- Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning [62.83590925557013]
We learn a set of challenging partially-observed manipulation tasks from visual and audio inputs.
Our proposed system learns these tasks by combining offline imitation learning from tele-operated demonstrations and online finetuning.
In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by 20%.
arXiv Detail & Related papers (2022-05-30T04:52:58Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Federated Self-Training for Semi-Supervised Audio Recognition [0.23633885460047763]
In this work, we study the problem of semi-supervised learning of audio models via self-training.
We propose FedSTAR to exploit large-scale on-device unlabeled data to improve the generalization of audio recognition models.
arXiv Detail & Related papers (2021-07-14T17:40:10Z)
- Recognizing More Emotions with Less Data Using Self-supervised Transfer Learning [0.0]
We propose a novel transfer learning method for speech emotion recognition.
With as low as 125 examples per emotion class, we were able to reach a higher accuracy than a strong baseline trained on 8 times more data.
arXiv Detail & Related papers (2020-11-11T06:18:31Z)
- Noisy Agents: Self-supervised Exploration by Predicting Auditory Events [127.82594819117753]
We propose a novel type of intrinsic motivation for Reinforcement Learning (RL) that encourages the agent to understand the causal effect of its actions.
We train a neural network to predict the auditory events and use the prediction errors as intrinsic rewards to guide RL exploration.
Experimental results on Atari games show that our new intrinsic motivation significantly outperforms several state-of-the-art baselines.
arXiv Detail & Related papers (2020-07-27T17:59:08Z)
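The Noisy Agents entry above turns auditory-event prediction error into a curiosity signal. A minimal sketch of that idea, assuming discrete sound-event labels and placeholder network sizes, none of which are taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_EVENTS = 8                                   # hypothetical number of sound-event classes
STATE_DIM, ACTION_DIM = 16, 4                  # placeholder dimensions

# Predicts which auditory event a (state, action) pair will cause.
predictor = nn.Sequential(
    nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, N_EVENTS)
)

def intrinsic_reward(state, action, observed_event, scale=0.1):
    """Prediction error on the observed auditory event, used as a curiosity bonus."""
    logits = predictor(torch.cat([state, action], dim=-1))
    error = F.cross_entropy(logits, observed_event, reduction="none")
    return scale * error.detach()              # the reward itself is not backpropagated

# One placeholder environment step.
state = torch.randn(1, STATE_DIM)
action = torch.randn(1, ACTION_DIM)
event = torch.randint(0, N_EVENTS, (1,))       # label from an audio event classifier
bonus = intrinsic_reward(state, action, event) # added to the extrinsic reward

# The predictor is trained on the same objective, so the bonus shrinks for
# events the agent has already learned to anticipate.
pred_loss = F.cross_entropy(predictor(torch.cat([state, action], dim=-1)), event)
```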
- Learning Predictive Models From Observation and Interaction [137.77887825854768]
Learning predictive models from interaction with the world allows an agent, such as a robot, to learn about how the world works.
However, learning a model that captures the dynamics of complex skills represents a major challenge.
We propose a method to augment the training set with observational data of other agents, such as humans.
arXiv Detail & Related papers (2019-12-30T01:10:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.