Swoosh! Rattle! Thump! -- Actions that Sound
- URL: http://arxiv.org/abs/2007.01851v1
- Date: Fri, 3 Jul 2020 17:57:54 GMT
- Title: Swoosh! Rattle! Thump! -- Actions that Sound
- Authors: Dhiraj Gandhi, Abhinav Gupta, Lerrel Pinto
- Abstract summary: This work is the first large-scale study of the interactions between sound and robotic action.
We create the largest available sound-action-vision dataset with 15,000 interactions on 60 objects using our robotic platform Tilt-Bot.
Sound is indicative of fine-grained object class information, e.g., sound can differentiate a metal screwdriver from a metal wrench.
- Score: 38.59779002672538
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Truly intelligent agents need to capture the interplay of all their senses to
build a rich physical understanding of their world. In robotics, we have seen
tremendous progress in using visual and tactile perception; however, we have
often ignored a key sense: sound. This is primarily due to the lack of data
that captures the interplay of action and sound. In this work, we perform the
first large-scale study of the interactions between sound and robotic action.
To do this, we create the largest available sound-action-vision dataset with
15,000 interactions on 60 objects using our robotic platform Tilt-Bot. By
tilting objects and allowing them to crash into the walls of a robotic tray, we
collect rich four-channel audio information. Using this data, we explore the
synergies between sound and action and present three key insights. First, sound
is indicative of fine-grained object class information, e.g., sound can
differentiate a metal screwdriver from a metal wrench. Second, sound also
contains information about the causal effects of an action, i.e. given the
sound produced, we can predict what action was applied to the object. Finally,
object representations derived from audio embeddings are indicative of implicit
physical properties. We demonstrate that on previously unseen objects, audio
embeddings generated through interactions can predict forward models 24% better
than passive visual embeddings. Project videos and data are at
https://dhiraj100892.github.io/swoosh/
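The second insight, predicting which action was applied from the sound it produced, can be illustrated with a small supervised sketch. The snippet below is a minimal illustration under stated assumptions, not the authors' released code: the network architecture, the 2-D tilt parameterization, the sample rate, and all names (TiltBotAudioRegressor, N_MELS, etc.) are hypothetical, with PyTorch/torchaudio standing in for whatever stack the paper actually uses.

```python
# Hypothetical sketch: regressing the applied tilt action from Tilt-Bot audio.
# All names and dimensions here are illustrative assumptions.
import torch
import torch.nn as nn
import torchaudio

N_CHANNELS = 4        # four-channel audio from the tray-mounted microphones
SAMPLE_RATE = 16000   # assumed sample rate
N_MELS = 64
ACTION_DIM = 2        # assumed 2-D tilt action (roll, pitch)

# Log-mel front end, applied independently to each microphone channel.
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=256, n_mels=N_MELS
)

class TiltBotAudioRegressor(nn.Module):
    """Small CNN mapping 4-channel log-mel spectrograms to an action estimate."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(N_CHANNELS, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, ACTION_DIM)

    def forward(self, waveform):           # waveform: (batch, 4, samples)
        spec = torch.log1p(mel(waveform))  # (batch, 4, n_mels, frames)
        feat = self.conv(spec).flatten(1)  # (batch, 64)
        return self.head(feat)             # predicted action

# One training step: minimize MSE between the predicted and applied tilt.
model = TiltBotAudioRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
waveform = torch.randn(8, N_CHANNELS, SAMPLE_RATE)  # stand-in batch
action = torch.randn(8, ACTION_DIM)                 # stand-in applied tilts
opt.zero_grad()
loss = nn.functional.mse_loss(model(waveform), action)
loss.backward()
opt.step()
```

The same spectrogram front end could feed a classification head for the first insight (fine-grained object class from sound); only the output layer and loss would change.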
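The third insight, that audio-derived object embeddings make better forward models than passive visual embeddings, boils down to conditioning a dynamics predictor on an object embedding and comparing prediction error on held-out objects. The sketch below shows the shape of that comparison; the MLP, the embedding and action dimensions, and the displacement target are assumptions, not the paper's exact setup.

```python
# Hypothetical sketch of the forward-model comparison: given an object
# embedding (audio-derived or visual) plus an action, predict the object's
# resulting displacement. Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

EMBED_DIM = 128   # assumed size of the audio or visual object embedding
ACTION_DIM = 2    # assumed planar action parameterization
STATE_DIM = 2     # assumed (dx, dy) displacement of the object

class ForwardModel(nn.Module):
    """MLP predicting object motion from (object embedding, action)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(EMBED_DIM + ACTION_DIM, 256), nn.ReLU(),
            nn.Linear(256, STATE_DIM),
        )

    def forward(self, obj_embedding, action):
        return self.net(torch.cat([obj_embedding, action], dim=-1))

# The paper's comparison amounts to training one such model on audio
# embeddings and another on passive visual embeddings, then measuring
# prediction error on previously unseen objects (audio reported ~24% better).
audio_fm, visual_fm = ForwardModel(), ForwardModel()
```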
Related papers
- ANAVI: Audio Noise Awareness using Visuals of Indoor environments for NAVIgation [26.460679530665487]
We propose Audio Noise Awareness using Visuals of Indoor environments for NAVIgation (ANAVI) for quieter robot path planning.
We generate data on how loud an 'impulse' sounds at different listener locations in simulated homes, and train our Acoustic Noise Predictor (ANP).
Unifying ANP with action acoustics, we demonstrate experiments with wheeled (Hello Robot Stretch) and legged (Unitree Go2) robots, showing that these robots adhere to the noise constraints of the environment.
arXiv Detail & Related papers (2024-10-24T17:19:53Z)
- You Only Speak Once to See [24.889319740761827]
Grounding objects in images using visual cues is a well-established approach in computer vision.
We introduce YOSS, "You Only Speak Once to See," to leverage audio for grounding objects in visual scenes.
Experimental results indicate that audio guidance can be effectively applied to object grounding.
arXiv Detail & Related papers (2024-09-27T01:16:15Z)
- ManiWAV: Learning Robot Manipulation from In-the-Wild Audio-Visual Data [28.36623343236893]
We introduce ManiWAV: an 'ear-in-hand' data collection device to collect in-the-wild human demonstrations with synchronous audio and visual feedback.
We show that our system can generalize to unseen in-the-wild environments by learning from diverse in-the-wild human demonstrations.
arXiv Detail & Related papers (2024-06-27T18:06:38Z)
- Hearing Touch: Audio-Visual Pretraining for Contact-Rich Manipulation [13.026061233933435]
Current paradigms only perform large-scale pretraining for visual representations.
It is unclear what relevant internet-scale data may be used for pretraining other modalities such as tactile sensing.
In this paper, we address this gap by using contact microphones as an alternative tactile sensor.
arXiv Detail & Related papers (2024-05-14T13:16:46Z)
- Open-World Object Manipulation using Pre-trained Vision-Language Models [72.87306011500084]
For robots to follow instructions from people, they must be able to connect the rich semantic information in human vocabulary to their sensory observations and actions.
We develop a simple approach, MOO, which leverages a pre-trained vision-language model to extract object-identifying information.
In a variety of experiments on a real mobile manipulator, we find that MOO generalizes zero-shot to a wide range of novel object categories and environments.
arXiv Detail & Related papers (2023-03-02T01:55:10Z)
- Epic-Sounds: A Large-scale Dataset of Actions That Sound [64.24297230981168]
Epic-Sounds is a large-scale dataset of audio annotations capturing temporal extents and class labels.
We identify actions that can be discriminated purely from audio by grouping free-form descriptions of the audio into classes.
Overall, Epic-Sounds includes 78.4k categorised segments of audible events and actions, distributed across 44 classes as well as 39.2k non-categorised segments.
arXiv Detail & Related papers (2023-02-01T18:19:37Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Move2Hear: Active Audio-Visual Source Separation [90.16327303008224]
We introduce the active audio-visual source separation problem, where an agent must move intelligently in order to better isolate the sounds coming from an object of interest.
We introduce a reinforcement learning approach that trains movement policies controlling the agent's camera and microphone placement over time.
We demonstrate our model's ability to find minimal movement sequences with maximal payoff for audio source separation.
arXiv Detail & Related papers (2021-05-15T04:58:08Z)
- Learning to Set Waypoints for Audio-Visual Navigation [89.42192208471735]
In audio-visual navigation, an agent intelligently travels through a complex, unmapped 3D environment using both sights and sounds to find a sound source.
Existing models learn to act at a fixed granularity of agent motion and rely on simple recurrent aggregations of the audio observations.
We introduce a reinforcement learning approach to audio-visual navigation with two key novel elements.
arXiv Detail & Related papers (2020-08-21T18:00:33Z)
- Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space [2.8935588665357077]
This paper builds on key ideas to develop perception of touch sounds without access to any ground-truth data.
We show how we can leverage ideas from classical signal processing to obtain large amounts of data for any sound of interest with high precision.
arXiv Detail & Related papers (2020-02-10T20:33:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented (including all generated summaries) and is not responsible for any consequences arising from its use.