Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds
- URL: http://arxiv.org/abs/2109.02763v1
- Date: Mon, 6 Sep 2021 22:24:00 GMT
- Title: Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds
- Authors: Dengxin Dai, Arun Balajee Vasudevan, Jiri Matas, and Luc Van Gool
- Abstract summary: Humans can robustly recognize and localize objects by using visual and/or auditory cues.
This work develops an approach for scene understanding purely based on sounds.
The co-existence of visual and audio cues is leveraged for supervision transfer.
- Score: 118.54908665440826
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans can robustly recognize and localize objects by using visual and/or
auditory cues. While machines are able to do the same with visual data already,
less work has been done with sounds. This work develops an approach for scene
understanding purely based on binaural sounds. The considered tasks include
predicting the semantic masks of sound-making objects, the motion of
sound-making objects, and the depth map of the scene. To this aim, we propose a
novel sensor setup and record a new audio-visual dataset of street scenes with
eight professional binaural microphones and a 360-degree camera. The
co-existence of visual and audio cues is leveraged for supervision transfer. In
particular, we employ a cross-modal distillation framework that consists of
multiple vision teacher methods and a sound student method -- the student
method is trained to generate the same results as the teacher methods do. This
way, the auditory system can be trained without using human annotations. To
further boost the performance, we propose another novel auxiliary task, coined
Spatial Sound Super-Resolution, to increase the directional resolution of
sounds. We then formulate the four tasks into one end-to-end trainable
multi-tasking network aiming to boost the overall performance. Experimental
results show that 1) our method achieves good results for all four tasks, 2)
the four tasks are mutually beneficial -- training them together achieves the
best performance, 3) the number and orientation of microphones are both
important, and 4) features learned from the standard spectrogram and features
obtained by the classic signal processing pipeline are complementary for
auditory perception tasks. The data and code are released.
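As a rough illustration of the cross-modal distillation and multi-task training described in the abstract, the minimal PyTorch sketch below has a binaural "student" network with one head per task (semantics, depth, motion, and Spatial Sound Super-Resolution) regress the outputs of vision "teachers". All module names, shapes, loss choices, and weights are illustrative assumptions for this sketch, not the authors' released implementation; teacher predictions are assumed to be resampled to the student's output grid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinauralStudent(nn.Module):
    """Toy audio encoder with one head per task (semantics, depth, motion, S3R)."""
    def __init__(self, n_mics=2, n_classes=8, n_s3r_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(n_mics, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
        )
        # Dense prediction heads; a real model would upsample to the panorama size.
        self.semantic_head = nn.Conv2d(64, n_classes, 1)
        self.depth_head = nn.Conv2d(64, 1, 1)
        self.motion_head = nn.Conv2d(64, 2, 1)
        # Spatial Sound Super-Resolution: predict the held-out microphone channels.
        self.s3r_head = nn.Conv2d(64, n_s3r_channels - n_mics, 1)

    def forward(self, spec):
        feat = self.encoder(spec)
        return (self.semantic_head(feat), self.depth_head(feat),
                self.motion_head(feat), self.s3r_head(feat))

def distillation_step(student, spec, teacher_sem, teacher_depth, teacher_motion,
                      target_extra_channels, weights=(1.0, 1.0, 1.0, 1.0)):
    """One multi-task step: the student mimics the vision teachers' outputs
    and additionally solves the S3R auxiliary task (assumed loss choices)."""
    sem, depth, motion, s3r = student(spec)
    loss_sem = F.kl_div(F.log_softmax(sem, dim=1),
                        F.softmax(teacher_sem, dim=1), reduction="batchmean")
    loss_depth = F.l1_loss(depth, teacher_depth)
    loss_motion = F.l1_loss(motion, teacher_motion)
    loss_s3r = F.l1_loss(s3r, target_extra_channels)
    w = weights
    return w[0]*loss_sem + w[1]*loss_depth + w[2]*loss_motion + w[3]*loss_s3r

# Hypothetical usage with random tensors standing in for spectrograms and teacher outputs.
student = BinauralStudent()
spec = torch.randn(4, 2, 64, 64)               # (batch, mics, freq, time)
t_sem = torch.randn(4, 8, 64, 64)              # teacher semantic logits
t_depth, t_motion = torch.randn(4, 1, 64, 64), torch.randn(4, 2, 64, 64)
extra = torch.randn(4, 6, 64, 64)              # held-out microphone spectrograms
loss = distillation_step(student, spec, t_sem, t_depth, t_motion, extra)
loss.backward()
```

Since no human annotations enter the loss, the supervision comes entirely from the vision teachers and from the recorded extra microphone channels, which is the point of the cross-modal transfer described above.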
Related papers
- Audio Representation Learning by Distilling Video as Privileged Information [25.71206255965502]
We propose a novel approach for deep audio representation learning using audio-visual data when the video modality is absent at inference.
We adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI).
We show considerable improvements over audio-only recognition as well as prior works that use LUPI.
arXiv Detail & Related papers (2023-02-06T15:09:34Z)
- Play it by Ear: Learning Skills amidst Occlusion through Audio-Visual Imitation Learning [62.83590925557013]
We learn a set of challenging partially-observed manipulation tasks from visual and audio inputs.
Our proposed system learns these tasks by combining offline imitation learning from tele-operated demonstrations and online finetuning.
In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by 20%.
arXiv Detail & Related papers (2022-05-30T04:52:58Z)
- Learning music audio representations via weak language supervision [14.335950077921435]
We design a multimodal architecture for music and language pre-training (MuLaP) optimised via a set of proxy tasks.
Weak supervision is provided in the form of noisy natural language descriptions conveying the overall musical content of the track.
We demonstrate the usefulness of our approach by comparing the performance of audio representations produced by the same audio backbone with different training strategies.
arXiv Detail & Related papers (2021-12-08T10:30:52Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation [52.550684208734324]
We propose a cyclic co-learning paradigm that can jointly learn sounding object visual grounding and audio-visual sound separation.
We show that the proposed framework outperforms recent approaches on both tasks.
arXiv Detail & Related papers (2021-04-05T17:30:41Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
The underlying correlation between audio and visual events can be utilized as free supervisory information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Learning Speech Representations from Raw Audio by Joint Audiovisual Self-Supervision [63.564385139097624]
We propose a method to learn self-supervised speech representations from the raw audio waveform.
We train a raw audio encoder by combining audio-only self-supervision (by predicting informative audio attributes) with visual self-supervision (by generating talking faces from audio).
Our results demonstrate the potential of multimodal self-supervision in audiovisual speech for learning good audio representations.
arXiv Detail & Related papers (2020-07-08T14:07:06Z)
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [106.87299276189458]
Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
arXiv Detail & Related papers (2020-03-09T15:49:01Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)