Semantic Object Prediction and Spatial Sound Super-Resolution with
Binaural Sounds
- URL: http://arxiv.org/abs/2003.04210v1
- Date: Mon, 9 Mar 2020 15:49:01 GMT
- Title: Semantic Object Prediction and Spatial Sound Super-Resolution with
Binaural Sounds
- Authors: Arun Balajee Vasudevan, Dengxin Dai, Luc Van Gool
- Abstract summary: Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
- Score: 106.87299276189458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Humans can robustly recognize and localize objects by integrating visual and
auditory cues. While machines are able to do the same now with images, less
work has been done with sounds. This work develops an approach for dense
semantic labelling of sound-making objects, purely based on binaural sounds. We
propose a novel sensor setup and record a new audio-visual dataset of street
scenes with eight professional binaural microphones and a 360 degree camera.
The co-existence of visual and audio cues is leveraged for supervision
transfer. In particular, we employ a cross-modal distillation framework that
consists of a vision `teacher' method and a sound `student' method -- the
student method is trained to generate the same results as the teacher method.
This way, the auditory system can be trained without using human annotations.
We also propose two auxiliary tasks, namely a) a novel task of Spatial Sound
Super-resolution to increase the spatial resolution of sounds, and b) dense
depth prediction of the scene. We then formulate the three tasks into one
end-to-end trainable multi-tasking network aiming to boost the overall
performance. Experimental results on the dataset show that 1) our method
achieves promising results for semantic prediction and the two auxiliary tasks;
2) the three tasks are mutually beneficial, as training them together
achieves the best performance; and 3) the number and orientations of the
microphones are both important. The data and code will be released to facilitate the
research in this new direction.
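
The abstract describes a cross-modal distillation setup in which a frozen vision "teacher" labels the 360-degree frames and an audio "student" is trained to reproduce those labels from binaural spectrograms. Below is a minimal PyTorch sketch of that supervision-transfer step; the module names, tensor shapes, and the KL-divergence objective are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioSegmentationStudent(nn.Module):
    """Toy encoder that maps binaural spectrograms to per-pixel class
    scores (hypothetical architecture, not the paper's network)."""

    def __init__(self, in_channels: int = 2, num_classes: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, num_classes, 1)

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        logits = self.head(self.encoder(spectrogram))
        # Resize to the teacher's output resolution so predictions align.
        return F.interpolate(logits, size=(64, 128), mode="bilinear",
                             align_corners=False)


def distillation_step(student, teacher_probs, spectrogram, optimizer):
    """One supervision-transfer step: the student mimics the frozen vision
    teacher's soft per-pixel labels, so no human annotations are needed."""
    log_probs = F.log_softmax(student(spectrogram), dim=1)
    loss = F.kl_div(log_probs, teacher_probs, reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()


if __name__ == "__main__":
    student = AudioSegmentationStudent()
    opt = torch.optim.Adam(student.parameters(), lr=1e-4)
    spectrogram = torch.randn(4, 2, 257, 200)  # dummy binaural spectrograms
    # Dummy soft labels standing in for the vision teacher's output.
    teacher_probs = torch.softmax(torch.randn(4, 3, 64, 128), dim=1)
    print(distillation_step(student, teacher_probs, spectrogram, opt))
```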
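
The abstract also formulates semantic prediction, spatial sound super-resolution, and depth prediction as one end-to-end multi-task network. The sketch below shows one plausible way to share an audio encoder across three task heads with a weighted loss; the layer sizes, loss weights, and output resolutions (all heads predict at the spectrogram resolution here, unlike the paper's image-plane outputs) are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskBinauralNet(nn.Module):
    """Shared audio encoder with three task heads: semantic prediction,
    dense depth, and spatial sound super-resolution (regressing the
    spectrograms of microphones withheld from the input)."""

    def __init__(self, in_mics: int = 2, out_mics: int = 8, num_classes: int = 3):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_mics, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.semantic_head = nn.Conv2d(64, num_classes, 1)
        self.depth_head = nn.Conv2d(64, 1, 1)
        self.s3r_head = nn.Conv2d(64, out_mics - in_mics, 1)

    def forward(self, spec):
        feats = self.encoder(spec)
        return {
            "semantics": self.semantic_head(feats),
            "depth": self.depth_head(feats),
            "s3r": self.s3r_head(feats),
        }


def multitask_loss(outputs, targets, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses; the weights are assumptions."""
    sem = F.cross_entropy(outputs["semantics"], targets["semantics"])
    depth = F.l1_loss(outputs["depth"], targets["depth"])
    s3r = F.mse_loss(outputs["s3r"], targets["s3r"])
    return weights[0] * sem + weights[1] * depth + weights[2] * s3r


if __name__ == "__main__":
    net = MultiTaskBinauralNet()
    spec = torch.randn(4, 2, 64, 100)  # dummy binaural spectrograms
    targets = {
        "semantics": torch.randint(0, 3, (4, 64, 100)),  # class indices
        "depth": torch.rand(4, 1, 64, 100),
        "s3r": torch.randn(4, 6, 64, 100),  # 8 - 2 = 6 withheld channels
    }
    print(multitask_loss(net(spec), targets).item())
```

A shared encoder with per-task heads is one common way to realize the abstract's claim that the three tasks are mutually beneficial when trained together, though the actual architecture and weighting scheme are described in the paper itself.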
Related papers
- Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos [69.79632907349489]
We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked (multi-channel) audio through the synergy of audio and vision.
arXiv Detail & Related papers (2023-07-10T17:58:17Z)
- Audio Representation Learning by Distilling Video as Privileged Information [25.71206255965502]
We propose a novel approach for deep audio representation learning using audio-visual data when the video modality is absent at inference.
We adopt teacher-student knowledge distillation under the framework of learning using privileged information (LUPI)
We show considerable improvements over sole audio-based recognition as well as prior works that use LUPI.
arXiv Detail & Related papers (2023-02-06T15:09:34Z)
- Jointly Learning Visual and Auditory Speech Representations from Raw Data [108.68531445641769]
RAVEn is a self-supervised multi-modal approach to jointly learn visual and auditory speech representations.
Our design is asymmetric with respect to the two modalities, driven by the inherent differences between video and audio.
RAVEn surpasses all self-supervised methods on visual speech recognition.
arXiv Detail & Related papers (2022-12-12T21:04:06Z)
- Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks [20.316239155843963]
We propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks.
On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset.
arXiv Detail & Related papers (2021-10-14T12:32:40Z)
- Binaural SoundNet: Predicting Semantics, Depth and Motion with Binaural Sounds [118.54908665440826]
Humans can robustly recognize and localize objects by using visual and/or auditory cues.
This work develops an approach for scene understanding purely based on sounds.
The co-existence of visual and audio cues is leveraged for supervision transfer.
arXiv Detail & Related papers (2021-09-06T22:24:00Z)
- Cyclic Co-Learning of Sounding Object Visual Grounding and Sound Separation [52.550684208734324]
We propose a cyclic co-learning paradigm that can jointly learn sounding object visual grounding and audio-visual sound separation.
In this paper, we show that the proposed framework outperforms the compared recent approaches on both tasks.
arXiv Detail & Related papers (2021-04-05T17:30:41Z)
- Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning [17.6311804187027]
An underlying correlation between audio and visual events can be utilized as free supervised information to train a neural network.
We propose a novel self-supervised framework with co-attention mechanism to learn generic cross-modal representations from unlabelled videos.
Experiments show that our model achieves state-of-the-art performance on the pretext task while having fewer parameters compared with existing methods.
arXiv Detail & Related papers (2020-08-13T10:08:12Z)
- Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space [2.8935588665357077]
This paper builds on key ideas to develop perception of touch sounds without access to any ground-truth data.
We show how to leverage ideas from classical signal processing to obtain large amounts of data for any sound of interest with high precision.
arXiv Detail & Related papers (2020-02-10T20:33:25Z)
- Deep Audio-Visual Learning: A Survey [53.487938108404244]
We divide the current audio-visual learning tasks into four different subfields.
We discuss state-of-the-art methods as well as the remaining challenges of each subfield.
We summarize the commonly used datasets and performance metrics.
arXiv Detail & Related papers (2020-01-14T13:11:21Z)