Deep Sensory Substitution: Noninvasively Enabling Biological Neural
Networks to Receive Input from Artificial Neural Networks
- URL: http://arxiv.org/abs/2005.13291v3
- Date: Wed, 25 Aug 2021 23:20:52 GMT
- Title: Deep Sensory Substitution: Noninvasively Enabling Biological Neural
Networks to Receive Input from Artificial Neural Networks
- Authors: Andrew Port, Chelhwon Kim, Mitesh Patel
- Abstract summary: This work describes a novel technique for leveraging machine-learned feature embeddings to sonify visual information into a perceptual audio domain.
A generative adversarial network (GAN) is then used to find a distance preserving map from this metric space of feature vectors into the metric space defined by a target audio dataset.
In human subject tests, users were able to accurately classify audio sonifications of faces.
- Score: 5.478764356647437
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As is expressed in the adage "a picture is worth a thousand words", when
using spoken language to communicate visual information, brevity can be a
challenge. This work describes a novel technique for leveraging machine-learned
feature embeddings to sonify visual (and other types of) information into a
perceptual audio domain, allowing users to perceive this information using only
their aural faculty. The system uses a pretrained image embedding network to
extract visual features and embed them in a compact subset of Euclidean space
-- this converts the images into feature vectors whose $L^2$ distances can be
used as a meaningful measure of similarity. A generative adversarial network
(GAN) is then used to find a distance preserving map from this metric space of
feature vectors into the metric space defined by a target audio dataset
equipped with either the Euclidean metric or a mel-frequency cepstrum-based
psychoacoustic distance metric. We demonstrate this technique by sonifying
images of faces into human speech-like audio. For both target audio metrics,
the GAN successfully found a metric preserving mapping, and in human subject
tests, users were able to accurately classify audio sonifications of faces.
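The training objective described in the abstract can be summarized in a minimal sketch (not the authors' released code). It assumes PyTorch, a hypothetical pretrained face-embedding network face_embedder, and a generator; the MFCC-based psychoacoustic metric mentioned above is abstracted behind a simple Euclidean stand-in.

    import torch
    import torch.nn.functional as F

    def pairwise_l2(x):
        # Pairwise Euclidean distances between rows of x: (B, D) -> (B, B).
        return torch.cdist(x, x, p=2)

    def euclidean_audio_metric(audio):
        # Simple stand-in for the target audio-space metric; the paper also
        # considers an MFCC-based psychoacoustic distance here.
        return torch.cdist(audio.flatten(1), audio.flatten(1), p=2)

    def distance_preservation_loss(images, face_embedder, generator,
                                   audio_metric=euclidean_audio_metric):
        # Penalize mismatch between pairwise distances of the image feature
        # vectors and pairwise distances of the corresponding generated audio.
        with torch.no_grad():
            z = face_embedder(images)   # (B, D) pretrained feature vectors (kept fixed)
        audio = generator(z)            # generated audio for each face
        d_img = pairwise_l2(z)
        d_audio = audio_metric(audio)
        return F.mse_loss(d_audio, d_img)

    # During GAN training this term would be added, with some weight, to the
    # usual adversarial generator loss:
    #   loss_G = adversarial_loss + lambda_dist * distance_preservation_loss(...)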
Related papers
- Geometry-Aware Multi-Task Learning for Binaural Audio Generation from
Video [94.42811508809994]
We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio.
Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process.
arXiv Detail & Related papers (2021-11-21T19:26:45Z)
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
By exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery, this is done in a completely label-free manner.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z)
- Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets [13.558688470594674]
We address voice activity detection in acoustic environments containing transient and stationary noise.
We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure.
A deep neural network is trained to separate speech from non-speech frames.
arXiv Detail & Related papers (2021-06-25T17:05:26Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations.
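For context, vector quantization of this kind typically replaces a continuous latent $z$ with its nearest codebook entry, $z_q = e_k$ with $k = \arg\min_j \|z - e_j\|_2$; the specific policies the paper enforces over these latent spaces are detailed in the full text.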
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Learning Representations from Audio-Visual Spatial Alignment [76.29670751012198]
We introduce a novel self-supervised pretext task for learning representations from audio-visual content.
The advantages of the proposed pretext task are demonstrated on a variety of audio and visual downstream tasks.
arXiv Detail & Related papers (2020-11-03T16:20:04Z)
- Self-supervised Neural Audio-Visual Sound Source Localization via Probabilistic Spatial Modeling [45.20508569656558]
This paper presents a self-supervised training method using 360° images and multichannel audio signals.
By incorporating the spatial information in multichannel audio signals, our method trains deep neural networks (DNNs) to distinguish multiple sound source objects.
We also demonstrate that the visual DNN detects objects, including talking visitors and specific exhibits, in real data recorded in a science museum.
arXiv Detail & Related papers (2020-07-28T03:52:53Z)
- Face-to-Music Translation Using a Distance-Preserving Generative Adversarial Network with an Auxiliary Discriminator [5.478764356647437]
We propose a distance-preserving generative adversarial model to translate images of human faces into an audio domain.
The audio domain is defined by a collection of musical note sounds recorded by 10 different instrument families.
To enforce distance preservation, a loss term is used that penalizes the difference between the pairwise distances of the faces and those of the translated audio samples.
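One plausible form of such a term is $\mathcal{L}_{\mathrm{dist}} = \mathbb{E}_{i,j}\big[\,\big|\,d_{\mathrm{face}}(x_i, x_j) - d_{\mathrm{audio}}(G(x_i), G(x_j))\,\big|\,\big]$, where $G$ is the generator and $d_{\mathrm{face}}$, $d_{\mathrm{audio}}$ are distances in the face-embedding and audio domains; the exact penalty (absolute versus squared difference) and its weighting follow the paper.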
arXiv Detail & Related papers (2020-06-24T04:17:40Z)
- Unsupervised Cross-Modal Audio Representation Learning from Unstructured Multilingual Text [69.55642178336953]
We present an approach to unsupervised audio representation learning.
Based on a triplet neural network architecture, we harness semantically related cross-modal information to estimate audio track-relatedness.
We show that our approach is invariant to the variety of annotation styles as well as to the different languages of this collection.
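For reference, the standard triplet objective underlying such architectures is $\mathcal{L} = \max\big(0,\, d(a, p) - d(a, n) + m\big)$ for an anchor track $a$, a semantically related (positive) track $p$, an unrelated (negative) track $n$, and a margin $m$; the paper's exact distance function and margin may differ.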
arXiv Detail & Related papers (2020-03-27T07:37:15Z)
- Unsupervised Learning of Audio Perception for Robotics Applications: Learning to Project Data to T-SNE/UMAP space [2.8935588665357077]
This paper builds on key ideas to develop a perception of touch sounds without access to any ground-truth data.
We show how ideas from classical signal processing can be leveraged to obtain large amounts of data for any sound of interest with high precision.
arXiv Detail & Related papers (2020-02-10T20:33:25Z)
- AudioMNIST: Exploring Explainable Artificial Intelligence for Audio Analysis on a Simple Benchmark [12.034688724153044]
This paper explores post-hoc explanations for deep neural networks in the audio domain.
We present a novel open-source audio dataset consisting of 30,000 audio samples of English spoken digits.
We demonstrate the superior interpretability of audible explanations over visual ones in a human user study.
arXiv Detail & Related papers (2018-07-09T23:11:17Z)