Unsupervised Learning of Audio Perception for Robotics Applications:
Learning to Project Data to T-SNE/UMAP space
- URL: http://arxiv.org/abs/2002.04076v1
- Date: Mon, 10 Feb 2020 20:33:25 GMT
- Title: Unsupervised Learning of Audio Perception for Robotics Applications:
Learning to Project Data to T-SNE/UMAP space
- Authors: Prateek Verma, Kenneth Salisbury
- Abstract summary: This paper builds upon key ideas to build perception of touch sounds without access to any ground-truth data.
We show how we can leverage ideas from classical signal processing to get large amounts of data of any sound of interest with high precision.
- Score: 2.8935588665357077
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio perception is key to solving a variety of problems ranging from
acoustic scene analysis, music metadata extraction, and recommendation to synthesis
and analysis. It can potentially also augment computers in doing tasks that
humans do effortlessly in day-to-day activities. This paper builds upon key
ideas to build perception of touch sounds without access to any ground-truth
data. We show how we can leverage ideas from classical signal processing to obtain
large amounts of data for any sound of interest with high precision. These
sounds are then used, along with images, to map the sounds to a clustered
space of the latent representation of these images. This approach not only
allows us to learn a semantic representation of the possible sounds of interest,
but also allows association of different modalities with the learned
distinctions. The model trained to map sounds to this clustered representation
gives reasonable performance, as opposed to expensive methods that collect large
amounts of human-annotated data. Such approaches can be used to build a
state-of-the-art perceptual model for any sound of interest described using a few
signal processing features. Daisy-chaining high-precision sound event detectors built
with signal processing, combined with neural architectures and high-dimensional
clustering of unlabelled data, is a vastly powerful idea and can be explored in
a variety of ways in the future.
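As a rough illustration of the pipeline the abstract describes, the sketch below pairs a crude energy-threshold event detector (a stand-in for the classical signal-processing detectors the authors mention) with a UMAP projection of image latents, k-means pseudo-labels, and an audio classifier trained on those pseudo-labels. This is a minimal sketch under stated assumptions, not the authors' implementation: all variable names, feature dimensions, and the random toy data are hypothetical.

```python
"""
Minimal sketch of the pipeline the abstract describes; not the authors' code.
All names, feature sizes, and the toy data below are assumptions made purely
for illustration.
"""
import numpy as np
import umap                                      # pip install umap-learn
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)


def detect_events(signal, sr, frame_len=1024, hop=512, thresh_db=-30.0):
    """Return onset times (s) of frames whose short-time energy exceeds a dB threshold.

    A crude stand-in for the "classical signal processing" detectors in the paper:
    raising `thresh_db` trades recall for precision, the high-precision regime used
    to harvest clean, unlabelled examples of a sound of interest.
    """
    frames = np.lib.stride_tricks.sliding_window_view(signal, frame_len)[::hop]
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return np.flatnonzero(energy_db > thresh_db) * hop / sr


# Example: find the loud burst in a toy "tap" recording (silence - burst - silence).
toy = np.concatenate([np.zeros(4000), 0.5 * rng.standard_normal(2000), np.zeros(4000)])
print("detected event times (s):", detect_events(toy, sr=16000))

# --- toy stand-ins for data the pipeline assumes you already have -----------
n_events = 500
image_latents = rng.normal(size=(n_events, 128))    # hypothetical image embeddings
audio_features = rng.normal(size=(n_events, 64))    # hypothetical audio features

# 1) Project the image latent space with UMAP (t-SNE could be swapped in).
image_2d = umap.UMAP(n_components=2, random_state=0).fit_transform(image_latents)

# 2) Discretize the projected space; cluster ids act as pseudo-labels.
pseudo_labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(image_2d)

# 3) Train an audio model to predict, from sound alone, the cluster of the
#    co-occurring image frame; no human annotation anywhere in the loop.
clf = MLPClassifier(hidden_layer_sizes=(256,), max_iter=300, random_state=0)
clf.fit(audio_features, pseudo_labels)
print("train accuracy vs. pseudo-labels:", clf.score(audio_features, pseudo_labels))
```

In a real setup the toy arrays would be replaced by embeddings of video frames around each detected event and spectral features of the corresponding audio clips; the cluster count and projection dimensionality are free choices.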
Related papers
- AV-GS: Learning Material and Geometry Aware Priors for Novel View Acoustic Synthesis [62.33446681243413]
Novel view acoustic synthesis aims to render audio at any target viewpoint, given mono audio emitted by a sound source in a 3D scene.
Existing methods have proposed NeRF-based implicit models to exploit visual cues as a condition for synthesizing audio.
We propose a novel Audio-Visual Gaussian Splatting (AV-GS) model to characterize the entire scene environment.
Experiments validate the superiority of our AV-GS over existing alternatives on the real-world RWAS and simulation-based SoundSpaces datasets.
arXiv Detail & Related papers (2024-06-13T08:34:12Z)
- Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark [65.79402756995084]
Real Acoustic Fields (RAF) is a new dataset that captures real acoustic room data from multiple modalities.
RAF is the first dataset to provide densely captured room acoustic data.
arXiv Detail & Related papers (2024-03-27T17:59:56Z)
- Novel-View Acoustic Synthesis [140.1107768313269]
We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint?
We propose a neural rendering approach, the Visually-Guided Acoustic Synthesis (ViGAS) network, which learns to synthesize the sound of an arbitrary point in space.
arXiv Detail & Related papers (2023-01-20T18:49:58Z)
- Metric-based multimodal meta-learning for human movement identification via footstep recognition [3.300376360949452]
We describe a novel metric-based learning approach that introduces a multimodal framework.
We learn general-purpose representations from low multisensory data obtained from omnipresent sensing systems.
We employ a metric-based contrastive learning approach for multi-sensor data to mitigate the impact of data scarcity.
arXiv Detail & Related papers (2021-11-15T18:46:14Z)
- Self-supervised Audiovisual Representation Learning for Remote Sensing Data [96.23611272637943]
We propose a self-supervised approach for pre-training deep neural networks in remote sensing.
By exploiting the correspondence between geo-tagged audio recordings and remote sensing imagery, this is done in a completely label-free manner.
We show that our approach outperforms existing pre-training strategies for remote sensing imagery.
arXiv Detail & Related papers (2021-08-02T07:50:50Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
- Predicting Emotions Perceived from Sounds [2.9398911304923447]
Sonification is the science of communicating data and events to users through sound.
This paper conducts an experiment in which several mainstream and conventional machine learning algorithms are developed.
It is possible to predict perceived emotions with high accuracy.
arXiv Detail & Related papers (2020-12-04T15:01:59Z)
- A Framework for Generative and Contrastive Learning of Audio Representations [2.8935588665357077]
We present a framework for contrastive learning of audio representations in a self-supervised setting, without access to ground-truth labels.
We also explore generative models based on state-of-the-art transformer-based architectures for learning latent spaces for audio signals.
Our system achieves considerable performance compared to a fully supervised method that has access to ground-truth labels to train the neural network model.
arXiv Detail & Related papers (2020-10-22T05:52:32Z)
- COALA: Co-Aligned Autoencoders for Learning Semantically Enriched Audio Representations [32.456824945999465]
We propose a method for learning audio representations by aligning the learned latent representations of audio and associated tags.
We evaluate the quality of our embedding model, measuring its performance as a feature extractor on three different tasks.
arXiv Detail & Related papers (2020-06-15T13:17:18Z)
- Semantic Object Prediction and Spatial Sound Super-Resolution with Binaural Sounds [106.87299276189458]
Humans can robustly recognize and localize objects by integrating visual and auditory cues.
This work develops an approach for dense semantic labelling of sound-making objects, purely based on sounds.
arXiv Detail & Related papers (2020-03-09T15:49:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.