BatVision with GCC-PHAT Features for Better Sound to Vision Predictions
- URL: http://arxiv.org/abs/2006.07995v1
- Date: Sun, 14 Jun 2020 19:49:58 GMT
- Title: BatVision with GCC-PHAT Features for Better Sound to Vision Predictions
- Authors: Jesper Haahr Christensen, Sascha Hornauer, Stella Yu
- Abstract summary: We train a generative adversarial network to predict plausible depth maps and grayscale layouts from sound.
We build upon previous work with BatVision that consists of a sound-to-vision model and a self-collected dataset.
- Score: 5.9514420658483935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Inspired by sophisticated echolocation abilities found in nature, we train a
generative adversarial network to predict plausible depth maps and grayscale
layouts from sound. To achieve this, our sound-to-vision model processes
binaural echo-returns from chirping sounds. We build upon previous work with
BatVision that consists of a sound-to-vision model and a self-collected dataset
using our mobile robot and low-cost hardware. We improve on the previous model
by introducing several changes that lead to better depth and grayscale
estimation and increased perceptual quality. Rather than using raw binaural
waveforms as input, we generate generalized cross-correlation (GCC) features
and use these instead. In addition, we change the model
generator and base it on residual learning and use spectral normalization in
the discriminator. We compare and present both quantitative and qualitative
improvements over our previous BatVision model.
Related papers
- FoundationStereo: Zero-Shot Stereo Matching [50.79202911274819]
FoundationStereo is a foundation model for stereo depth estimation.
We first construct a large-scale (1M stereo pairs) synthetic training dataset.
We then design a number of network architecture components to enhance scalability.
arXiv Detail & Related papers (2025-01-17T01:01:44Z)
- Deepfake Audio Detection Using Spectrogram-based Feature and Ensemble of Deep Learning Models [42.39774323584976]
We propose a deep learning based system for the task of deepfake audio detection.
In particular, the raw input audio is first transformed into various spectrograms.
We leverage the state-of-the-art audio pre-trained models of Whisper, Seamless, Speechbrain, and Pyannote to extract audio embeddings.
arXiv Detail & Related papers (2024-07-01T20:10:43Z)
- Neural Residual Diffusion Models for Deep Scalable Vision Generation [17.931568104324985]
We propose a unified and massively scalable Neural Residual Diffusion Models framework (Neural-RDM).
The proposed neural residual models obtain state-of-the-art scores on image and video generative benchmarks.
arXiv Detail & Related papers (2024-06-19T04:57:18Z)
- Variational Positive-incentive Noise: How Noise Benefits Models [84.67629229767047]
We investigate how classical models can benefit from random noise under the framework of Positive-incentive Noise (Pi-Noise).
Since the ideal objective of Pi-Noise is intractable, we propose to optimize its variational bound instead, namely variational Pi-Noise (VPN).
arXiv Detail & Related papers (2023-06-13T09:43:32Z)
- Audio-visual speech enhancement with a deep Kalman filter generative model [0.0]
We present an audiovisual deep Kalman filter (AV-DKF) generative model which assumes a first-order Markov chain model for the latent variables.
We develop an efficient inference methodology to estimate speech signals at test time.
arXiv Detail & Related papers (2022-11-02T09:50:08Z)
- STAR: Sparse Transformer-based Action Recognition [61.490243467748314]
This work proposes a novel skeleton-based human action recognition model with sparse attention on the spatial dimension and segmented linear attention on the temporal dimension of data.
Experiments show that our model achieves comparable performance with far fewer trainable parameters and high speed in training and inference.
arXiv Detail & Related papers (2021-07-15T02:53:11Z)
- Switching Variational Auto-Encoders for Noise-Agnostic Audio-visual Speech Enhancement [26.596930749375474]
We introduce the use of a latent sequential variable with Markovian dependencies to switch between different VAE architectures through time.
We derive the corresponding variational expectation-maximization algorithm to estimate the parameters of the model and enhance the speech signal.
arXiv Detail & Related papers (2021-02-08T11:45:02Z)
- Anomaly Detection of Time Series with Smoothness-Inducing Sequential Variational Auto-Encoder [59.69303945834122]
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
arXiv Detail & Related papers (2021-02-02T06:15:15Z)
- Neural PLDA Modeling for End-to-End Speaker Verification [40.842070706362534]
We propose a neural network approach for backend modeling in speaker verification called the neural PLDA (NPLDA)
In this paper, we extend this work to achieve joint optimization of the embedding neural network (x-vector network) with the NPLDA network in an end-to-end fashion.
We show that the proposed E2E model improves significantly over the x-vector PLDA baseline speaker verification system.
arXiv Detail & Related papers (2020-08-11T05:54:54Z)
- TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech [63.03318307254081]
TERA stands for Transformer Representations from Alteration.
We use alteration along three axes to pre-train Transformers on a large amount of unlabeled speech.
TERA can be used for speech representations extraction or fine-tuning with downstream models.
arXiv Detail & Related papers (2020-07-12T16:19:00Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences arising from its use.