Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks
- URL: http://arxiv.org/abs/2006.09006v2
- Date: Wed, 16 Dec 2020 19:07:34 GMT
- Title: Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks
- Authors: David Diaz-Guerra, Antonio Miguel and Jose R. Beltran
- Abstract summary: We present a new single sound source DOA estimation and tracking system based on the SRP-PHAT algorithm and a three-dimensional Convolutional Neural Network.
It uses SRP-PHAT power maps as input features of a fully convolutional causal architecture whose 3D convolutional layers accurately track a sound source.
- Score: 10.089520556398574
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a new single sound source DOA estimation and
tracking system based on the well-known SRP-PHAT algorithm and a
three-dimensional Convolutional Neural Network. It uses SRP-PHAT power maps as
input features of a fully convolutional causal architecture that uses 3D
convolutional layers to accurately track a sound source even in highly
reverberant scenarios where most state-of-the-art techniques fail. Unlike
previous methods, since we do not use bidirectional recurrent layers and all
our convolutional layers are causal in the time dimension, our system is
feasible for real-time applications and provides a new DOA estimate for each
new SRP-PHAT map. To train the model, we introduce a new procedure to simulate
random trajectories as they are needed during training, equivalent to an
infinite-size dataset with high flexibility to modify its acoustic conditions,
such as the reverberation time. We use both acoustic simulations across a wide
range of reverberation times and real recordings from the LOCATA dataset to
demonstrate the robustness of our system and its good performance even with
low-resolution SRP-PHAT maps.
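As background for the input features: an SRP-PHAT power map assigns to each candidate source direction the sum, over all microphone pairs, of the GCC-PHAT cross-correlation evaluated at that direction's expected time difference of arrival. A minimal NumPy sketch of the GCC-PHAT building block (the two-microphone setup, sample rate, and delay values are illustrative, not taken from the paper):

```python
import numpy as np

def gcc_phat(x1, x2, fs=16000):
    """Estimate the time delay of x1 relative to x2 with GCC-PHAT.

    The PHAT weighting divides the cross-spectrum by its magnitude,
    keeping only phase information, which sharpens the correlation
    peak and makes it robust to reverberation.
    """
    n = len(x1) + len(x2)                      # zero-pad to avoid circular wrap
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    R = X1 * np.conj(X2)
    R /= np.abs(R) + 1e-12                     # PHAT: whiten to unit magnitude
    cc = np.fft.irfft(R, n=n)
    cc = np.concatenate((cc[-(n // 2):], cc[:n // 2 + 1]))  # center zero lag
    lag = int(np.argmax(np.abs(cc))) - n // 2  # samples; negative if x1 leads x2
    return lag / fs, cc

# Hypothetical two-microphone example: a noise burst reaches mic 2
# 25 samples after mic 1.
rng = np.random.default_rng(0)
x1 = rng.standard_normal(4096)
delay = 25
x2 = np.concatenate((np.zeros(delay), x1))[:len(x1)]
tau, _ = gcc_phat(x1, x2)
print(round(tau * 16000))  # -25: x1 leads x2 by 25 samples
```

A full SRP-PHAT map would repeat this whitened correlation for every microphone pair and accumulate the values at the TDOAs implied by each grid direction.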
Related papers
- KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter [49.85369344101118]
We introduce KFD-NeRF, a novel dynamic neural radiance field integrated with an efficient and high-quality motion reconstruction framework based on Kalman filtering.
Our key idea is to model the dynamic radiance field as a dynamic system whose temporally varying states are estimated based on two sources of knowledge: observations and predictions.
Our KFD-NeRF demonstrates similar or even superior performance within comparable computational time and state-of-the-art view synthesis performance with thorough training.
arXiv Detail & Related papers (2024-07-18T05:48:24Z)
- ResFields: Residual Neural Fields for Spatiotemporal Signals [61.44420761752655]
ResFields is a novel class of networks specifically designed to effectively represent complex temporal signals.
We conduct comprehensive analysis of the properties of ResFields and propose a matrix factorization technique to reduce the number of trainable parameters.
We demonstrate the practical utility of ResFields by showcasing its effectiveness in capturing dynamic 3D scenes from sparse RGBD cameras.
arXiv Detail & Related papers (2023-09-06T16:59:36Z)
- SeMLaPS: Real-time Semantic Mapping with Latent Prior Networks and Quasi-Planar Segmentation [53.83313235792596]
We present a new methodology for real-time semantic mapping from RGB-D sequences.
It combines a 2D neural network and a 3D network based on a SLAM system with 3D occupancy mapping.
Our system achieves state-of-the-art semantic mapping quality among 2D-3D network-based systems.
arXiv Detail & Related papers (2023-06-28T22:36:44Z)
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
- WNet: A data-driven dual-domain denoising model for sparse-view computed tomography with a trainable reconstruction layer [3.832032989515628]
We propose WNet, a data-driven dual-domain denoising model which contains a trainable reconstruction layer for sparse-view artifact denoising.
We train and test our network on two clinically relevant datasets and we compare the obtained results with three different types of sparse-view CT denoising and reconstruction algorithms.
arXiv Detail & Related papers (2022-07-01T13:17:01Z)
- Time-Frequency Localization Using Deep Convolutional Maxout Neural Network in Persian Speech Recognition [0.0]
Time-frequency flexibility in the auditory neural systems of some mammals improves recognition performance.
This paper proposes a CNN-based structure for time-frequency localization of audio signal information in the ASR acoustic model.
The average recognition score of TFCMNN models is about 1.6% higher than the average of conventional models.
arXiv Detail & Related papers (2021-08-09T05:46:58Z)
- Multi-view Depth Estimation using Epipolar Spatio-Temporal Networks [87.50632573601283]
We present a novel method for multi-view depth estimation from a single video.
Our method achieves temporally coherent depth estimation results by using a novel Epipolar Spatio-Temporal (EST) transformer.
To reduce the computational cost, inspired by recent Mixture-of-Experts models, we design a compact hybrid network.
arXiv Detail & Related papers (2020-11-26T04:04:21Z)
- Inferring, Predicting, and Denoising Causal Wave Dynamics [3.9407250051441403]
The DISTributed Artificial neural Network Architecture (DISTANA) is a generative, recurrent graph convolution neural network.
We show that DISTANA is well suited to denoising data streams, given that recurring patterns are observed.
It produces stable and accurate closed-loop predictions even over hundreds of time steps.
arXiv Detail & Related papers (2020-09-19T08:33:53Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, surpassing the accuracy of other biologically plausible neuromorphic approaches to sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.