Neural Fourier Shift for Binaural Speech Rendering
- URL: http://arxiv.org/abs/2211.00878v2
- Date: Mon, 1 May 2023 10:57:50 GMT
- Title: Neural Fourier Shift for Binaural Speech Rendering
- Authors: Jin Woo Lee, Kyogu Lee
- Abstract summary: We present a neural network for rendering binaural speech from given monaural audio, position, and orientation of the source.
We propose Neural Fourier Shift (NFS), a novel network architecture that enables binaural speech rendering in the Fourier space.
- Score: 16.957415282256758
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a neural network for rendering binaural speech from the given
monaural audio, position, and orientation of the source. Most previous works have
focused on synthesizing binaural speech by conditioning on the source position and
orientation in the feature space of convolutional neural networks. These synthesis
approaches estimate the target binaural speech well, even on in-the-wild data, but
are difficult to generalize to rendering audio from out-of-distribution domains. To
alleviate this, we propose Neural Fourier Shift (NFS), a novel network architecture
that enables binaural speech rendering in the Fourier space. Specifically, using a
geometric time delay based on the distance between the source and the receiver, NFS
is trained to predict the delays and scales of various early reflections. NFS is
efficient in both memory and computational cost, is interpretable, and by design
operates independently of the source domain. Experimental results show that NFS
performs comparably to previous studies on the benchmark dataset, while using 25
times less memory and 6 times fewer computations.
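As an illustration of the core mechanism, a time delay applied as a phase rotation in the Fourier domain, here is a minimal NumPy sketch. It is not the authors' code; the sampling rate, distance, gains, and ITD offset are assumptions for the example.

```python
import numpy as np

def fourier_shift(signal, delay_s, sr):
    """Delay a signal by delay_s seconds via the Fourier shift theorem:
    x(t - tau) <-> X(f) * exp(-2j*pi*f*tau). Handles fractional
    (sub-sample) delays; note the shift is circular for an FFT."""
    n = len(signal)
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    return np.fft.irfft(spectrum * np.exp(-2j * np.pi * freqs * delay_s), n=n)

sr = 48_000
distance_m = 1.7                      # assumed source-to-listener distance
delay = distance_m / 343.0            # geometric direct-path delay (c ~ 343 m/s)

mono = np.random.randn(sr)            # stand-in for a monaural speech signal
# A binaural render applies per-ear delays and scales; NFS learns to predict
# such delays and scales for the direct path and early reflections.
left = 0.9 * fourier_shift(mono, delay, sr)
right = 0.7 * fourier_shift(mono, delay + 0.0004, sr)  # assumed extra ITD
```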
Related papers
- Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis [1.4277428617774877]
We present Vocos, a new model that directly generates Fourier spectral coefficients.
It substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches.
arXiv Detail & Related papers (2023-06-01T15:40:32Z)
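A minimal sketch of the synthesis idea summarized above: predict complex STFT coefficients and recover the waveform with a single inverse STFT rather than generating samples one by one. The shapes, FFT size, and sampling rate are assumptions, and the random arrays stand in for network outputs.

```python
import numpy as np
from scipy.signal import istft

# Stand-ins for network-predicted magnitude and phase (freq_bins x frames).
freq_bins, frames = 513, 100          # nperseg=1024 -> 513 one-sided bins
pred_mag = np.random.rand(freq_bins, frames)
pred_phase = np.random.uniform(-np.pi, np.pi, size=(freq_bins, frames))

# Build complex Fourier coefficients, then one inverse STFT yields audio;
# avoiding sample-by-sample generation is the source of the speedup.
coeffs = pred_mag * np.exp(1j * pred_phase)
_, audio = istft(coeffs, fs=24_000, nperseg=1024, noverlap=768)
```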
- NAF: Neural Attenuation Fields for Sparse-View CBCT Reconstruction [79.13750275141139]
This paper proposes a novel and fast self-supervised solution for sparse-view CBCT reconstruction.
The desired attenuation coefficients are represented as a continuous function of 3D spatial coordinates, parameterized by a fully-connected deep neural network.
A learning-based encoder entailing hash coding is adopted to help the network capture high-frequency details.
arXiv Detail & Related papers (2022-09-29T04:06:00Z)
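A minimal sketch of the coordinate-network idea summarized above: an MLP maps 3D positions to attenuation coefficients. For brevity this uses a plain sinusoidal encoding in place of the paper's learned hash encoding, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

def encode(xyz, num_freqs=8):
    """Lift (N, 3) coordinates to high-frequency features so the MLP can
    capture fine detail (the role the hash-based encoding plays in NAF)."""
    feats = [xyz]
    for k in range(num_freqs):
        feats.append(torch.sin((2.0 ** k) * torch.pi * xyz))
        feats.append(torch.cos((2.0 ** k) * torch.pi * xyz))
    return torch.cat(feats, dim=-1)

mlp = nn.Sequential(
    nn.Linear(3 + 2 * 8 * 3, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 1),                 # scalar attenuation at each point
)

points = torch.rand(1024, 3)           # query points sampled along X-ray paths
mu = mlp(encode(points))               # predicted attenuation, shape (1024, 1)
```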
- Variable Bitrate Neural Fields [75.24672452527795]
We present a dictionary method for compressing feature grids, reducing their memory consumption by up to 100x.
We formulate the dictionary optimization as a vector-quantized auto-decoder problem which lets us learn end-to-end discrete neural representations in a space where no direct supervision is available.
arXiv Detail & Related papers (2022-06-15T17:58:34Z)
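A minimal sketch of the compression idea summarized above: store one small codebook index per grid cell instead of a dense feature vector. The paper learns the codes end-to-end as a vector-quantized auto-decoder; this sketch shows only the storage and lookup, with assumed sizes.

```python
import torch

D, codebook_size = 16, 256
grid = torch.randn(64, 64, 64, D)          # dense feature grid to compress
codebook = torch.randn(codebook_size, D)   # learned dictionary of features

# Nearest-codeword assignment: each cell keeps a single uint8 index instead
# of D float32 values (64 bytes -> 1 byte per cell in this configuration).
flat = grid.reshape(-1, D)
indices = torch.cdist(flat, codebook).argmin(dim=1).to(torch.uint8)

# Decoding is a table lookup back into the codebook.
decoded = codebook[indices.long()].reshape(64, 64, 64, D)
```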
- FFC-SE: Fast Fourier Convolution for Speech Enhancement [1.0499611180329804]
Fast Fourier convolution (FFC) is a recently proposed neural operator showing promising performance in several computer vision problems.
In this work, we design neural network architectures which adapt FFC for speech enhancement.
We found that neural networks based on FFC outperform analogous convolutional models and show results better than or comparable to other speech enhancement baselines.
arXiv Detail & Related papers (2022-04-06T18:52:47Z)
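A minimal sketch of the global (spectral) branch of a Fast Fourier convolution as used in the entry above: transform along the frequency axis, apply a pointwise convolution to the real and imaginary parts, and transform back. Channel counts and shapes are assumptions.

```python
import torch
import torch.nn as nn

class SpectralTransform(nn.Module):
    """FFC global branch: FFT over one axis, 1x1 conv in the transform
    domain, inverse FFT. This gives every output a receptive field that
    spans the whole axis in a single layer."""
    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(2 * channels, 2 * channels, kernel_size=1)

    def forward(self, x):                   # x: (batch, C, freq, time)
        spec = torch.fft.rfft(x, dim=2)     # FFT along the frequency axis
        z = torch.cat([spec.real, spec.imag], dim=1)
        z = torch.relu(self.conv(z))
        real, imag = z.chunk(2, dim=1)
        return torch.fft.irfft(torch.complex(real, imag), n=x.shape[2], dim=2)

x = torch.randn(1, 32, 256, 100)            # spectrogram-like features
y = SpectralTransform(32)(x)                # same shape as the input
```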
- NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS that retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
Its synthesis is also 28% faster than WaveGAN's on a single core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z)
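A minimal sketch of the deterministic-plus-stochastic idea with multiband mixing described above: a periodic source and a noise source are blended per frequency band by voicing weights. In NeuralDPS these components and weights come from the network; here all values are assumptions.

```python
import numpy as np

sr, f0, dur = 16_000, 120.0, 0.05
n = int(sr * dur)
t = np.arange(n) / sr

deterministic = np.sign(np.sin(2 * np.pi * f0 * t))   # crude periodic source
stochastic = np.random.randn(n)                        # noise source

# Per-band voicing weights (4 bands here) control how much of each source
# enters the excitation; a neural vocoder predicts such weights per frame.
spec_d = np.fft.rfft(deterministic)
spec_s = np.fft.rfft(stochastic)
bins = len(spec_d)
weights = np.repeat([0.95, 0.7, 0.3, 0.05], int(np.ceil(bins / 4)))[:bins]

excitation = np.fft.irfft(weights * spec_d + (1 - weights) * spec_s, n=n)
```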
- DeepA: A Deep Neural Analyzer For Speech And Singing Vocoding [71.73405116189531]
We propose a neural vocoder that extracts F0 and timbre/aperiodicity encodings from the input speech, emulating those defined in conventional vocoders.
As the deep neural analyzer is learnable, it is expected to be more accurate for signal reconstruction and manipulation, and generalizable from speech to singing.
arXiv Detail & Related papers (2021-10-13T01:39:57Z)
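For reference, the quantities DeepA's learned analyzer extracts have classical counterparts; the sketch below shows a plain autocorrelation F0 estimate of the kind the learnable analyzer is meant to improve on. The frame length and pitch range are assumptions.

```python
import numpy as np

def autocorr_f0(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate F0 from one frame by locating the autocorrelation peak
    within the plausible pitch-lag range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16_000
t = np.arange(1024) / sr
frame = np.sin(2 * np.pi * 150.0 * t)       # synthetic 150 Hz "voiced" frame
print(autocorr_f0(frame, sr))               # ~150 Hz
```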
- PILOT: Introducing Transformers for Probabilistic Sound Event Localization [107.78964411642401]
This paper introduces a novel transformer-based sound event localization framework, where temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms.
The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy.
arXiv Detail & Related papers (2021-06-07T18:29:19Z)
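A minimal sketch of the transformer-based localization idea summarized above: per-frame features from the multi-channel input attend to each other, and a head predicts a direction of arrival per frame. PILOT additionally models uncertainty; the sizes here, and the plain regression head, are assumptions.

```python
import torch
import torch.nn as nn

feat_dim, frames = 256, 100
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True),
    num_layers=4,
)
doa_head = nn.Linear(feat_dim, 3)             # unit vector toward the source

features = torch.randn(1, frames, feat_dim)   # per-frame multi-channel features
hidden = encoder(features)                    # self-attention over time
doa = nn.functional.normalize(doa_head(hidden), dim=-1)  # (1, frames, 3)
```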
- Frequency Gating: Improved Convolutional Neural Networks for Speech Enhancement in the Time-Frequency Domain [37.722450363816144]
We introduce a method, which we call Frequency Gating, to compute multiplicative weights for the kernels of the CNN.
Experiments with an autoencoder neural network with skip connections show that both local and frequency-wise gating outperform the baseline.
A loss function based on the extended short-time objective intelligibility score (ESTOI) is introduced, which we show to outperform the standard mean squared error (MSE) loss function.
arXiv Detail & Related papers (2020-11-08T22:04:00Z)
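A minimal sketch of frequency-wise gating as described above: a learned multiplicative weight per output channel and frequency bin rescales the convolution output, breaking the shift invariance of plain kernels along the frequency axis. The paper also derives gates from the input (local gating); the static gates and shapes here are assumptions.

```python
import torch
import torch.nn as nn

class FreqGatedConv(nn.Module):
    def __init__(self, in_ch, out_ch, n_freq):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        # One gate per (channel, frequency); broadcast over batch and time.
        self.gate = nn.Parameter(torch.zeros(1, out_ch, n_freq, 1))

    def forward(self, x):                      # x: (batch, C, freq, time)
        return self.conv(x) * torch.sigmoid(self.gate)

x = torch.randn(4, 1, 257, 100)                # spectrogram batch
y = FreqGatedConv(1, 16, 257)(x)               # (4, 16, 257, 100)
```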
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach to sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
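The interaural time difference (ITD) the spiking network encodes has a classical estimate via cross-correlation, mapped to azimuth with the far-field approximation ITD = (d / c) * sin(azimuth). The sketch below shows that baseline computation; the microphone spacing and sampling rate are assumptions.

```python
import numpy as np

def itd_to_azimuth(left, right, sr, mic_dist=0.18, c=343.0):
    """Estimate azimuth (radians) from the peak of the cross-correlation."""
    xc = np.correlate(left, right, mode="full")
    lag = np.argmax(xc) - (len(right) - 1)     # samples; sign gives the side
    itd = lag / sr
    return np.arcsin(np.clip(itd * c / mic_dist, -1.0, 1.0))

sr = 48_000
noise = np.random.randn(4800)
shift = 10                                      # ~0.21 ms ITD
left, right = noise[shift:], noise[:-shift]     # left leads right by 10 samples
print(np.degrees(itd_to_azimuth(left, right, sr)))  # ~ -23 deg for this lag
```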
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.