Implicit Neural Spatial Filtering for Multichannel Source Separation in
the Waveform Domain
- URL: http://arxiv.org/abs/2206.15423v1
- Date: Thu, 30 Jun 2022 17:13:01 GMT
- Title: Implicit Neural Spatial Filtering for Multichannel Source Separation in
the Waveform Domain
- Authors: Dejan Markovic, Alexandre Defossez, Alexander Richard
- Abstract summary: The model is trained end-to-end and performs spatial processing implicitly.
We evaluate the proposed model on a real-world dataset and show that the model matches the performance of an oracle beamformer.
- Score: 131.74762114632404
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a single-stage causal waveform-to-waveform multichannel model that
can separate moving sound sources based on their broad spatial locations in a
dynamic acoustic scene. We divide the scene into two spatial regions
containing, respectively, the target and the interfering sound sources. The
model is trained end-to-end and performs spatial processing implicitly, without
any components based on traditional processing or use of hand-crafted spatial
features. We evaluate the proposed model on a real-world dataset and show that
the model matches the performance of an oracle beamformer followed by a
state-of-the-art single-channel enhancement network.
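The oracle-beamformer baseline the paper compares against performs explicit spatial filtering, which the proposed model learns implicitly. As a hedged illustration of the classical idea (a minimal frequency-domain delay-and-sum beamformer, not the paper's oracle system; all names and parameters below are illustrative):

```python
import numpy as np

def delay_and_sum(signals, delays, sr):
    """Align multichannel signals by per-channel delays (in seconds) and
    average them, reinforcing the target direction.

    signals: (n_channels, n_samples) array
    delays:  per-channel steering delays toward the target
    """
    n_ch, n = signals.shape
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)            # frequency bins in Hz
    spec = np.fft.rfft(signals, axis=1)               # per-channel spectra
    # A delay of d seconds is a linear phase ramp exp(-2j*pi*f*d)
    phase = np.exp(-2j * np.pi * freqs[None, :] * np.asarray(delays)[:, None])
    aligned = np.fft.irfft(spec * phase, n=n, axis=1)
    return aligned.mean(axis=0)                       # coherent average

# Toy scene: the second microphone hears the source 5 samples late,
# so steering it 5 samples earlier realigns the channels.
sr = 16000
t = np.arange(1024) / sr
src = np.sin(2 * np.pi * 440 * t)
mics = np.stack([src, np.roll(src, 5)])
out = delay_and_sum(mics, delays=[0.0, -5 / sr], sr=sr)
```

After alignment the target adds coherently across channels while sources from other directions partially cancel, which is the spatial selectivity a learned model must reproduce end-to-end.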
Related papers
- ImmerseDiffusion: A Generative Spatial Audio Latent Diffusion Model [2.2927722373373247]
We introduce ImmerseDiffusion, an end-to-end generative audio model that produces 3D immersive soundscapes conditioned on the spatial, temporal, and environmental conditions of sound objects.
arXiv Detail & Related papers (2024-10-19T02:28:53Z)
- Spatial-Frequency U-Net for Denoising Diffusion Probabilistic Models [89.76587063609806]
We study the denoising diffusion probabilistic model (DDPM) in wavelet space, instead of pixel space, for visual synthesis.
By explicitly modeling the wavelet signals, we find our model is able to generate images with higher quality on several datasets.
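To make "wavelet space instead of pixel space" concrete, here is a hedged NumPy-only sketch of one level of the 2-D Haar transform, the simplest wavelet decomposition (illustrative only, not the paper's transform or network):

```python
import numpy as np

def haar2d(x):
    """One level of the 2-D Haar wavelet transform.

    Splits an (H, W) image (H, W even) into four half-resolution subbands:
    LL (approximation) plus LH/HL/HH (detail) coefficients.
    """
    # Pair rows: low = scaled sum, high = scaled difference
    lo_r = (x[0::2] + x[1::2]) / np.sqrt(2)
    hi_r = (x[0::2] - x[1::2]) / np.sqrt(2)
    # Pair columns of both results the same way
    ll = (lo_r[:, 0::2] + lo_r[:, 1::2]) / np.sqrt(2)
    lh = (lo_r[:, 0::2] - lo_r[:, 1::2]) / np.sqrt(2)
    hl = (hi_r[:, 0::2] + hi_r[:, 1::2]) / np.sqrt(2)
    hh = (hi_r[:, 0::2] - hi_r[:, 1::2]) / np.sqrt(2)
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    """Invert one Haar level, reconstructing the image exactly."""
    h, w = ll.shape
    lo_r = np.empty((h, 2 * w)); hi_r = np.empty((h, 2 * w))
    lo_r[:, 0::2] = (ll + lh) / np.sqrt(2); lo_r[:, 1::2] = (ll - lh) / np.sqrt(2)
    hi_r[:, 0::2] = (hl + hh) / np.sqrt(2); hi_r[:, 1::2] = (hl - hh) / np.sqrt(2)
    x = np.empty((2 * h, 2 * w))
    x[0::2] = (lo_r + hi_r) / np.sqrt(2); x[1::2] = (lo_r - hi_r) / np.sqrt(2)
    return x

img = np.random.default_rng(0).random((8, 8))
subbands = haar2d(img)
```

Because the transform is invertible, a diffusion model can denoise the subband coefficients and map back to pixels without losing information.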
arXiv Detail & Related papers (2023-07-27T06:53:16Z)
- Multi-Microphone Speaker Separation by Spatial Regions [9.156939957189504]
We consider the task of region-based source separation of reverberant multi-microphone recordings.
We propose a data-driven approach using a modified version of a state-of-the-art network.
We show that both training methods result in a fixed mapping of regions to network outputs, achieve comparable performance, and that the networks exploit spatial information.
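The spatial information such networks exploit is classically summarized by inter-microphone time differences. As a hedged illustration (GCC-PHAT between two microphones with a crude two-region decision; this is a textbook method, not the paper's data-driven approach, and all names are illustrative):

```python
import numpy as np

def gcc_phat_tdoa(sig_a, sig_b, sr):
    """Estimate the time difference of arrival between two microphone
    signals using GCC-PHAT (phase-transform whitening)."""
    n = len(sig_a) + len(sig_b)                  # zero-pad against wrap-around
    A = np.fft.rfft(sig_a, n=n)
    B = np.fft.rfft(sig_b, n=n)
    cross = A * np.conj(B)
    cross /= np.abs(cross) + 1e-12               # PHAT: keep phase only
    cc = np.fft.irfft(cross, n=n)
    cc = np.concatenate([cc[-(n // 2):], cc[: n // 2 + 1]])  # center zero lag
    lag = np.argmax(cc) - n // 2                 # peak position in samples
    return lag / sr

# Toy scene: mic B hears the source 8 samples later than mic A,
# so the source sits in the spatial region closer to mic A.
rng = np.random.default_rng(1)
src = rng.standard_normal(2048)
sig_a = src
sig_b = np.roll(src, 8)
tdoa = gcc_phat_tdoa(sig_a, sig_b, sr=16000)
region = "A-side" if tdoa < 0 else "B-side"      # crude two-region decision
```

A learned region-based separator replaces this hand-crafted cue with features extracted directly from the multichannel waveform, but the underlying geometry it must discover is the same.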
arXiv Detail & Related papers (2023-03-13T14:11:34Z)
- On Neural Architectures for Deep Learning-based Source Separation of Co-Channel OFDM Signals [104.11663769306566]
We study the single-channel source separation problem involving orthogonal frequency-division multiplexing (OFDM) signals.
We propose critical domain-informed modifications to the network parameterization, based on insights from OFDM structures.
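The OFDM structure such domain-informed modifications exploit is simple: subcarrier symbols are modulated with an inverse FFT and protected by a cyclic prefix, and demodulation is an FFT. A hedged minimal sketch (illustrative names and parameters, not the paper's model):

```python
import numpy as np

def ofdm_modulate(symbols, cp_len):
    """Map one block of complex subcarrier symbols to a time-domain OFDM
    symbol: inverse FFT plus a cyclic prefix copied from the tail."""
    time = np.fft.ifft(symbols)
    return np.concatenate([time[-cp_len:], time])   # prepend cyclic prefix

def ofdm_demodulate(rx, cp_len, n_sub):
    """Drop the cyclic prefix and recover subcarrier symbols with an FFT."""
    return np.fft.fft(rx[cp_len:cp_len + n_sub])

# Toy example: 64 QPSK subcarriers, cyclic prefix of 16 samples
rng = np.random.default_rng(0)
bits = rng.integers(0, 2, size=(64, 2))
qpsk = ((2 * bits[:, 0] - 1) + 1j * (2 * bits[:, 1] - 1)) / np.sqrt(2)
tx = ofdm_modulate(qpsk, cp_len=16)
recovered = ofdm_demodulate(tx, cp_len=16, n_sub=64)
```

A separation network aware of this block structure (FFT size, prefix length, constellation) can parameterize its layers around it rather than treating the waveform as unstructured audio.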
arXiv Detail & Related papers (2023-03-11T16:29:13Z)
- BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by both channels and a channel-specific part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize them respectively.
Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
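The common/specific decomposition resembles classic mid/side coding. As a hedged sketch of the idea (illustrative helper names, not BinauralGrad's actual decomposition):

```python
import numpy as np

def to_common_specific(left, right):
    """Split a binaural pair into a common (shared) part and channel-specific
    (difference) parts, as in mid/side coding."""
    common = (left + right) / 2.0
    specific_l = left - common
    specific_r = right - common        # equals -specific_l by construction
    return common, specific_l, specific_r

def from_common_specific(common, specific_l, specific_r):
    """Recombine the parts into the original left/right channels."""
    return common + specific_l, common + specific_r

rng = np.random.default_rng(0)
left = rng.standard_normal(480)
right = rng.standard_normal(480)
common, spec_l, spec_r = to_common_specific(left, right)
l2, r2 = from_common_specific(common, spec_l, spec_r)
```

Synthesizing the common part first and the specific part second lets each stage model a simpler signal than the raw binaural pair.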
arXiv Detail & Related papers (2022-05-30T02:09:26Z)
- Coarse-to-Fine Video Denoising with Dual-Stage Spatial-Channel Transformer [29.03463312813923]
Video denoising aims to recover high-quality frames from the noisy video.
Most existing approaches adopt convolutional neural networks(CNNs) to separate the noise from the original visual content.
We propose a Dual-stage Spatial-Channel Transformer (DSCT) for coarse-to-fine video denoising.
arXiv Detail & Related papers (2022-04-30T09:01:21Z)
- Learning Signal-Agnostic Manifolds of Neural Fields [50.066449953522685]
We leverage neural fields to capture the underlying structure in image, shape, audio and cross-modal audiovisual domains.
We show that by walking across the underlying manifold of GEM, we may generate new samples in our signal domains.
arXiv Detail & Related papers (2021-11-11T18:57:40Z)
- Data Fusion for Audiovisual Speaker Localization: Extending Dynamic Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.