Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation
- URL: http://arxiv.org/abs/2001.00391v1
- Date: Thu, 2 Jan 2020 11:12:50 GMT
- Title: Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation
- Authors: Rongzhi Gu and Yuexian Zou
- Abstract summary: Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
- Score: 66.46123655365113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Target speech separation refers to extracting the target speaker's speech
from mixed signals. Despite recent advances in deep learning based close-talk
speech separation, real-world application remains an open issue. Two main
challenges are the complex acoustic environment and the real-time processing
requirement. To address these challenges, we propose a temporal-spatial neural
filter, which directly estimates the target speech waveform from a
multi-speaker mixture in reverberant environments, assisted by directional
information of the speaker(s). Firstly, to cope with variations introduced by
the complex environment, the key idea is to increase the completeness of the
acoustic representation by jointly modeling the temporal, spectral and spatial
discriminability between the target and interference sources. Specifically,
temporal, spectral and spatial features, along with the designed directional
features, are integrated to form a joint acoustic representation. Secondly, to
reduce latency, we design a fully-convolutional autoencoder framework that is
purely end-to-end and single-pass. All feature computation is implemented with
network layers and operations to speed up the separation procedure. Evaluation
is conducted on the simulated reverberant WSJ0-2mix and WSJ0-3mix datasets
under a speaker-independent scenario. Experimental results demonstrate that the
proposed method outperforms state-of-the-art deep learning based multi-channel
approaches with fewer parameters and faster processing speed. Furthermore, the
proposed temporal-spatial neural filter can handle mixtures with a varying and
unknown number of speakers and maintains its performance even in the presence
of direction estimation errors. Code and models will be released soon.
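As a rough illustration of the pipeline described in the abstract, the sketch below builds a joint acoustic representation from a learned temporal encoder, log-magnitude spectral features, inter-channel phase differences (IPDs) as spatial features, and a cosine-similarity directional feature tied to the target direction, then applies a small dilated fully-convolutional separator that masks the encoded mixture before decoding back to a waveform. The feature definitions, layer sizes, mask-based decoding and helper names (TemporalSpatialFilterSketch, target_phase_diff) are illustrative assumptions, not the authors' released implementation.

# Minimal, hypothetical sketch of a direction-informed temporal-spatial filter.
# All dimensions and feature definitions are assumptions for illustration.
import torch
import torch.nn as nn


class TemporalSpatialFilterSketch(nn.Module):
    def __init__(self, n_mics=2, n_fft=256, hop=64, enc_dim=128, hid=128):
        super().__init__()
        self.n_fft, self.hop = n_fft, hop
        self.register_buffer("window", torch.hann_window(n_fft))
        # Temporal feature: learned 1-D conv encoder on the reference channel.
        self.encoder = nn.Conv1d(1, enc_dim, kernel_size=n_fft, stride=hop, bias=False)
        freq = n_fft // 2 + 1
        feat_dim = enc_dim + 3 * freq  # temporal + spectral + IPD + directional
        # Fully-convolutional separator (small dilated Conv1d stack, single pass).
        self.separator = nn.Sequential(
            nn.Conv1d(feat_dim, hid, 1), nn.PReLU(),
            nn.Conv1d(hid, hid, 3, padding=1, dilation=1), nn.PReLU(),
            nn.Conv1d(hid, hid, 3, padding=2, dilation=2), nn.PReLU(),
            nn.Conv1d(hid, enc_dim, 1), nn.Sigmoid(),  # mask on the encoded mixture
        )
        # Decoder: transposed conv back to the waveform domain.
        self.decoder = nn.ConvTranspose1d(enc_dim, 1, kernel_size=n_fft, stride=hop, bias=False)

    def forward(self, mix, target_phase_diff):
        # mix: (batch, n_mics, samples); channel 0 is the reference microphone.
        # target_phase_diff: (batch, freq) expected inter-microphone phase
        # difference for the target direction (a stand-in directional cue).
        ref = mix[:, 0]
        spec = torch.stft(mix.reshape(-1, mix.shape[-1]), self.n_fft, self.hop,
                          window=self.window, return_complex=True)
        spec = spec.reshape(mix.shape[0], mix.shape[1], spec.shape[-2], spec.shape[-1])
        log_mag = torch.log1p(spec[:, 0].abs())                  # spectral, (B, F, T)
        ipd = torch.angle(spec[:, 1]) - torch.angle(spec[:, 0])  # spatial, (B, F, T)
        ipd_feat = torch.cos(ipd)
        # Directional feature: agreement between observed IPD and the target DOA.
        direc = torch.cos(ipd - target_phase_diff.unsqueeze(-1))
        temporal = torch.relu(self.encoder(ref.unsqueeze(1)))    # (B, enc_dim, T')
        # Align time resolutions before concatenation (crude nearest interpolation).
        T = temporal.shape[-1]
        tf_feats = torch.nn.functional.interpolate(
            torch.cat([log_mag, ipd_feat, direc], dim=1), size=T, mode="nearest")
        joint = torch.cat([temporal, tf_feats], dim=1)           # joint representation
        mask = self.separator(joint)
        return self.decoder(temporal * mask).squeeze(1)          # estimated target waveform


# Toy usage: a 2-channel, 1-second mixture at 8 kHz.
if __name__ == "__main__":
    model = TemporalSpatialFilterSketch()
    mix = torch.randn(1, 2, 8000)
    phase_diff = torch.zeros(1, 256 // 2 + 1)  # broadside target as a placeholder
    est = model(mix, phase_diff)
    print(est.shape)  # torch.Size([1, 8000])

The single-pass, fully-convolutional structure is what keeps latency low in this kind of design; the toy usage at the end only verifies that the tensor shapes line up, and the paper's actual separator is considerably larger.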
Related papers
- TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation [19.126525226518975]
We propose a speech separation model with significantly reduced parameters and computational costs.
TIGER leverages prior knowledge to divide frequency bands and compresses frequency information.
We show that TIGER achieves performance surpassing the state-of-the-art (SOTA) model TF-GridNet.
arXiv Detail & Related papers (2024-10-02T12:21:06Z)
- Attention-Driven Multichannel Speech Enhancement in Moving Sound Source Scenarios [11.811571392419324]
Speech enhancement algorithms typically assume a stationary sound source, a common mismatch with reality that limits their performance in real-world scenarios.
This paper focuses on attention-driven spatial filtering techniques designed for dynamic settings.
arXiv Detail & Related papers (2023-12-17T16:12:35Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments [21.493664174262737]
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset.
It helps a user understand conversations held in real noisy, echoic environments (e.g., a cocktail party).
The method is combined with a blind dereverberation method called weighted prediction error (WPE) to transcribe the noisy reverberant speech of a speaker.
arXiv Detail & Related papers (2022-07-15T05:14:27Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments [33.79711018198589]
This paper introduces a new method for multi-channel time domain speech separation in reverberant environments.
A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings.
To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing method has been applied.
arXiv Detail & Related papers (2020-11-11T18:25:07Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in noisy real-world environments.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of other biologically plausible neuromorphic approaches for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.