Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-order Latent Domain
- URL: http://arxiv.org/abs/2110.04791v1
- Date: Sun, 10 Oct 2021 13:21:16 GMT
- Title: Stepwise-Refining Speech Separation Network via Fine-Grained Encoding in High-order Latent Domain
- Authors: Zengwei Yao, Wenjie Pei, Fanglin Chen, Guangming Lu, and David Zhang
- Abstract summary: We propose the Stepwise-Refining Speech Separation Network (SRSSN), which follows a coarse-to-fine separation framework.
It first learns a 1-order latent domain to define an encoding space and thereby performs a rough separation in the coarse phase.
It then learns a new latent domain along each basis function of the existing latent domain to obtain a high-order latent domain in the refining phase.
- Score: 34.23260020137834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The crux of single-channel speech separation is how to encode the mixture of
signals into a latent embedding space in which the signals from different
speakers can be precisely separated. Existing methods for speech separation
either transform the speech signals into the frequency domain to perform
separation or seek to learn a separable embedding space by constructing a
latent domain based on convolutional filters. While the latter type of method,
which learns an embedding space, achieves substantial improvements in speech
separation, we argue that an embedding space defined by only one latent domain
does not suffice to provide a thoroughly separable encoding space for speech separation.
In this paper, we propose the Stepwise-Refining Speech Separation Network
(SRSSN), which follows a coarse-to-fine separation framework. It first learns a
1-order latent domain to define an encoding space and thereby performs a rough
separation in the coarse phase. Then the proposed SRSSN learns a new latent
domain along each basis function of the existing latent domain to obtain a
high-order latent domain in the refining phase, which enables the model to
perform a refining separation and thus achieve more precise results. We
demonstrate the effectiveness of our SRSSN by conducting extensive experiments,
including speech separation in a clean (noise-free) setting on WSJ0-2/3mix
datasets as well as in noisy/reverberant settings on WHAM!/WHAMR! datasets.
Furthermore, we also perform experiments of speech recognition on separated
speech signals by our model to evaluate the performance of speech separation
indirectly.
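
To make the coarse-to-fine idea concrete, here is a minimal, hypothetical sketch of how a 1-order latent domain (a single learned filterbank) could be refined into a high-order one by learning a sub-filterbank along each coarse basis function. The module names, sizes, and the grouped-convolution realization are our assumptions for illustration, not the authors' architecture; the separator networks and decoders are elided.

```python
import torch
import torch.nn as nn

class CoarseToFineEncoder(nn.Module):
    def __init__(self, n_basis=256, m_basis=32, kernel=16, stride=8, n_spk=2):
        super().__init__()
        # Coarse phase: a 1-order latent domain, i.e. one learned filterbank
        # whose Conv1d kernels act as basis functions (Conv-TasNet-style).
        self.coarse_enc = nn.Conv1d(1, n_basis, kernel, stride=stride)
        self.coarse_mask = nn.Sequential(
            nn.Conv1d(n_basis, n_basis * n_spk, 1), nn.Sigmoid())
        # Refining phase: learn a new latent domain along *each* coarse basis
        # function; groups=n_basis gives every coarse channel its own m_basis
        # sub-filters, yielding an (n_basis x m_basis) high-order domain.
        self.refine_enc = nn.Conv1d(n_basis, n_basis * m_basis, kernel,
                                    stride=stride, groups=n_basis)
        self.refine_mask = nn.Sequential(
            nn.Conv1d(n_basis * m_basis, n_basis * m_basis * n_spk, 1),
            nn.Sigmoid())
        self.n_spk = n_spk

    def forward(self, mix):                      # mix: (batch, 1, samples)
        z1 = torch.relu(self.coarse_enc(mix))    # (B, N, T'): 1-order coding
        m1 = self.coarse_mask(z1)                # (B, N*S, T'): rough masks
        rough = m1.view(m1.size(0), self.n_spk, -1, m1.size(-1)) * z1.unsqueeze(1)
        z2 = torch.relu(self.refine_enc(z1))     # (B, N*M, T''): high-order coding
        m2 = self.refine_mask(z2)                # (B, N*M*S, T''): fine masks
        fine = m2.view(m2.size(0), self.n_spk, -1, m2.size(-1)) * z2.unsqueeze(1)
        return rough, fine                       # decoders would invert each domain

enc = CoarseToFineEncoder()
rough, fine = enc(torch.randn(4, 1, 16000))      # e.g. 1 s of 16 kHz audio
```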
Related papers
- Speech Separation with Pretrained Frontend to Minimize Domain Mismatch [42.63061599979695]
Speech separation seeks to separate individual speech signals from a speech mixture.
Most separation models are trained on synthetic data due to the unavailability of target references in real-world party scenarios.
We propose a self-supervised domain-invariant pretrained (DIP) frontend that is exposed to mixture data without the need for target reference speech.
arXiv Detail & Related papers (2024-11-05T13:30:27Z)
- Mixture Encoder Supporting Continuous Speech Separation for Meeting Recognition [15.610658840718607]
We propose a mixture encoder to mitigate the effect of artifacts introduced by speech separation.
We extend this approach to more natural meeting contexts featuring an arbitrary number of speakers and dynamic overlaps.
Our experiments show state-of-the-art performance on the LibriCSS dataset and highlight the advantages of the mixture encoder.
arXiv Detail & Related papers (2023-09-15T14:57:28Z)
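
The abstract above does not spell out the fusion mechanism, so the following is only a hypothetical sketch of the general idea: encode the raw mixture alongside each separated stream and fuse the two encodings, so the recognizer can fall back on unprocessed evidence when separation artifacts corrupt a stream. The module choices (GRU encoders, concat-then-project fusion) and all sizes are our assumptions.

```python
import torch
import torch.nn as nn

sep_encoder = nn.GRU(80, 256, batch_first=True)   # encodes one separated stream
mix_encoder = nn.GRU(80, 256, batch_first=True)   # encodes the original mixture
fuse = nn.Linear(512, 256)                        # concat-then-project fusion

stream = torch.randn(1, 100, 80)                  # separated-stream features
mixture = torch.randn(1, 100, 80)                 # mixture features, same frames
h_sep, _ = sep_encoder(stream)
h_mix, _ = mix_encoder(mixture)
fused = fuse(torch.cat([h_sep, h_mix], dim=-1))   # would feed the ASR decoder
```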
- Multi-channel Speech Separation Using Spatially Selective Deep Non-linear Filters [21.672683390080106]
In a multi-channel separation task with multiple speakers, we aim to recover all individual speech signals from the mixture.
We propose a deep neural network based spatially selective filter (SSF) that can be spatially steered to extract the speaker of interest.
arXiv Detail & Related papers (2023-04-24T11:44:00Z)
- Single-channel speech separation using Soft-minimum Permutation Invariant Training [60.99112031408449]
A long-standing problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z)
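
For context, below is a minimal sketch of standard utterance-level PIT, which takes a hard minimum over all output-label permutations; the paper's soft-minimum relaxation (not reproduced here) replaces that hard minimum with a more efficient probabilistic assignment. The `pit_mse` helper and all shapes are our illustrative assumptions.

```python
from itertools import permutations
import torch

def pit_mse(est, ref):
    """est, ref: (speakers, samples). Loss under the best label assignment."""
    n = est.size(0)
    losses = []
    for perm in permutations(range(n)):            # n! candidate assignments
        losses.append(torch.stack(
            [torch.mean((est[i] - ref[j]) ** 2) for i, j in enumerate(perm)]
        ).mean())
    return torch.stack(losses).min()               # hard min over permutations

est = torch.randn(2, 16000, requires_grad=True)    # two estimated sources
ref = torch.randn(2, 16000)                        # two reference sources
pit_mse(est, ref).backward()                       # gradients flow through min
```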
- Continuous Speech Separation with Ad Hoc Microphone Arrays [35.87274524040486]
Speech separation has been shown effective for multi-talker speech recognition.
In this paper, we extend this approach to continuous speech separation.
Two methods are proposed to mitigate a speech duplication problem during single-talker segments.
arXiv Detail & Related papers (2021-03-03T13:01:08Z)
- DEAAN: Disentangled Embedding and Adversarial Adaptation Network for Robust Speaker Representation Learning [69.70594547377283]
We propose a novel framework to disentangle speaker-related and domain-specific features.
Our framework can effectively generate more speaker-discriminative and domain-invariant speaker representations.
arXiv Detail & Related papers (2020-12-12T19:46:56Z)
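
DEAAN's exact architecture is not given here; as a generic stand-in, the sketch below shows the classic adversarial recipe for this kind of disentanglement: a gradient reversal layer trains a domain discriminator while pushing the encoder toward domain-invariant speaker embeddings. All layer sizes and label counts are assumptions.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)
    @staticmethod
    def backward(ctx, grad):
        return -ctx.lam * grad, None   # flip gradients flowing to the encoder

encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 128))
spk_head = nn.Linear(128, 1000)        # speaker classifier (assumed 1000 speakers)
dom_head = nn.Linear(128, 2)           # domain discriminator (assumed 2 domains)

feats = torch.randn(8, 80)             # stand-in acoustic features
emb = encoder(feats)
spk_loss = nn.functional.cross_entropy(spk_head(emb), torch.randint(1000, (8,)))
# The domain loss trains the discriminator but, via the reversal, makes the
# encoder's speaker embeddings harder to classify by domain, i.e. domain-invariant.
dom_loss = nn.functional.cross_entropy(
    dom_head(GradReverse.apply(emb, 0.1)), torch.randint(2, (8,)))
(spk_loss + dom_loss).backward()
```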
- On End-to-end Multi-channel Time Domain Speech Separation in Reverberant Environments [33.79711018198589]
This paper introduces a new method for multi-channel time domain speech separation in reverberant environments.
A fully-convolutional neural network structure has been used to directly separate speech from multiple microphone recordings.
To reduce the influence of reverberation on spatial feature extraction, a dereverberation pre-processing method has been applied.
arXiv Detail & Related papers (2020-11-11T18:25:07Z)
- Continuous Speech Separation with Conformer [60.938212082732775]
We use transformer and conformer architectures in lieu of recurrent neural networks in the separation system.
We believe that capturing global information with self-attention is crucial for speech separation.
arXiv Detail & Related papers (2020-08-13T09:36:05Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the deep clustering (DC) network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
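
A hypothetical sketch of the two-stage idea summarized above: a deep-clustering-style network produces embeddings for the denoising stage, and a second network regresses the anechoic speech directly instead of running K-means on the embeddings. Shapes and layer sizes are our assumptions.

```python
import torch
import torch.nn as nn

class TwoStage(nn.Module):
    def __init__(self, freq=257, emb_dim=20):
        super().__init__()
        # Denoising stage: DC-style per-frame embeddings (one per T-F bin).
        self.embed = nn.Sequential(nn.Linear(freq, 400), nn.Tanh(),
                                   nn.Linear(400, freq * emb_dim))
        # Dereverberation stage: a network replaces unsupervised K-means and
        # maps the embeddings to an anechoic magnitude estimate.
        self.derev = nn.Sequential(nn.Linear(freq * emb_dim, 400), nn.Tanh(),
                                   nn.Linear(400, freq))

    def forward(self, noisy_mag):        # noisy_mag: (B, T, F) magnitude frames
        emb = self.embed(noisy_mag)      # (B, T, F*D) deep embedding features
        anechoic = self.derev(emb)       # (B, T, F) direct regression, no K-means
        return emb, anechoic

net = TwoStage()
emb, anechoic = net(torch.randn(2, 50, 257))   # 50 STFT frames per utterance
```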
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)