Time-Domain Mapping Based Single-Channel Speech Separation With
Hierarchical Constraint Training
- URL: http://arxiv.org/abs/2110.10593v1
- Date: Wed, 20 Oct 2021 14:42:50 GMT
- Title: Time-Domain Mapping Based Single-Channel Speech Separation With
Hierarchical Constraint Training
- Authors: Chenyang Gao, Yue Gu, and Ivan Marsic
- Abstract summary: Single-channel speech separation is required for multi-speaker speech recognition.
Recent deep learning-based approaches have focused on the time-domain audio separation net (TasNet).
We introduce attention-augmented DPRNN (AttnAugDPRNN), which directly approximates the clean sources from the mixture for speech separation.
- Score: 10.883458728718047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single-channel speech separation is required for multi-speaker speech
recognition. Recent deep learning-based approaches focused on time-domain audio
separation net (TasNet) because it has superior performance and lower latency
compared to the conventional time-frequency-based (T-F-based) approaches. Most
of these works rely on the masking-based method that estimates a linear mapping
function (mask) for each speaker. However, the other commonly used method, the
mapping-based method that is less sensitive to SNR variations, is inadequately
studied in the time domain. We explore the potential of the mapping-based
method by introducing attention augmented DPRNN (AttnAugDPRNN) which directly
approximates the clean sources from the mixture for speech separation.
Permutation Invariant Training (PIT) has been a paradigm to solve the label
ambiguity problem for speech separation but usually leads to suboptimal
performance. To solve this problem, we propose an efficient training strategy
called Hierarchical Constraint Training (HCT) to regularize the training, which
could effectively improve the model performance. When using PIT, our results
showed that mapping-based AttnAugDPRNN outperformed masking-based AttnAugDPRNN
when the training corpus is large. Mapping-based AttnAugDPRNN with HCT
significantly improved the SI-SDR by 10.1% compared to the masking-based
AttnAugDPRNN without HCT.
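The abstract relies on two building blocks, SI-SDR and Permutation Invariant Training (PIT). As a rough sketch only (this is not the paper's implementation; the function names and the NumPy formulation are my own), standard PIT enumerates all speaker permutations and picks the assignment with the best average SI-SDR:

```python
import itertools
import numpy as np

def si_sdr(est, ref, eps=1e-8):
    """Scale-invariant SDR in dB between an estimated and a reference signal."""
    ref = ref - ref.mean()
    est = est - est.mean()
    # Project the estimate onto the reference to remove any scaling.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10.0 * np.log10((np.dot(s_target, s_target) + eps) /
                           (np.dot(e_noise, e_noise) + eps))

def pit_loss(estimates, references):
    """Negative SI-SDR minimized over all output-label permutations (PIT).

    Returns (loss, best_permutation), where best_permutation[i] is the
    index of the estimate assigned to reference i.
    """
    n = len(references)
    best_score, best_perm = -np.inf, None
    for perm in itertools.permutations(range(n)):
        score = np.mean([si_sdr(estimates[p], references[i])
                         for i, p in enumerate(perm)])
        if score > best_score:
            best_score, best_perm = score, perm
    return -best_score, best_perm
```

Exhaustive enumeration scales factorially with the number of speakers, one reason vanilla PIT can be inefficient or lead to suboptimal training, which is the motivation for regularization strategies such as the paper's HCT.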
Related papers
- Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising [15.152748065111194]
This paper describes speech enhancement for real-time automatic speech recognition in real environments.
It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming.
The performance of such a supervised approach, however, is drastically degraded under mismatched conditions.
arXiv Detail & Related papers (2024-10-30T08:32:47Z) - Policy Gradient-Driven Noise Mask [3.69758875412828]
We propose a novel pretraining pipeline that learns to generate conditional noise masks specifically tailored to improve performance on multi-modal and multi-organ datasets.
A key aspect is that the policy network's role is limited to obtaining an intermediate (or heated) model before fine-tuning.
Results demonstrate that fine-tuning the intermediate models consistently outperforms conventional training algorithms on both classification and generalization to unseen concept tasks.
arXiv Detail & Related papers (2024-04-29T23:53:42Z) - Efficient Ensemble for Multimodal Punctuation Restoration using
Time-Delay Neural Network [1.006218778776515]
Punctuation restoration plays an essential role in the post-processing procedure of automatic speech recognition.
We present EfficientPunct, an ensemble method with a multimodal time-delay neural network.
It outperforms the current best model by 1.0 F1 points, using less than a tenth of its inference network parameters.
arXiv Detail & Related papers (2023-02-26T18:28:20Z) - A DNN based Normalized Time-frequency Weighted Criterion for Robust
Wideband DoA Estimation [24.175086158375464]
We propose a normalized time-frequency weighted criterion which minimizes the distance between the candidate steering vectors and the filtered snapshots in the T-F domain.
Our method requires no eigendecomposition and uses a simple normalization to prevent the optimization objective from being misled by noisy snapshots.
Experiments show that the proposed method outperforms popular DNN based DoA estimation methods including widely used subspace methods in noisy and reverberant environments.
arXiv Detail & Related papers (2023-02-20T18:26:52Z) - Single-channel speech separation using Soft-minimum Permutation
Invariant Training [60.99112031408449]
A long-lasting problem in supervised speech separation is finding the correct label for each separated speech signal.
Permutation Invariant Training (PIT) has been shown to be a promising solution in handling the label ambiguity problem.
In this work, we propose a probabilistic optimization framework to address the inefficiency of PIT in finding the best output-label assignment.
arXiv Detail & Related papers (2021-11-16T17:25:05Z) - Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z) - On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves less than 50 ms of word-timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Fast accuracy estimation of deep learning based multi-class musical
source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
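The oracle principle above can be illustrated with a generic ideal-ratio-mask (IRM) sketch. This is a hedged, minimal example on magnitude spectrograms (not the authors' code; the function names are my own), assuming the mixture magnitude is approximately the sum of the source magnitudes:

```python
import numpy as np

def ideal_ratio_mask(source_mags, eps=1e-8):
    """Ideal ratio mask per source, computed from the magnitude
    spectrograms of all sources that make up the mixture."""
    total = sum(source_mags) + eps
    return [m / total for m in source_mags]

def oracle_separate(mixture_mag, source_mags):
    """Apply the oracle IRM to the mixture magnitude, giving the
    upper-bound separation used as a separability proxy."""
    return [mask * mixture_mag for mask in ideal_ratio_mask(source_mags)]
```

Because the masks are computed from the ground-truth sources, the result bounds what any mask-estimating network could achieve on that data, which is why it works as a training-free proxy.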
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by
Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean azimuth error of 13 degrees, which surpasses the accuracy of other biologically plausible neuromorphic approaches for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z) - Communication-Efficient Distributed Stochastic AUC Maximization with
Deep Neural Networks [50.42141893913188]
We study distributed stochastic AUC maximization for large-scale data with a deep neural network.
Our algorithm requires far fewer communication rounds while retaining the same theoretical convergence guarantee.
Experiments on several datasets demonstrate the effectiveness of our algorithm and confirm our theory.
arXiv Detail & Related papers (2020-05-05T18:08:23Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.