On Data Sampling Strategies for Training Neural Network Speech
Separation Models
- URL: http://arxiv.org/abs/2304.07142v2
- Date: Fri, 16 Jun 2023 13:42:10 GMT
- Title: On Data Sampling Strategies for Training Neural Network Speech
Separation Models
- Authors: William Ravenscroft and Stefan Goetze and Thomas Hain
- Abstract summary: Speech separation is an important area of multi-speaker signal processing.
Deep neural network (DNN) models have attained the best performance on many speech separation benchmarks.
Some of these models can take significant time to train and have high memory requirements.
Previous work has proposed shortening training examples to address these issues but the impact of this on model performance is not yet well understood.
- Score: 26.94528951545861
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Speech separation remains an important area of multi-speaker signal
processing. Deep neural network (DNN) models have attained the best performance
on many speech separation benchmarks. Some of these models can take significant
time to train and have high memory requirements. Previous work has proposed
shortening training examples to address these issues but the impact of this on
model performance is not yet well understood. In this work, the impact of
applying these training signal length (TSL) limits is analysed for two speech
separation models: SepFormer, a transformer model, and Conv-TasNet, a
convolutional model. The WSJ0-2Mix, WHAMR and Libri2Mix datasets are analysed
in terms of signal length distribution and its impact on training efficiency.
It is demonstrated that, for specific distributions, applying specific TSL
limits results in better performance. This is shown to be mainly due to
randomly sampling the start index of the waveforms resulting in more unique
examples for training. A SepFormer model trained using a TSL limit of 4.42s and
dynamic mixing (DM) is shown to match the best-performing SepFormer model
trained with DM and unlimited signal lengths. Furthermore, the 4.42s TSL limit
results in a 44% reduction in training time with WHAMR.
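To make the TSL-limit mechanism concrete, the sketch below illustrates random start-index cropping of a training waveform, as described in the abstract. It is a minimal illustration under stated assumptions: the function name, the 8 kHz sample rate, and the choice to return short signals unchanged are illustrative choices, not the authors' implementation.

import numpy as np

def crop_to_tsl_limit(waveform, sample_rate=8000, tsl_limit_s=4.42, rng=None):
    # Minimal sketch of random start-index sampling under a TSL limit;
    # names and defaults here are assumptions, not the authors' code.
    rng = rng or np.random.default_rng()
    max_len = int(tsl_limit_s * sample_rate)
    if len(waveform) <= max_len:
        # Signals already shorter than the limit are kept as-is.
        return waveform
    # A fresh start index each epoch yields a different crop of the same
    # utterance, which is what increases the number of unique training examples.
    start = rng.integers(0, len(waveform) - max_len + 1)
    return waveform[start:start + max_len]

# Example: an 8 kHz, 6-second mixture cropped to the paper's 4.42 s limit.
mixture = np.random.randn(6 * 8000)
print(crop_to_tsl_limit(mixture).shape)  # (35360,)

In the paper, this kind of cropping is combined with dynamic mixing, so each epoch sees both new crops and freshly mixed utterances.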
Related papers
- Transferable Post-training via Inverse Value Learning [83.75002867411263]
We propose modeling changes at the logits level during post-training using a separate neural network (i.e., the value network)
After training this network on a small base model using demonstrations, this network can be seamlessly integrated with other pre-trained models during inference.
We demonstrate that the resulting value network has broad transferability across pre-trained models of different parameter sizes.
arXiv Detail & Related papers (2024-10-28T13:48:43Z) - TRAK: Attributing Model Behavior at Scale [79.56020040993947]
We present TRAK (Tracing with the Randomly-projected After Kernel), a data attribution method that is both effective and computationally tractable for large-scale, differentiable models.
arXiv Detail & Related papers (2023-03-24T17:56:22Z) - MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module [3.42658286826597]
We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS)
We propose a novel approach where the Probabilistic Linear Discriminant Analysis (PLDA) generative model is used on top of an embedding.
We show that PLDA works well with a non-finetuned SSL model when trained only on 136 utterances.
arXiv Detail & Related papers (2023-01-17T18:53:15Z) - Training Trajectories of Language Models Across Scales [99.38721327771208]
Scaling up language models has led to unprecedented performance gains.
How do language models of different sizes learn during pre-training?
Why do larger language models demonstrate more desirable behaviors?
arXiv Detail & Related papers (2022-12-19T19:16:29Z) - Model Extraction Attack against Self-supervised Speech Models [52.81330435990717]
Self-supervised learning (SSL) speech models generate meaningful representations of given clips.
Model extraction attack (MEA) often refers to an adversary stealing the functionality of the victim model with only query access.
We study the MEA problem against SSL speech model with a small number of queries.
arXiv Detail & Related papers (2022-11-29T09:28:05Z) - Evidence of Vocal Tract Articulation in Self-Supervised Learning of
Speech [15.975756437343742]
Recent self-supervised learning (SSL) models have proven to learn rich representations of speech.
We conduct a comprehensive analysis to link speech representations to articulatory trajectories measured by electromagnetic articulography (EMA)
Our findings suggest that SSL models learn to align closely with continuous articulations, and provide a novel insight into speech SSL.
arXiv Detail & Related papers (2022-10-21T04:24:29Z) - Match to Win: Analysing Sequences Lengths for Efficient Self-supervised
Learning in Speech and Audio [19.865050806327147]
Self-supervised learning has proven vital in speech and audio-related applications.
This paper provides the first empirical study of SSL pre-training for different specified sequence lengths.
We find that training on short sequences can dramatically reduce resource costs while retaining a satisfactory performance for all tasks.
arXiv Detail & Related papers (2022-09-30T16:35:42Z) - METRO: Efficient Denoising Pretraining of Large Scale Autoencoding
Language Models with Model Generated Signals [151.3601429216877]
We present an efficient method of pretraining large-scale autoencoding language models using training signals generated by an auxiliary model.
We propose a recipe, namely "Model generated dEnoising TRaining Objective" (METRO)
The resultant models, METRO-LM, consisting of up to 5.4 billion parameters, achieve new state-of-the-art on the GLUE, SuperGLUE, and SQuAD benchmarks.
arXiv Detail & Related papers (2022-04-13T21:39:15Z) - Tackling the Problem of Limited Data and Annotations in Semantic
Segmentation [1.0152838128195467]
To tackle the problem of limited data annotations in image segmentation, different pre-trained models and CRF based methods are applied.
To this end, RotNet, DeeperCluster, and Semi&Weakly Supervised Learning (SWSL) pre-trained models are transferred and finetuned in a DeepLab-v2 baseline.
The results of my study show that, on this small dataset, using a pre-trained ResNet50 SWSL model gives results that are 7.4% better than applying an ImageNet pre-trained model.
arXiv Detail & Related papers (2020-07-14T21:11:11Z) - Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance on a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.