Robustness of Speech Separation Models for Similar-pitch Speakers
- URL: http://arxiv.org/abs/2407.15749v1
- Date: Mon, 22 Jul 2024 15:55:08 GMT
- Authors: Bunlong Lay, Sebastian Zaczek, Kristina Tesch, Timo Gerkmann
- Abstract summary: Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments.
This paper investigates the robustness of state-of-the-art Neural Network models in scenarios where the pitch differences between speakers are minimal.
- Score: 14.941946672578863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single-channel speech separation is a crucial task for enhancing speech recognition systems in multi-speaker environments. This paper investigates the robustness of state-of-the-art neural network models in scenarios where the pitch differences between speakers are minimal. Building on earlier findings by Ditter and Gerkmann, which identified a significant performance drop for the 2018 Chimera++ network under similar-pitch conditions, our study extends the analysis to more recent and sophisticated neural network models. Our experiments reveal that modern models have substantially reduced the performance gap under matched training and testing conditions. However, a substantial gap persists under mismatched conditions: models perform well for large pitch differences but degrade when the speakers' pitches are similar. These findings motivate further research into the generalizability of speech separation models to similar-pitch speakers and unseen data.
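The central variable in the abstract is the fundamental-frequency (F0) difference between two speakers. As an illustrative sketch only (not the paper's method), the following crude autocorrelation-based estimator shows how such a pitch difference could be quantified in semitones; the signals, sample rate, and F0 values are hypothetical:

```python
import numpy as np

def estimate_f0(x, sr, fmin=60.0, fmax=400.0):
    """Crude autocorrelation-based F0 estimate (illustration only)."""
    x = x - x.mean()
    corr = np.correlate(x, x, mode="full")[len(x) - 1:]
    lag_lo = int(sr / fmax)          # shortest period considered
    lag_hi = int(sr / fmin)          # longest period considered
    best_lag = lag_lo + np.argmax(corr[lag_lo:lag_hi])
    return sr / best_lag

sr = 8000
t = np.arange(sr // 2) / sr          # 0.5 s of signal
spk_a = np.sin(2 * np.pi * 120.0 * t)   # hypothetical speaker A, F0 = 120 Hz
spk_b = np.sin(2 * np.pi * 124.0 * t)   # hypothetical speaker B, F0 = 124 Hz

f0_a = estimate_f0(spk_a, sr)
f0_b = estimate_f0(spk_b, sr)
# Pitch difference in semitones; a fraction of a semitone counts as "similar pitch".
pitch_diff_semitones = 12 * np.log2(f0_b / f0_a)
print(f0_a, f0_b, pitch_diff_semitones)
```

For these two tones the estimated difference is well under one semitone, the regime in which the paper reports the largest performance drop.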
Related papers
- Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler [16.13996677489119]
Diffusion models are a new class of generative models that have recently been applied to speech enhancement successfully.
Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models.
We show that the proposed system substantially benefits from training on multiple databases and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions.
arXiv Detail & Related papers (2023-12-05T11:40:38Z)
- ELF: Encoding Speaker-Specific Latent Speech Feature for Speech Synthesis [5.824018496599849]
We propose a novel method for modeling numerous speakers.
It captures the overall characteristics of speakers in as much detail as a trained multi-speaker model.
arXiv Detail & Related papers (2023-11-20T13:13:24Z)
- Machine learning of network inference enhancement from noisy measurements [13.0533106097336]
Inferring networks from observed time series data offers a clear view of the interconnections among nodes.
Network inference models, when dealing with real-world open cases, experience a sharp decline in performance.
We present an elegant and efficient model-agnostic framework tailored to amplify the capabilities of model-based and model-free network inference models.
arXiv Detail & Related papers (2023-09-05T08:51:40Z)
- Analysing the Impact of Audio Quality on the Use of Naturalistic Long-Form Recordings for Infant-Directed Speech Research [62.997667081978825]
Modelling of early language acquisition aims to understand how infants bootstrap their language skills.
Recent developments have enabled the use of more naturalistic training data for computational models.
It is currently unclear how the sound quality could affect analyses and modelling experiments conducted on such data.
arXiv Detail & Related papers (2023-05-03T08:25:37Z)
- Automatic Evaluation of Speaker Similarity [0.0]
We introduce a new automatic evaluation method for speaker similarity assessment, consistent with human perceptual scores.
Our experiments show that we can train a model to predict speaker-similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and a significant Pearson correlation of up to 0.78 at the utterance level.
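The utterance-level Pearson score above measures linear agreement between model predictions and human ratings. As a minimal sketch with made-up numbers (the MUSHRA ratings and predictions below are hypothetical, not from the paper), the metric can be computed directly with NumPy:

```python
import numpy as np

# Hypothetical human MUSHRA ratings and model-predicted similarity
# scores for six utterances (illustration only).
human_mushra = np.array([82.0, 45.0, 67.0, 30.0, 91.0, 55.0])
predicted = np.array([78.0, 50.0, 70.0, 35.0, 88.0, 60.0])

# Pearson correlation coefficient between the two score vectors.
pearson_r = np.corrcoef(human_mushra, predicted)[0, 1]
print(pearson_r)
```

A value near 1.0 indicates that the predicted scores rank and scale utterances much like human listeners do.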
arXiv Detail & Related papers (2022-07-01T11:23:16Z)
- Self-supervised models of audio effectively explain human cortical responses to speech [71.57870452667369]
We capitalize on the progress of self-supervised speech representation learning to create new state-of-the-art models of the human auditory system.
These results show that self-supervised models effectively capture the hierarchy of information relevant to different stages of speech processing in the human cortex.
arXiv Detail & Related papers (2022-05-27T22:04:02Z)
- SepIt: Approaching a Single Channel Speech Separation Bound [99.19786288094596]
We introduce a deep neural network, SepIt, that iteratively refines the estimates of the different speakers.
In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
arXiv Detail & Related papers (2022-05-24T05:40:36Z)
- Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are produced by an encoder trained in a self-supervised fashion on pairs of adjacent segments assumed to come from the same speaker.
arXiv Detail & Related papers (2022-04-08T16:27:14Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network on the VoxCeleb data, then fine-tune part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed at the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.