Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model
- URL: http://arxiv.org/abs/2308.00010v1
- Date: Sat, 29 Jul 2023 15:10:46 GMT
- Title: Monaural Multi-Speaker Speech Separation Using Efficient Transformer Model
- Authors: S. Rijal, R. Neupane, S. P. Mainali, S. K. Regmi and S. Maharjan
- Abstract summary: "Monaural multi-speaker speech separation" presents a speech-separation model based on the Transformer architecture and its efficient forms.
The model is trained on the LibriMix dataset, which contains utterances from diverse speakers.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The cocktail party problem is the scenario in which it is difficult
to separate or distinguish individual speakers in a mixture of speech from
several speakers. There has been considerable research in this field, but model
size and complexity are typically traded off against the accuracy and
robustness of speech separation. "Monaural multi-speaker speech separation"
presents a speech-separation model based on the Transformer architecture and
its efficient forms. The model is trained on the LibriMix dataset, which
contains diverse speakers' utterances, and separates 2 distinct speaker sources
from a mixed audio input. The developed model aims to reduce the computational
complexity of speech separation with minimal tradeoff against the performance
of prevalent speech separation models, and it shows significant progress
towards that goal. This project anticipates a rise in contributions to ongoing
speech-separation research with computational efficiency at its core.
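The abstract describes the overall recipe (an encoder, an efficient-Transformer masking network, and a decoder producing two sources) but includes no code. Below is a minimal, hypothetical PyTorch sketch of that recipe: a standard nn.TransformerEncoder stands in for the paper's efficient attention variants, and every layer size, name, and hyperparameter is an illustrative assumption, not the authors' implementation.

```python
# Hypothetical sketch of a Transformer-based masking separator (2 speakers).
# All hyperparameters are illustrative; the paper's efficient attention
# variants would replace the vanilla nn.TransformerEncoder used here.
import torch
import torch.nn as nn

class TransformerSeparator(nn.Module):
    def __init__(self, n_speakers=2, n_filters=64, kernel_size=16, stride=8,
                 n_layers=4, n_heads=4):
        super().__init__()
        self.n_speakers = n_speakers
        # 1-D conv front end: waveform -> feature frames.
        self.encoder = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
        layer = nn.TransformerEncoderLayer(d_model=n_filters, nhead=n_heads,
                                           batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)
        # One sigmoid mask per speaker over the encoder features.
        self.mask_head = nn.Sequential(
            nn.Linear(n_filters, n_filters * n_speakers), nn.Sigmoid())
        # Transposed-conv decoder: masked features -> waveform.
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size, stride=stride)

    def forward(self, mix):                        # mix: (batch, samples)
        feats = self.encoder(mix.unsqueeze(1))     # (batch, filters, frames)
        h = self.transformer(feats.transpose(1, 2))  # (batch, frames, filters)
        masks = self.mask_head(h).view(
            mix.size(0), -1, self.n_speakers, feats.size(1))
        sources = []
        for s in range(self.n_speakers):
            masked = feats * masks[:, :, s, :].transpose(1, 2)
            sources.append(self.decoder(masked).squeeze(1))
        return torch.stack(sources, dim=1)         # (batch, speakers, samples)

mix = torch.randn(2, 16000)                        # two 1-second mixtures at 16 kHz
print(TransformerSeparator()(mix).shape)           # torch.Size([2, 2, 16000])
```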
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, with average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Mixture Encoder for Joint Speech Separation and Recognition [15.13598115379631]
Multi-speaker automatic speech recognition is crucial for many real-world applications.
Existing approaches can be divided into modular and end-to-end methods.
End-to-end models process overlapped speech directly in a single, powerful neural network.
arXiv Detail & Related papers (2023-06-21T11:01:31Z)
- Multi-Dimensional and Multi-Scale Modeling for Speech Separation Optimized by Discriminative Learning [9.84949849886926]
The paper proposes Intra-SE-Conformer and Inter-Transformer (ISCIT) for speech separation.
The new SE-Conformer network can model audio sequences in multiple dimensions and scales.
arXiv Detail & Related papers (2023-03-07T08:53:20Z)
- Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation [99.19786288094596]
We show how the upper bound can be generalized to the case of random generative models.
We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks.
arXiv Detail & Related papers (2023-01-25T18:21:51Z)
- Directed Speech Separation for Automatic Speech Recognition of Long Form Conversational Speech [10.291482850329892]
We propose a speaker conditioned separator trained on speaker embeddings extracted directly from the mixed signal.
We achieve significant improvements in word error rate (WER) on real conversational data without the need for an additional re-stitching step.
arXiv Detail & Related papers (2021-12-10T23:07:48Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with unknown numbers of speakers, using variable-number permutation-invariant cross-entropy based loss functions (see the permutation-invariant loss sketch after this list).
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Guided Training: A Simple Method for Single-channel Speaker Separation [40.34570426165019]
We propose a strategy to train a long short-term memory (LSTM) model to solve the permutation problem in speaker separation.
Due to the powerful capability on sequence modeling, LSTM can use its memory cells to track and separate target speech from interfering speech.
arXiv Detail & Related papers (2021-03-26T08:46:50Z)
- Audio-visual Speech Separation with Adversarially Disentangled Visual Representation [23.38624506211003]
Speech separation aims to separate individual voice from an audio mixture of multiple simultaneous talkers.
In our model, we use the face detector to detect the number of speakers in the scene and use visual information to avoid the permutation problem.
Our proposed model is shown to outperform the state-of-the-art audio-only model and three audio-visual models.
arXiv Detail & Related papers (2020-11-29T10:48:42Z)
- Filling the Gap of Utterance-aware and Speaker-aware Representation for Multi-turn Dialogue [76.88174667929665]
A multi-turn dialogue is composed of multiple utterances from two or more different speaker roles.
In the existing retrieval-based multi-turn dialogue modeling, the pre-trained language models (PrLMs) as encoder represent the dialogues coarsely.
We propose a novel model to fill such a gap by modeling the effective utterance-aware and speaker-aware representations entailed in a dialogue history.
arXiv Detail & Related papers (2020-09-14T15:07:19Z)
- Unsupervised Cross-lingual Representation Learning for Speech Recognition [63.85924123692923]
XLSR learns cross-lingual speech representations by pretraining a single model from the raw waveform of speech in multiple languages.
We build on wav2vec 2.0 which is trained by solving a contrastive task over masked latent speech representations.
Experiments show that cross-lingual pretraining significantly outperforms monolingual pretraining.
arXiv Detail & Related papers (2020-06-24T18:25:05Z)
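Several entries above (the diarization, guided-training, and audio-visual papers) revolve around the permutation problem: a separator's outputs carry no fixed speaker order, so the loss must be computed under the best output-to-target assignment. The following is a minimal, hypothetical PyTorch sketch of a permutation-invariant training (PIT) loss for the two-speaker case; the MSE criterion and function name are illustrative assumptions, not taken from any paper listed here.

```python
# Permutation-invariant training (PIT) loss: evaluate the loss under every
# output-to-target assignment and keep the cheapest one per example.
from itertools import permutations
import torch

def pit_mse_loss(est, ref):
    """est, ref: (batch, speakers, samples); returns mean best-permutation MSE."""
    losses = []
    for perm in permutations(range(est.size(1))):
        # MSE against the references reordered by this candidate assignment.
        losses.append(((est - ref[:, list(perm), :]) ** 2).mean(dim=(1, 2)))
    # (n_permutations, batch) -> best assignment per mixture, then batch mean.
    return torch.stack(losses).min(dim=0).values.mean()

est = torch.randn(4, 2, 16000)  # hypothetical separator outputs
ref = torch.randn(4, 2, 16000)  # ground-truth sources
print(pit_mse_loss(est, ref))
```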