SepIt: Approaching a Single Channel Speech Separation Bound
- URL: http://arxiv.org/abs/2205.11801v4
- Date: Sun, 21 May 2023 09:32:04 GMT
- Title: SepIt: Approaching a Single Channel Speech Separation Bound
- Authors: Shahar Lutati, Eliya Nachmani, Lior Wolf
- Abstract summary: We introduce a Deep neural network, SepIt, that iteratively improves the different speakers' estimation.
In an extensive set of experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3, 5, and 10 speakers.
- Score: 99.19786288094596
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present an upper bound for the Single Channel Speech Separation task,
which is based on an assumption regarding the nature of short segments of
speech. Using the bound, we are able to show that while the recent methods have
made significant progress for a few speakers, there is room for improvement for
five and ten speakers. We then introduce a deep neural network, SepIt, that
iteratively improves the estimates of the different speakers. At test time, SepIt
has a varying number of iterations per test sample, based on a mutual
information criterion that arises from our analysis. In an extensive set of
experiments, SepIt outperforms the state-of-the-art neural networks for 2, 3,
5, and 10 speakers.
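The test-time loop described in the abstract reduces to a simple pattern: refine, measure the information gained, and stop once the gain is negligible. Below is a minimal sketch of that pattern, assuming hypothetical `refine_step` and `mi_gain` callables standing in for the paper's network and mutual information criterion; it is illustrative, not the authors' implementation.

```python
import numpy as np

def separate_iteratively(refine_step, mi_gain, mixture, estimates,
                         max_iters=5, mi_threshold=0.01):
    """mixture: (T,) waveform; estimates: (num_speakers, T) initial separation."""
    for _ in range(max_iters):
        improved = refine_step(mixture, estimates)  # one refinement pass
        if mi_gain(improved, estimates) < mi_threshold:
            break  # the step no longer adds information; stop early
        estimates = improved
    return estimates

# Dummy usage with placeholder callables (purely illustrative).
rng = np.random.default_rng(0)
mix = rng.standard_normal(16000)
est = rng.standard_normal((2, 16000))
refine = lambda m, e: e + 0.1 * (m - e.sum(axis=0)) / e.shape[0]
gain = lambda a, b: float(np.mean((a - b) ** 2))
print(separate_iteratively(refine, gain, mix, est).shape)  # (2, 16000)
```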
Related papers
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering [78.2927924732142]
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
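As a rough illustration of the clustering idea, the sketch below assumes a SwAV-style swapped-prediction objective between an utterance and a speaker-perturbed view; the function name, codebook, and loss form are my assumptions, not Spin's published objective.

```python
import torch.nn.functional as F

def spin_style_loss(feats_a, feats_b, codebook, temp=0.1):
    """feats_a/feats_b: (T, D) frame features from the original and a
    speaker-perturbed view; codebook: (K, D) learnable cluster centers."""
    logits_a = F.normalize(feats_a, dim=-1) @ F.normalize(codebook, dim=-1).t() / temp
    logits_b = F.normalize(feats_b, dim=-1) @ F.normalize(codebook, dim=-1).t() / temp
    # Swapped prediction: each view predicts the other's soft cluster
    # assignment, so clusters must ignore the speaker change and keep content.
    loss_a = -(logits_b.softmax(-1).detach() * logits_a.log_softmax(-1)).sum(-1).mean()
    loss_b = -(logits_a.softmax(-1).detach() * logits_b.log_softmax(-1)).sum(-1).mean()
    return loss_a + loss_b
```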
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- BASEN: Time-Domain Brain-Assisted Speech Enhancement Network with Convolutional Cross Attention in Multi-talker Conditions [36.15815562576836]
Time-domain single-channel speech enhancement (SE) remains challenging in multi-talker conditions, where the target speaker must be extracted without prior information.
We propose a novel time-domain brain-assisted SE network (BASEN) incorporating electroencephalography (EEG) signals recorded from the listener for extracting the target speaker from monaural speech mixtures.
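A minimal sketch of how cross attention could fuse the two modalities, assuming standard multi-head attention with audio features as queries and EEG features as keys and values; this is an illustration, not the BASEN architecture.

```python
import torch.nn as nn

class EEGCrossAttention(nn.Module):
    """Audio features attend to time-aligned EEG features from the listener."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, audio_feats, eeg_feats):
        # audio_feats: (B, T, D); eeg_feats: (B, T_eeg, D)
        fused, _ = self.attn(query=audio_feats, key=eeg_feats, value=eeg_feats)
        return audio_feats + fused  # residual fusion of the EEG cue
```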
arXiv Detail & Related papers (2023-05-17T06:40:31Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
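One natural reading of the single-label reformulation is a power-set encoding, where each combination of active speakers becomes one class; the sketch below illustrates that encoding, though the exact scheme is my assumption, not necessarily SEND's.

```python
from itertools import combinations

def powerset_labels(num_speakers):
    """Map each subset of active speakers to one class id."""
    subsets = [()]
    for k in range(1, num_speakers + 1):
        subsets += list(combinations(range(num_speakers), k))
    return {s: i for i, s in enumerate(subsets)}

labels = powerset_labels(3)  # 8 classes for 3 speakers
print(labels[(0, 2)])        # overlap of speakers 0 and 2 -> a single label
```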
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- You Only Hear Once: A YOLO-like Algorithm for Audio Segmentation and Sound Event Detection [0.0]
We present a novel approach called You Only Hear Once (YOHO), which converts the detection of acoustic boundaries into a regression problem instead of frame-based classification.
YOHO obtained a higher F-measure and a lower error rate than the state-of-the-art Convolutional Recurrent Neural Network.
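To make the regression formulation concrete, the sketch below encodes acoustic events as per-bin [presence, relative start, relative end] targets, YOLO-style; the exact target layout is an assumption, not YOHO's released code.

```python
import numpy as np

def encode_targets(events, num_bins, clip_len, num_classes):
    """events: list of (class_id, start_s, end_s).
    Returns (num_bins, num_classes, 3) regression targets per bin and class."""
    targets = np.zeros((num_bins, num_classes, 3), dtype=np.float32)
    bin_len = clip_len / num_bins
    for cls, start, end in events:
        first = int(start // bin_len)
        last = int(min(end, clip_len - 1e-6) // bin_len)
        for b in range(first, last + 1):
            b0 = b * bin_len
            targets[b, cls, 0] = 1.0                                      # event present
            targets[b, cls, 1] = (max(start, b0) - b0) / bin_len          # start within bin
            targets[b, cls, 2] = (min(end, b0 + bin_len) - b0) / bin_len  # end within bin
    return targets

t = encode_targets([(0, 0.4, 1.3)], num_bins=4, clip_len=2.0, num_classes=2)
print(t[:, 0])  # bins 0-2 carry the regressed boundaries of the event
```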
arXiv Detail & Related papers (2021-09-01T12:50:16Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art non-streaming model (10.3%).
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- Single channel voice separation for unknown number of speakers under reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
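A minimal sketch of the multi-head design, assuming a shared backbone, one separation head per supported speaker count, and a classification branch that selects the head at inference; all names and shapes here are illustrative, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MultiHeadSeparator(nn.Module):
    """Shared backbone with one separation head per supported speaker count,
    plus a classification branch that predicts which head to use."""
    def __init__(self, dim=128, counts=(2, 3, 4, 5)):
        super().__init__()
        self.backbone = nn.Conv1d(1, dim, kernel_size=16, stride=8)
        self.heads = nn.ModuleDict(
            {str(c): nn.Conv1d(dim, c, kernel_size=1) for c in counts})
        self.count_cls = nn.Linear(dim, len(counts))
        self.counts = counts

    def forward(self, mixture):
        # mixture: (1, 1, T) waveform (single-sample demo for head selection)
        feats = self.backbone(mixture)               # (1, dim, T')
        logits = self.count_cls(feats.mean(dim=-1))  # speaker-count logits
        c = self.counts[int(logits.argmax(dim=-1))]  # pick the matching head
        return self.heads[str(c)](feats), logits     # (1, c, T') source maps

model = MultiHeadSeparator()
out, logits = model(torch.randn(1, 1, 800))
print(out.shape, logits.shape)
```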
arXiv Detail & Related papers (2020-11-04T14:59:14Z)
- HarperValleyBank: A Domain-Specific Spoken Dialog Corpus [7.331287001215395]
HarperValleyBank is a free, public domain spoken dialog corpus.
The data simulate simple consumer banking interactions, containing about 23 hours of audio from 1,446 human-human conversations.
arXiv Detail & Related papers (2020-10-26T22:16:52Z)
- Voice Separation with an Unknown Number of Multiple Speakers [113.91855071999298]
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while keeping the speaker in each output channel fixed.
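One way to keep each output channel tied to a speaker across steps is to choose a single permutation on the final step and reuse it at every intermediate step, as in the sketch below; this is my reading of the abstract, not the released training code.

```python
from itertools import permutations
import torch

def multistep_pit_loss(step_estimates, targets):
    """step_estimates: list of (B, S, T) tensors, one per processing step;
    targets: (B, S, T). The permutation chosen on the final step is reused
    for every step, so each output channel stays tied to one speaker."""
    final = step_estimates[-1]
    B, S, _ = final.shape
    best = []
    for b in range(B):
        perms = list(permutations(range(S)))
        errs = [sum(torch.mean((final[b, i] - targets[b, j]) ** 2)
                    for i, j in enumerate(p)) for p in perms]
        best.append(perms[int(torch.stack(errs).argmin())])
    loss = 0.0
    for est in step_estimates:  # same permutation applied at every step
        for b, p in enumerate(best):
            loss = loss + sum(torch.mean((est[b, i] - targets[b, j]) ** 2)
                              for i, j in enumerate(p))
    return loss / (len(step_estimates) * B)

ests = [torch.randn(2, 3, 100) for _ in range(3)]
print(float(multistep_pit_loss(ests, torch.randn(2, 3, 100))))
```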
arXiv Detail & Related papers (2020-02-29T20:02:54Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved strong performance in controlled conditions.
Speaker verification on short utterances in uncontrolled, noisy environments is one of the most challenging and in-demand tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)