Supervised Speaker Embedding De-Mixing in Two-Speaker Environment
- URL: http://arxiv.org/abs/2001.06397v2
- Date: Fri, 5 Feb 2021 15:46:54 GMT
- Title: Supervised Speaker Embedding De-Mixing in Two-Speaker Environment
- Authors: Yanpei Shi, Thomas Hain
- Abstract summary: Instead of separating a two-speaker signal in signal space, as speech source separation does, a speaker embedding de-mixing approach is proposed.
The proposed approach separates different speaker properties from a two-speaker signal in embedding space.
- Score: 37.27421131374047
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Separating different speaker properties from a multi-speaker environment is
challenging. Instead of separating a two-speaker signal in signal space, as
speech source separation does, a speaker embedding de-mixing approach is proposed.
The proposed approach separates different speaker properties from a two-speaker
signal in embedding space. The proposed approach consists of two steps. In step
one, clean speaker embeddings are learned and collected by a residual TDNN-based
network. In step two, the two-speaker signal and the embedding of one of
the speakers are both input to a speaker embedding de-mixing network. The
de-mixing network is trained to generate the embedding of the other speaker with a
reconstruction loss. Speaker identification accuracy and the cosine similarity
score between the clean embeddings and the de-mixed embeddings are used to
evaluate the quality of the obtained embeddings. Experiments are conducted on two
kinds of data: artificially augmented two-speaker data (TIMIT) and real-world
two-speaker recordings (MC-WSJ). Six different speaker embedding
de-mixing architectures are investigated. Compared with the performance on
clean speaker embeddings, one of the proposed architectures comes close,
reaching 96.9% identification accuracy and a cosine similarity of 0.89.
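As a concrete illustration of the two-step recipe above, here is a minimal PyTorch sketch, not the authors' implementation: the feed-forward de-mixing network, the 512-dimensional embeddings, and the helper names are assumptions; the paper's step-one residual TDNN extractor is treated as a frozen black box. Step two learns to reconstruct the unknown speaker's embedding from the mixture and the known speaker's embedding, and cosine similarity against the clean target serves as the evaluation score.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    EMBED_DIM = 512  # assumed embedding size, not specified here

    class DemixNet(nn.Module):
        """Step two: map (mixture representation, known speaker embedding)
        to a prediction of the other speaker's embedding."""
        def __init__(self, feat_dim=512, embed_dim=EMBED_DIM):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feat_dim + embed_dim, 1024), nn.ReLU(),
                nn.Linear(1024, 1024), nn.ReLU(),
                nn.Linear(1024, embed_dim),
            )

        def forward(self, mix_feat, known_embed):
            # Concatenate mixture features with the known embedding.
            return self.net(torch.cat([mix_feat, known_embed], dim=-1))

    def train_step(demix, extractor, optimizer, mix_batch, known_ids, target_ids, bank):
        """One reconstruction-loss update. `extractor` is the frozen step-one
        network; `bank` is a (num_speakers, EMBED_DIM) tensor of clean
        embeddings collected in step one."""
        with torch.no_grad():
            mix_feat = extractor(mix_batch)        # (B, feat_dim)
        pred = demix(mix_feat, bank[known_ids])    # (B, EMBED_DIM)
        loss = F.mse_loss(pred, bank[target_ids])  # reconstruction loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

    def cosine_score(pred, clean):
        """Evaluation: mean cosine similarity between de-mixed and clean
        embeddings (the paper reports 0.89 for its best architecture)."""
        return F.cosine_similarity(pred, clean, dim=-1).mean().item()

Speaker identification accuracy can then be obtained by scoring each de-mixed embedding against the bank of clean embeddings and taking the most similar speaker.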
Related papers
- In search of strong embedding extractors for speaker diarisation [49.7017388682077]
We tackle two key problems when adopting embedding extractors (EEs) for speaker diarisation.
First, the evaluation is not straightforward because the features required for better performance differ between speaker verification and diarisation.
We show that better performance on widely adopted speaker verification evaluation protocols does not lead to better diarisation performance.
We propose two data augmentation techniques to alleviate the second problem, making embedding extractors aware of overlapped speech or speaker-change input, as sketched below.
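A hypothetical sketch of such augmentation in PyTorch; the gain range, function names, and mixing recipe are assumptions, not the paper's procedure:

    import torch

    def augment_overlap(wav_a, wav_b, snr_db_range=(0.0, 5.0)):
        """Mix two single-speaker waveforms at a random level difference so
        the embedding extractor sees overlapped speech during training."""
        n = min(wav_a.shape[-1], wav_b.shape[-1])
        a, b = wav_a[..., :n], wav_b[..., :n]
        snr_db = torch.empty(1).uniform_(*snr_db_range)
        gain = (10.0 ** (-snr_db / 20.0)) * a.norm() / (b.norm() + 1e-8)
        return a + gain * b

    def augment_speaker_change(wav_a, wav_b):
        """Concatenate two speakers at random boundaries to simulate a
        speaker change inside one training segment."""
        cut_a = torch.randint(1, wav_a.shape[-1], (1,)).item()
        cut_b = torch.randint(1, wav_b.shape[-1], (1,)).item()
        return torch.cat([wav_a[..., :cut_a], wav_b[..., cut_b:]], dim=-1)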
arXiv Detail & Related papers (2022-10-26T13:00:29Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Time-Domain Speech Extraction with Spatial Information and Multi Speaker Conditioning Mechanism [27.19635746008699]
We present a novel multi-channel speech extraction system to simultaneously extract multiple clean individual sources from a mixture.
The proposed method is built on an improved multi-channel time-domain speech separation network.
Experiments on 2-channel WHAMR! data show that the proposed system improves source separation performance by 9% relative over a strong multi-channel baseline.
arXiv Detail & Related papers (2021-02-07T10:11:49Z)
- Single channel voice separation for unknown number of speakers under reverberant and noisy settings [106.48335929548875]
We present a unified network for voice separation of an unknown number of speakers.
The proposed approach is composed of several separation heads optimized together with a speaker classification branch, as sketched below.
We present a new noisy and reverberant dataset of up to five different speakers speaking simultaneously.
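A rough sketch of that heads-plus-classifier layout, with assumed shapes and layer choices (the paper's actual architecture is not reproduced here):

    import torch
    import torch.nn as nn

    class MultiHeadSeparator(nn.Module):
        """Assumed layout: a shared encoder feeds one separation head per
        hypothesized speaker count, plus a branch that classifies how many
        speakers are active and thereby selects the head at inference."""
        def __init__(self, feat_dim=256, max_speakers=5):
            super().__init__()
            self.encoder = nn.GRU(feat_dim, feat_dim, batch_first=True)
            self.heads = nn.ModuleList(
                nn.Linear(feat_dim, feat_dim * k)
                for k in range(2, max_speakers + 1)
            )
            self.count_head = nn.Linear(feat_dim, max_speakers - 1)

        def forward(self, feats):  # feats: (B, T, feat_dim)
            h, _ = self.encoder(feats)
            count_logits = self.count_head(h.mean(dim=1))  # speaker count
            outputs = [head(h) for head in self.heads]     # per-count outputs
            return outputs, count_logits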
arXiv Detail & Related papers (2020-11-04T14:59:14Z)
- Compositional embedding models for speaker identification and diarization with simultaneous speech from 2+ speakers [25.280566939206714]
We propose a new method for speaker diarization that can handle overlapping speech with 2+ people.
Our method is based on compositional embeddings.
arXiv Detail & Related papers (2020-10-22T15:33:36Z)
- DNN Speaker Tracking with Embeddings [0.0]
We propose a novel embedding-based speaker tracking method.
Our design is based on a convolutional neural network that mimics a typical speaker verification PLDA (probabilistic linear discriminant analysis) back-end.
To make the baseline system similar to speaker tracking, non-target speakers were added to the recordings.
arXiv Detail & Related papers (2020-07-13T18:40:14Z)
- Identify Speakers in Cocktail Parties with End-to-End Attention [48.96655134462949]
This paper presents an end-to-end system that integrates speech source extraction and speaker identification.
We propose a new way to jointly optimize these two parts by max-pooling the speaker predictions along the channel dimension, as sketched below.
End-to-end training results in a system that recognizes one speaker in a two-speaker broadcast speech mixture with 99.9% accuracy and both speakers with 93.9% accuracy.
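The channel-wise max-pooling can be illustrated in a few lines of PyTorch (shapes are assumed for illustration):

    import torch

    def pool_speaker_predictions(logits):
        """logits: (batch, channels, num_speakers) speaker predictions, one
        set per extracted channel. Max-pooling along the channel dimension
        yields a single score per speaker, so a speaker is detected if any
        channel is confident about it, letting both parts train jointly."""
        return logits.max(dim=1).values  # (batch, num_speakers)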
arXiv Detail & Related papers (2020-05-22T22:15:16Z)
- Audio-visual Multi-channel Recognition of Overlapped Speech [79.21950701506732]
This paper presents an audio-visual multi-channel overlapped speech recognition system featuring tightly integrated separation front-end and recognition back-end.
Experiments suggest that the proposed multi-channel AVSR system outperforms the baseline audio-only ASR system by up to 6.81% (26.83% relative) and 22.22% (56.87% relative) absolute word error rate (WER) reduction on overlapped speech constructed using either simulation or replaying of the Lip Reading Sentences 2 (LRS2) dataset, respectively.
arXiv Detail & Related papers (2020-05-18T10:31:19Z)
- Robust Speaker Recognition Using Speech Enhancement And Attention Model [37.33388614967888]
Instead of individually processing speech enhancement and speaker recognition, the two modules are integrated into one framework by a joint optimisation using deep neural networks.
To increase robustness against noise, a multi-stage attention mechanism is employed to highlight speaker-related features learned from context information in the time and frequency domains.
The results show that the proposed approach using speech enhancement and multi-stage attention outperforms two strong baselines that do not use them in most acoustic conditions in our experiments.
arXiv Detail & Related papers (2020-01-14T20:03:07Z)