Compositional embedding models for speaker identification and
diarization with simultaneous speech from 2+ speakers
- URL: http://arxiv.org/abs/2010.11803v2
- Date: Wed, 10 Feb 2021 15:47:18 GMT
- Authors: Zeqian Li, Jacob Whitehill
- Abstract summary: We propose a new method for speaker diarization that can handle overlapping speech with 2+ people.
Our method is based on compositional embeddings.
- Score: 25.280566939206714
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a new method for speaker diarization that can handle overlapping
speech with 2+ people. Our method is based on compositional embeddings [1]:
Like standard speaker embedding methods such as x-vector [2], compositional
embedding models contain a function f that separates speech from different
speakers. In addition, they include a composition function g to compute
set-union operations in the embedding space so as to infer the set of speakers
within the input audio. In an experiment on multi-person speaker identification
using synthesized LibriSpeech data, the proposed method outperforms traditional
embedding methods that are only trained to separate single speakers (not
speaker sets). In a speaker diarization experiment on the AMI Headset Mix
corpus, we achieve state-of-the-art accuracy (DER=22.93%), slightly better than
the previous best result (DER=23.82% from [3]).
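The interplay of f and g described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: in the paper both f and g are trained models, whereas here the embeddings that f would produce are simulated with random unit vectors, and g is replaced by a hypothetical stand-in (a normalized sum). The names `g`, `compose`, `infer_speaker_set`, and the speaker labels are all assumptions made for illustration. The sketch shows only the inference idea: compose the enrolled single-speaker embeddings for every candidate speaker set and pick the set whose composed embedding is closest to the embedding of the (possibly overlapping) input audio.

```python
from itertools import combinations

import numpy as np


def normalize(v):
    """Scale a vector to unit length."""
    return v / np.linalg.norm(v)


def g(a, b):
    """Hypothetical composition function: a normalized sum.

    Stand-in for the learned g of the paper, which is trained so that
    g(f(x_a), f(x_b)) approximates the embedding of audio containing
    both speakers.
    """
    return normalize(a + b)


def compose(embeddings):
    """Fold g over a list of embeddings to represent a speaker set."""
    out = embeddings[0]
    for e in embeddings[1:]:
        out = g(out, e)
    return out


def infer_speaker_set(query, enrolled, max_speakers=2):
    """Return the speaker set whose composed embedding best matches the query.

    query    -- embedding of the input audio (what f would output)
    enrolled -- dict mapping speaker name to a single-speaker embedding
    """
    query = normalize(query)
    best_set, best_score = None, -np.inf
    names = sorted(enrolled)
    for k in range(1, max_speakers + 1):
        for subset in combinations(names, k):
            candidate = compose([enrolled[n] for n in subset])
            score = float(np.dot(query, candidate))  # cosine similarity
            if score > best_score:
                best_set, best_score = frozenset(subset), score
    return best_set, best_score


# Simulated data: random unit vectors play the role of f's outputs.
rng = np.random.default_rng(0)
enrolled = {
    name: normalize(rng.standard_normal(32))
    for name in ("alice", "bob", "carol")
}
# Simulated embedding of audio in which alice and carol speak at once.
query = normalize(
    enrolled["alice"] + enrolled["carol"] + 0.05 * rng.standard_normal(32)
)

predicted, score = infer_speaker_set(query, enrolled)
print(sorted(predicted), round(score, 3))
```

Enumerating every subset grows combinatorially with the number of enrolled speakers, so the sketch caps the set size with `max_speakers`; how the paper bounds or searches the candidate sets is not stated in the abstract.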
Related papers
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, with average reductions of 9.56% in EER and 8.24% in minDCF, respectively.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- Human-in-the-loop Speaker Adaptation for DNN-based Multi-speaker TTS [36.023566245506046]
We propose a human-in-the-loop speaker-adaptation method for multi-speaker text-to-speech.
The proposed method uses a sequential line search algorithm that repeatedly asks a user to select a point on a line segment in the embedding space.
Experimental results indicate that the proposed method can achieve comparable performance to the conventional one in objective and subjective evaluations.
arXiv Detail & Related papers (2022-06-21T11:08:05Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints [36.07346889498981]
We propose GC-TTS which achieves high-quality speaker adaptation with significantly improved speaker similarity.
A TTS model is pre-trained for base speakers with a sufficient amount of data, and then fine-tuned for novel speakers on a few minutes of data with two geometric constraints.
The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
arXiv Detail & Related papers (2021-08-16T04:25:31Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- End-to-End Speaker Diarization as Post-Processing [64.12519350944572]
Clustering-based diarization methods partition frames into as many clusters as there are speakers.
Some end-to-end diarization methods can handle overlapping speech by treating the problem as multi-label classification.
We propose to use a two-speaker end-to-end diarization method as post-processing of the results obtained by a clustering-based method.
arXiv Detail & Related papers (2020-12-18T05:31:07Z)
- Speaker Separation Using Speaker Inventories and Estimated Speech [78.57067876891253]
We propose speaker separation using speaker inventories (SSUSI) and speaker separation using estimated speech (SSUES).
By combining the advantages of permutation invariant training (PIT) and speech extraction, SSUSI significantly outperforms conventional approaches.
arXiv Detail & Related papers (2020-10-20T18:15:45Z)
- Joint Speaker Counting, Speech Recognition, and Speaker Identification for Overlapped Speech of Any Number of Speakers [38.3469744871394]
We propose an end-to-end speaker-attributed automatic speech recognition model.
It unifies speaker counting, speech recognition, and speaker identification on overlapped speech.
arXiv Detail & Related papers (2020-06-19T02:05:18Z)
- Voice Separation with an Unknown Number of Multiple Speakers [113.91855071999298]
We present a new method for separating a mixed audio sequence, in which multiple voices speak simultaneously.
The new method employs gated neural networks that are trained to separate the voices at multiple processing steps, while maintaining the speaker in each output channel fixed.
arXiv Detail & Related papers (2020-02-29T20:02:54Z)
- Supervised Speaker Embedding De-Mixing in Two-Speaker Environment [37.27421131374047]
Instead of separating a two-speaker signal in signal space like speech source separation, a speaker embedding de-mixing approach is proposed.
The proposed approach separates different speaker properties from a two-speaker signal in embedding space.
arXiv Detail & Related papers (2020-01-14T20:13:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.