Automatic Evaluation of Speaker Similarity
- URL: http://arxiv.org/abs/2207.00344v1
- Date: Fri, 1 Jul 2022 11:23:16 GMT
- Title: Automatic Evaluation of Speaker Similarity
- Authors: Kamil Deja, Ariadna Sanchez, Julian Roth, Marius Cotescu
- Abstract summary: We introduce a new automatic evaluation method for speaker similarity assessment, consistent with human perceptual scores.
Our experiments show that we can train a model to predict speaker similarity MUSHRA scores from speaker embeddings with 0.96 accuracy and a significant correlation of up to 0.78 (Pearson) at the utterance level.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a new automatic evaluation method for speaker similarity
assessment that is consistent with human perceptual scores. Modern neural
text-to-speech models require a vast amount of clean training data, which is
why many solutions switch from single speaker models to solutions trained on
examples from many different speakers. Multi-speaker models bring new
possibilities, such as faster creation of new voices, but also a new problem:
speaker leakage, where the speaker identity of a synthesized example might
not match that of the target speaker. Currently, the only way to discover this
issue is through costly perceptual evaluations. In this work, we propose an
automatic method for assessment of speaker similarity. For that purpose, we
extend the recent work on speaker verification systems and evaluate how
different metrics and speaker embedding models reflect Multiple Stimuli with
Hidden Reference and Anchor (MUSHRA) scores. Our experiments show that we can
train a model to predict speaker similarity MUSHRA scores from speaker
embeddings with 0.96 accuracy and a significant correlation of up to 0.78
(Pearson) at the utterance level.
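The evaluation pipeline described in the abstract, comparing speaker embeddings for synthesized and target speech and checking agreement with perceptual MUSHRA scores, can be sketched as below. This is a hypothetical illustration with synthetic data and hand-rolled math, not the authors' implementation; extracting embeddings with a speaker verification model is assumed to happen upstream.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two speaker embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pearson(xs, ys):
    """Pearson correlation between predicted similarities and MUSHRA scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)

# Toy example: per-utterance embedding similarity vs. made-up MUSHRA scores.
synth_vs_target = [
    ([0.9, 0.1], [1.0, 0.0]),   # close to target speaker
    ([0.5, 0.5], [1.0, 0.0]),   # partial speaker leakage
    ([0.0, 1.0], [1.0, 0.0]),   # identity mismatch
]
predicted = [cosine_similarity(s, t) for s, t in synth_vs_target]
mushra = [85.0, 60.0, 20.0]     # synthetic perceptual scores
correlation = pearson(predicted, mushra)
```

Here raw cosine similarity stands in for the learned predictor; the paper instead trains a model on top of the embeddings to predict MUSHRA scores directly.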
Related papers
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Self-supervised Speaker Diarization [19.111219197011355]
This study proposes an entirely unsupervised deep-learning model for speaker diarization.
Speaker embeddings are represented by an encoder trained in a self-supervised fashion using pairs of adjacent segments assumed to be of the same speaker.
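The adjacent-segment assumption above can be sketched as a simple pair-mining step; this is a hypothetical simplification that only shows where the free same-speaker labels come from, while the encoder training that consumes these pairs is omitted.

```python
def adjacent_pairs(segments):
    """Pair each segment with its neighbour. Adjacent segments of one
    recording are assumed to share a speaker, giving free positive
    labels for self-supervised embedding training."""
    return [(segments[i], segments[i + 1]) for i in range(len(segments) - 1)]

# e.g. four short segments of one recording -> three same-speaker pairs
pairs = adjacent_pairs(["seg0", "seg1", "seg2", "seg3"])
```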
arXiv Detail & Related papers (2022-04-08T16:27:14Z)
- Improved Relation Networks for End-to-End Speaker Verification and Identification [0.0]
Speaker identification systems are tasked to identify a speaker amongst a set of enrolled speakers given just a few samples.
We propose improved relation networks for speaker verification and few-shot (unseen) speaker identification.
Inspired by the use of prototypical networks in speaker verification, we train the model to classify samples in the current episode amongst all speakers present in the training set.
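The episode-level classification mentioned above follows the prototypical-network recipe. A minimal sketch, assuming mean-of-support prototypes and squared Euclidean distance, both standard choices rather than necessarily the paper's exact configuration:

```python
def prototypes(support):
    """Average each speaker's support embeddings into a prototype.
    support: dict mapping speaker id -> list of embedding vectors."""
    return {
        spk: [sum(dim) / len(vecs) for dim in zip(*vecs)]
        for spk, vecs in support.items()
    }

def classify(query, protos):
    """Assign the query embedding to the nearest speaker prototype."""
    def sq_dist(proto):
        return sum((q - p) ** 2 for q, p in zip(query, proto))
    return min(protos, key=lambda spk: sq_dist(protos[spk]))
```

A relation network would replace the fixed distance with a small learned comparison module, which is the improvement the paper proposes.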
arXiv Detail & Related papers (2022-03-31T17:44:04Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what".
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Speaker Generation [16.035697779803627]
This work explores the task of synthesizing speech in nonexistent human-sounding voices.
We present TacoSpawn, a system that performs competitively at this task.
arXiv Detail & Related papers (2021-11-07T22:31:41Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Speaker-Conditioned Hierarchical Modeling for Automated Speech Scoring [60.55025339250815]
We propose a novel deep learning technique for non-native automated speech scoring (ASS), called speaker-conditioned hierarchical modeling.
We take advantage of the fact that oral proficiency tests rate multiple responses for a candidate. We extract context from these responses and feed it as additional speaker-specific context to our network to score a particular response.
arXiv Detail & Related papers (2021-08-30T07:00:28Z)
- Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation [26.911867847630187]
We present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems.
We propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation.
arXiv Detail & Related papers (2021-07-04T05:47:18Z)
- Self-supervised Text-independent Speaker Verification using Prototypical Momentum Contrastive Learning [58.14807331265752]
We show that better speaker embeddings can be learned by momentum contrastive learning.
We generalize the self-supervised framework to a semi-supervised scenario where only a small portion of the data is labeled.
arXiv Detail & Related papers (2020-12-13T23:23:39Z)
- Speaker Sensitive Response Evaluation Model [17.381658875470638]
We propose an automatic evaluation model based on the similarity of the generated response with the conversational context.
We learn the model parameters from an unlabeled conversation corpus.
We show that our model can be applied to movie dialogues without any additional training.
arXiv Detail & Related papers (2020-06-12T08:59:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.