SVSNet: An End-to-end Speaker Voice Similarity Assessment Model
- URL: http://arxiv.org/abs/2107.09392v1
- Date: Tue, 20 Jul 2021 10:19:46 GMT
- Title: SVSNet: An End-to-end Speaker Voice Similarity Assessment Model
- Authors: Cheng-Hung Hu, Yu-Huai Peng, Junichi Yamagishi, Yu Tsao, Hsin-Min Wang
- Abstract summary: We propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between natural speech and synthesized speech.
The experimental results on the Voice Conversion Challenge 2018 and 2020 show that SVSNet notably outperforms well-known baseline systems.
- Score: 61.3813595968834
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural evaluation metrics derived for numerous speech generation tasks have
recently attracted great attention. In this paper, we propose SVSNet, the first
end-to-end neural network model to assess the speaker voice similarity between
natural speech and synthesized speech. Unlike most neural evaluation metrics
that use hand-crafted features, SVSNet directly takes the raw waveform as input
to more completely utilize speech information for prediction. SVSNet consists
of encoder, co-attention, distance calculation, and prediction modules and is
trained in an end-to-end manner. The experimental results on the Voice
Conversion Challenge 2018 and 2020 (VCC2018 and VCC2020) datasets show that
SVSNet notably outperforms well-known baseline systems in the assessment of
speaker similarity at the utterance and system levels.
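The abstract outlines a four-module pipeline (encoder, co-attention, distance calculation, prediction) over raw waveforms. As a rough illustration of how those modules might fit together, here is a minimal PyTorch sketch; the layer sizes, the convolutional encoder, and the dot-product co-attention are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of an SVSNet-like model. Module names follow the abstract;
# all layer sizes and the attention formulation are illustrative assumptions.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Maps a raw waveform (B, T) to a frame-level representation (B, T', D)."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, dim, kernel_size=10, stride=5), nn.ReLU(),
            nn.Conv1d(dim, dim, kernel_size=8, stride=4), nn.ReLU(),
        )

    def forward(self, wav):
        h = self.conv(wav.unsqueeze(1))      # (B, D, T')
        return h.transpose(1, 2)             # (B, T', D)

class CoAttention(nn.Module):
    """Aligns each utterance's frames against the other utterance."""
    def forward(self, h1, h2):
        attn = torch.softmax(h1 @ h2.transpose(1, 2), dim=-1)  # (B, T1, T2)
        h2_aligned = attn @ h2                                  # h2 seen from h1
        h1_aligned = attn.transpose(1, 2) @ h1                  # h1 seen from h2
        return h2_aligned, h1_aligned

class SVSNetSketch(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.encoder = Encoder(dim)
        self.co_attention = CoAttention()
        self.predictor = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, wav1, wav2):
        h1, h2 = self.encoder(wav1), self.encoder(wav2)
        a2, a1 = self.co_attention(h1, h2)
        # Distance module: mean absolute difference over time between each
        # side and its aligned counterpart.
        d1 = (h1 - a2).abs().mean(dim=1)     # (B, D)
        d2 = (h2 - a1).abs().mean(dim=1)
        # Prediction module: map the symmetric distance to a similarity score.
        return self.predictor((d1 + d2) / 2).squeeze(-1)

model = SVSNetSketch()
score = model(torch.randn(2, 16000), torch.randn(2, 16000))  # two 1 s utterances
print(score.shape)  # torch.Size([2])
```

The co-attention step aligns the two utterances before any distance is computed, which is what allows a frame-level comparison between utterances of different lengths.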
Related papers
- SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models [31.813459806715056]
Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks.
We propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity.
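The summary leaves the integration mechanism open; one plausible wiring replaces a raw-waveform encoder with frozen SFM features. The wav2vec 2.0 bundle from torchaudio and the simple cosine comparison below are stand-ins, not the SVSNet+ configuration.

```python
# Sketch of feeding speech-foundation-model (SFM) features into a similarity
# comparison, in the spirit of SVSNet+. The choice of wav2vec 2.0 and the
# cosine head are illustrative assumptions.
import torch
import torchaudio

bundle = torchaudio.pipelines.WAV2VEC2_BASE
sfm = bundle.get_model().eval()              # pre-trained SFM, kept frozen

def sfm_embedding(wav: torch.Tensor) -> torch.Tensor:
    """Mean-pool the last transformer layer into one utterance vector."""
    with torch.no_grad():
        features, _ = sfm.extract_features(wav)   # list of (B, T', D) layers
    return features[-1].mean(dim=1)               # (B, D)

wav_a = torch.randn(1, bundle.sample_rate)        # 1 s of (dummy) audio
wav_b = torch.randn(1, bundle.sample_rate)
sim = torch.nn.functional.cosine_similarity(sfm_embedding(wav_a), sfm_embedding(wav_b))
print(sim)  # a trained prediction head would replace the raw cosine in practice
```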
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
- Online speaker diarization of meetings guided by speech separation [0.0]
Overlapped speech is notoriously problematic for speaker diarization systems.
We introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings.
arXiv Detail & Related papers (2024-01-30T09:09:22Z)
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, our best solution combines the RetinaFace face detector with a deep ResNet face embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
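"Listener dependent" suggests conditioning the MOS predictor on a listener identity and marginalizing over listeners at inference time. A toy sketch under that assumption follows; the GRU encoder, embedding lookup, and mean-over-listeners inference are guesses for illustration, not LDNet's actual design.

```python
import torch
import torch.nn as nn

class ListenerDependentMOS(nn.Module):
    """Toy listener-dependent MOS predictor (all sizes are assumptions)."""
    def __init__(self, feat_dim=80, num_listeners=50, dim=64):
        super().__init__()
        self.listener_emb = nn.Embedding(num_listeners, dim)
        self.encoder = nn.GRU(feat_dim, dim, batch_first=True)
        self.head = nn.Linear(2 * dim, 1)

    def forward(self, feats, listener_id):
        h, _ = self.encoder(feats)                 # (B, T, dim)
        u = h.mean(dim=1)                          # utterance vector
        l = self.listener_emb(listener_id)         # listener vector
        return self.head(torch.cat([u, l], dim=-1)).squeeze(-1)

model = ListenerDependentMOS()
feats = torch.randn(1, 200, 80)                    # e.g. 200 mel frames
# "Mean listener" style inference: average predictions over all listeners.
all_ids = torch.arange(50)
mos = model(feats.expand(50, -1, -1), all_ids).mean()
print(mos)
```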
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- End-to-End Diarization for Variable Number of Speakers with Local-Global Networks and Discriminative Speaker Embeddings [66.50782702086575]
We present an end-to-end deep network model that performs meeting diarization from single-channel audio recordings.
The proposed system is designed to handle meetings with an unknown number of speakers, using variable-number permutation-invariant cross-entropy loss functions.
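Permutation-invariant losses score every assignment of reference speakers to output channels and keep the cheapest one. Below is a minimal fixed-size sketch using binary cross-entropy over frame-level speaker activities; the paper's variable-number formulation is more involved than this exhaustive search.

```python
# Minimal permutation-invariant cross-entropy for speaker activity labels.
from itertools import permutations
import torch
import torch.nn.functional as F

def pit_bce(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """logits, labels: (T, S) frame-by-speaker activity; best-permutation loss."""
    S = logits.shape[1]
    losses = [
        F.binary_cross_entropy_with_logits(logits[:, list(p)], labels)
        for p in permutations(range(S))   # try every channel-to-speaker map
    ]
    return torch.stack(losses).min()      # keep the cheapest assignment

logits = torch.randn(100, 3)                 # 100 frames, 3 output channels
labels = (torch.rand(100, 3) > 0.7).float()  # reference speaker activities
print(pit_bce(logits, labels))
```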
arXiv Detail & Related papers (2021-05-05T14:55:29Z)
- Deep Learning Based Assessment of Synthetic Speech Naturalness [14.463987018380468]
We present a new objective prediction model for synthetic speech naturalness.
It can be used to evaluate Text-To-Speech or Voice Conversion systems.
arXiv Detail & Related papers (2021-04-23T16:05:20Z)
- The HUAWEI Speaker Diarisation System for the VoxCeleb Speaker Diarisation Challenge [6.6238321827660345]
This paper describes the system setup of our submission to the speaker diarisation track (Track 4) of the VoxCeleb Speaker Recognition Challenge 2020.
Our diarisation system uses a well-trained neural-network-based speech enhancement model as a pre-processing front-end for the input speech signals.
arXiv Detail & Related papers (2020-10-22T12:42:07Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose AutoSpeech, the first neural architecture search approach for speaker recognition tasks.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
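The derivation step is easy to picture in code: once the search has fixed a cell, the final CNN is just that cell stacked. In this toy sketch the "searched" cell is hand-picked, since the search procedure itself is the hard part and is not shown.

```python
import torch
import torch.nn as nn

def make_cell(channels: int) -> nn.Module:
    """Stand-in for a searched cell; this op combination is hand-picked."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1),
        nn.BatchNorm2d(channels),
        nn.ReLU(),
    )

def derive_network(num_cells: int = 8, channels: int = 32) -> nn.Module:
    """Derive the final CNN by stacking the cell multiple times."""
    stem = nn.Conv2d(1, channels, 3, padding=1)
    cells = [make_cell(channels) for _ in range(num_cells)]
    head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 256))
    return nn.Sequential(stem, *cells, head)

net = derive_network()
emb = net(torch.randn(4, 1, 80, 200))   # batch of 4 spectrograms
print(emb.shape)                         # speaker embeddings: torch.Size([4, 256])
```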
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
- Comparison of Speech Representations for Automatic Quality Estimation in Multi-Speaker Text-to-Speech Synthesis [21.904558308567122]
We aim to characterize how different speakers contribute to the perceived output quality of multi-speaker Text-to-Speech synthesis.
We automatically rate the quality of TTS using a neural network (NN) trained on human mean opinion score (MOS) ratings.
arXiv Detail & Related papers (2020-02-28T10:44:32Z)
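At its core, rating TTS quality with a network trained on MOS labels is a regression problem. A compact sketch of that objective closes the list; the mel-feature input, two-layer rater, and optimizer settings are placeholders, not the paper's setup.

```python
import torch
import torch.nn as nn

# Placeholder rater: mean-pooled frame features -> scalar MOS estimate.
rater = nn.Sequential(nn.Linear(80, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(rater.parameters(), lr=1e-3)

feats = torch.randn(32, 200, 80)           # 32 utterances of mel features
human_mos = 1 + 4 * torch.rand(32)         # their human MOS ratings in [1, 5]

for step in range(100):
    pred = rater(feats.mean(dim=1)).squeeze(-1)
    loss = nn.functional.mse_loss(pred, human_mos)   # regress to human MOS
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))
```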