SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models
- URL: http://arxiv.org/abs/2406.08445v1
- Date: Wed, 12 Jun 2024 17:37:09 GMT
- Title: SVSNet+: Enhancing Speaker Voice Similarity Assessment Models with Representations from Speech Foundation Models
- Authors: Chun Yin, Tai-Shih Chi, Yu Tsao, Hsin-Min Wang
- Abstract summary: Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks.
We propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity.
- Score: 31.813459806715056
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Representations from pre-trained speech foundation models (SFMs) have shown impressive performance in many downstream tasks. However, the potential benefits of incorporating pre-trained SFM representations into speaker voice similarity assessment have not been thoroughly investigated. In this paper, we propose SVSNet+, a model that integrates pre-trained SFM representations to improve performance in assessing speaker voice similarity. Experimental results on the Voice Conversion Challenge 2018 and 2020 datasets show that SVSNet+ incorporating WavLM representations achieves significant improvements over the baseline models. In addition, while fine-tuning WavLM with a small dataset of the downstream task does not improve performance, using the same dataset to learn a weighted-sum representation of WavLM can substantially improve performance. Furthermore, when WavLM is replaced by other SFMs, SVSNet+ still outperforms the baseline models and exhibits strong generalization ability.
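The abstract's weighted-sum finding can be made concrete with a short sketch. Below is a minimal, hypothetical PyTorch example of learning a weighted sum over the hidden layers of a frozen WavLM, using the Hugging Face transformers WavLMModel; the checkpoint name, the WeightedSumSFM wrapper, and the toy usage at the end are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel


class WeightedSumSFM(nn.Module):
    """Learnable weighted sum over the hidden layers of a frozen SFM (WavLM here)."""

    def __init__(self, model_name: str = "microsoft/wavlm-base-plus"):
        super().__init__()
        self.sfm = WavLMModel.from_pretrained(model_name)
        self.sfm.requires_grad_(False)  # SFM stays frozen; only the layer weights are trained
        # hidden_states includes the CNN front-end output plus every Transformer layer
        num_layers = self.sfm.config.num_hidden_layers + 1
        self.layer_weights = nn.Parameter(torch.zeros(num_layers))

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples), 16 kHz mono
        with torch.no_grad():
            out = self.sfm(waveform, output_hidden_states=True)
        hidden = torch.stack(out.hidden_states, dim=0)          # (layers, batch, frames, dim)
        weights = torch.softmax(self.layer_weights, dim=0)      # normalized per-layer weights
        return (weights[:, None, None, None] * hidden).sum(0)   # (batch, frames, dim)


if __name__ == "__main__":
    # Toy usage: frame-level features for a test/reference pair, which a
    # downstream similarity predictor (e.g., an SVSNet-style model) could consume.
    encoder = WeightedSumSFM()
    test_wav, ref_wav = torch.randn(1, 16000), torch.randn(1, 16000)
    print(encoder(test_wav).shape, encoder(ref_wav).shape)
```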
Related papers
- Benchmarking Children's ASR with Supervised and Self-supervised Speech Foundation Models [23.383924361298874]
Speech foundation models (SFMs) have achieved state-of-the-art results for various speech tasks in supervised (e.g., Whisper) or self-supervised (e.g., WavLM) systems.
arXiv Detail & Related papers (2024-06-15T05:13:19Z)
- Audio Mamba: Selective State Spaces for Self-Supervised Audio Representations [16.269123889392343]
This work proposes Audio Mamba, a selective state space model for learning general-purpose audio representations.
Empirical results on ten diverse audio recognition downstream tasks show that the proposed models consistently outperform comparable self-supervised audio spectrogram transformer baselines.
arXiv Detail & Related papers (2024-06-04T10:19:14Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching a WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models [92.92233932921741]
We propose the AV-SUPERB benchmark that enables general-purpose evaluation of unimodal audio/visual and bimodal fusion representations.
We evaluate 5 recent self-supervised models and show that none of these models generalize to all tasks.
We show that representations may be improved with intermediate-task fine-tuning, and that audio event classification with AudioSet serves as a strong intermediate task.
arXiv Detail & Related papers (2023-09-19T17:35:16Z)
- Audio-visual speech enhancement with a deep Kalman filter generative model [0.0]
We present an audio-visual deep Kalman filter (AV-DKF) generative model, which assumes a first-order Markov chain model for the latent variables.
We develop an efficient inference methodology to estimate speech signals at test time.
arXiv Detail & Related papers (2022-11-02T09:50:08Z)
- Contrastive Audio-Visual Masked Autoencoder [85.53776628515561]
We propose the Contrastive Audio-Visual Masked Auto-Encoder (CAV-MAE).
Our fully self-supervised pretrained CAV-MAE achieves a new SOTA accuracy of 65.9% on VGGSound.
arXiv Detail & Related papers (2022-10-02T07:29:57Z)
- Self-Supervised Representation Learning for Speech Using Visual Grounding and Masked Language Modeling [13.956691231452336]
FaST-VGS is a Transformer-based model that learns to associate raw speech waveforms with semantically related images.
FaST-VGS+ is learned in a multi-task fashion with a masked language modeling objective.
We show that our models perform competitively on the ABX task, outperform all other concurrent submissions on the Syntactic and Semantic tasks, and nearly match the best system on the Lexical task.
arXiv Detail & Related papers (2022-02-07T22:09:54Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- SVSNet: An End-to-end Speaker Voice Similarity Assessment Model [61.3813595968834]
We propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between natural speech and synthesized speech.
The experimental results on the Voice Conversion Challenge 2018 and 2020 show that SVSNet notably outperforms well-known baseline systems.
arXiv Detail & Related papers (2021-07-20T10:19:46Z)