The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech
- URL: http://arxiv.org/abs/2409.09305v1
- Date: Sat, 14 Sep 2024 05:03:18 GMT
- Title: The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech
- Authors: Kaito Baba, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari
- Abstract summary: We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024.
Our system was designed for the VMC 2024 Track 1, which focused on the accurate prediction of naturalness mean opinion score (MOS) for high-quality synthetic speech.
In the VMC 2024 Track 1, our T05 system achieved first place in 7 out of 16 evaluation metrics and second place in the remaining 9 metrics, with a significant margin over the systems ranked third and below.
- Score: 28.168242593106566
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024. Our system was designed for the VMC 2024 Track 1, which focused on the accurate prediction of naturalness mean opinion score (MOS) for high-quality synthetic speech. In addition to a pretrained self-supervised learning (SSL)-based speech feature extractor, our system incorporates a pretrained image feature extractor to capture the differences among synthetic speech samples that appear in speech spectrograms. We first train two MOS predictors separately, one using the SSL-based features and one using the spectrogram-based features. We then fine-tune the two predictors jointly for better MOS prediction using the fusion of the two extracted features. In the VMC 2024 Track 1, our T05 system achieved first place in 7 out of 16 evaluation metrics and second place in the remaining 9 metrics, with a significant margin over the systems ranked third and below. We also report the results of our ablation study, which investigates the essential factors of our system.
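To make the two-branch design described in the abstract concrete, the sketch below shows one way such a fusion-based MOS predictor could be wired up in PyTorch: an SSL speech encoder and an ImageNet-pretrained image backbone each embed the same utterance, with separate per-branch MOS heads for the first training stage and a fused head for the fine-tuning stage. The specific backbones (torchaudio's wav2vec 2.0 base, torchvision's EfficientNet-B0), head sizes, and pooling choices are illustrative assumptions, not the exact configuration reported by the T05 authors.

```python
# Minimal sketch of a two-branch MOS predictor combining SSL speech features with
# image features computed from the log-mel spectrogram. The backbone choices below
# (wav2vec 2.0 base, EfficientNet-B0) are illustrative stand-ins, not necessarily
# the exact models used by the T05 system.
import torch
import torch.nn as nn
import torchaudio
from torchvision.models import efficientnet_b0


class TwoBranchMOSPredictor(nn.Module):
    def __init__(self, ssl_dim=768, img_dim=1280, hidden=256):
        super().__init__()
        # Branch 1: SSL speech features (wav2vec 2.0 base, 768-dim frame embeddings).
        self.ssl = torchaudio.pipelines.WAV2VEC2_BASE.get_model()
        # Branch 2: image features from the log-mel spectrogram treated as an image.
        self.melspec = torchaudio.transforms.MelSpectrogram(sample_rate=16000, n_mels=128)
        self.img = efficientnet_b0(weights="IMAGENET1K_V1")
        self.img.classifier = nn.Identity()  # keep the 1280-dim pooled features
        # Per-branch MOS heads (used when the two predictors are trained separately).
        self.ssl_head = nn.Sequential(nn.Linear(ssl_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.img_head = nn.Sequential(nn.Linear(img_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Fusion head (used when fine-tuning on the concatenated features).
        self.fusion_head = nn.Sequential(
            nn.Linear(ssl_dim + img_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, wav, fuse=True):
        # wav: (batch, samples) mono waveform at 16 kHz.
        ssl_layers, _ = self.ssl.extract_features(wav)
        ssl_vec = ssl_layers[-1].mean(dim=1)          # mean-pool last-layer frames -> (B, 768)
        spec = self.melspec(wav).clamp(min=1e-5).log()
        spec = spec.unsqueeze(1).repeat(1, 3, 1, 1)   # (B, 3, n_mels, frames) pseudo-RGB image
        img_vec = self.img(spec)                      # (B, 1280)
        if fuse:
            return self.fusion_head(torch.cat([ssl_vec, img_vec], dim=-1)).squeeze(-1)
        return self.ssl_head(ssl_vec).squeeze(-1), self.img_head(img_vec).squeeze(-1)
```

Following the two-stage recipe outlined in the abstract, one would first optimize the per-branch heads with fuse=False and then continue training with fuse=True, so that the fusion head (and optionally both extractors) is fine-tuned on the combined features.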
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Uncertainty as a Predictor: Leveraging Self-Supervised Learning for Zero-Shot MOS Prediction [40.51248841706311]
This paper addresses the gap in efficient audio quality prediction, especially in low-resource settings.
We demonstrate that uncertainty measures derived from out-of-the-box pretrained self-supervised learning models, such as wav2vec, correlate with VoiceMOS scores.
arXiv Detail & Related papers (2023-12-25T05:35:28Z)
- Investigating Content-Aware Neural Text-To-Speech MOS Prediction Using Prosodic and Linguistic Features [54.48824266041105]
Current state-of-the-art methods for automatic synthetic speech evaluation are based on MOS prediction neural models.
We propose to include prosodic and linguistic features as additional inputs in MOS prediction systems.
All MOS prediction systems are trained on SOMOS, a neural TTS-only dataset with crowdsourced naturalness MOS evaluations.
arXiv Detail & Related papers (2022-11-01T09:18:50Z)
- Deep Spectro-temporal Artifacts for Detecting Synthesized Speech [57.42110898920759]
This paper provides an overall assessment of track 1 (Low-quality Fake Audio Detection) and track 2 (Partially Fake Audio Detection).
In this paper, spectro-temporal artifacts were detected using raw temporal signals, spectral features, as well as deep embedding features.
We ranked 4th and 5th in track 1 and track 2, respectively.
arXiv Detail & Related papers (2022-10-11T08:31:30Z)
- Comparison of Speech Representations for the MOS Prediction System [1.2949520455740093]
We conduct experiments on a large-scale listening test corpus collected from past Blizzard and Voice Conversion Challenges.
We find that the wav2vec feature set showed the best generalization even though the given ground-truth was not always reliable.
arXiv Detail & Related papers (2022-06-28T08:18:18Z)
- SOMOS: The Samsung Open MOS Dataset for the Evaluation of Neural Text-to-Speech Synthesis [50.236929707024245]
The SOMOS dataset is the first large-scale mean opinion scores (MOS) dataset consisting of solely neural text-to-speech (TTS) samples.
It consists of 20K synthetic utterances of the LJ Speech voice, a public domain speech dataset.
arXiv Detail & Related papers (2022-04-06T18:45:20Z)
- The RoyalFlush System of Speech Recognition for M2MeT Challenge [5.863625637354342]
This paper describes our RoyalFlush system for the track of multi-speaker automatic speech recognition (ASR) in the M2MeT challenge.
We adopted a serialized output training (SOT) based multi-speaker ASR system with large-scale simulation data.
Our system achieved a 12.22% absolute Character Error Rate (CER) reduction on the validation set and 12.11% on the test set.
arXiv Detail & Related papers (2022-02-03T14:38:26Z)
- STC speaker recognition systems for the NIST SRE 2021 [56.05258832139496]
This paper presents a description of STC Ltd. systems submitted to the NIST 2021 Speaker Recognition Evaluation.
These systems consist of a number of diverse subsystems that use deep neural networks as feature extractors.
For the video modality, our best solution used the RetinaFace face detector and a deep ResNet face-embedding extractor trained on large face image datasets.
arXiv Detail & Related papers (2021-11-03T15:31:01Z)
- DelightfulTTS: The Microsoft Speech Synthesis System for Blizzard Challenge 2021 [31.750875486806184]
This paper describes the Microsoft end-to-end neural text to speech (TTS) system: DelightfulTTS for Blizzard Challenge 2021.
The goal of this challenge is to synthesize natural and high-quality speech from text, and we approach this goal from two perspectives.
arXiv Detail & Related papers (2021-10-25T02:47:59Z)
- Scenario Aware Speech Recognition: Advancements for Apollo Fearless Steps & CHiME-4 Corpora [70.46867541361982]
We consider a general non-semantic speech representation, called TRILL, which is trained with a self-supervised, triplet-loss-based criterion.
We observe +5.42% and +3.18% relative WER improvement for the development and evaluation sets of Fearless Steps.
arXiv Detail & Related papers (2021-09-23T00:43:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.