An Experimental Study: Assessing the Combined Framework of WavLM and
BEST-RQ for Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2312.05415v1
- Date: Fri, 8 Dec 2023 23:59:25 GMT
- Title: An Experimental Study: Assessing the Combined Framework of WavLM and
BEST-RQ for Text-to-Speech Synthesis
- Authors: Via Nielson, Steven Hillis
- Abstract summary: We propose a new model architecture specifically suited for text-to-speech (TTS) models.
We combine WavLM, a pre-trained self-supervised learning (SSL) speech model, and the BEST-RQ vector quantization framework.
Experiments on the LibriSpeech dataset with SUPERB benchmarking assert that the proposed model significantly underperforms.
- Score: 0.5076419064097734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new model architecture specifically suited for text-to-speech
(TTS) models. We combine WavLM, a pre-trained self-supervised learning (SSL)
speech model, and the BEST-RQ vector quantization framework. We assess the
extent to which the more task-agnostic WavLM, coupled with the superior
suitability of the simplistic BEST-RQ framework for a wider array of downstream
tasks, yields favorable outcomes. Experiments on the LibriSpeech dataset with
SUPERB benchmarking assert that the proposed model significantly underperforms.
We speculate the underlying reason for this performance is related to the
difference between featurizing raw audio waveforms and spectrograms with a
quantizer. We discuss the limitations of this approach to better guide future
advancements in TTS.
Related papers
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs)
Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO)
We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
arXiv Detail & Related papers (2024-11-04T06:07:53Z) - Enhancing Question Answering Precision with Optimized Vector Retrieval and Instructions [1.2425910171551517]
Question-answering (QA) is an important application of Information Retrieval (IR) and language models.
We propose an innovative approach to improve QA task performances by integrating optimized vector retrievals and instruction methodologies.
arXiv Detail & Related papers (2024-11-01T21:14:04Z) - NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training [17.54331997432642]
We introduce the next token prediction based speech pre-training method with random-projection quantizer (NEST-RQ)
NEST-RQ employs causal encoders with only left context and uses next token prediction (NTP) as the training task.
On the large-scale dataset, compared to BEST-RQ, the proposed NEST-RQ achieves comparable performance on non-streaming automatic speech recognition (ASR) and better performance on streaming ASR.
arXiv Detail & Related papers (2024-09-13T09:48:11Z) - A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z) - Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z) - Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z) - LDNet: Unified Listener Dependent Modeling in MOS Prediction for
Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z) - SUPERB: Speech processing Universal PERformance Benchmark [78.41287216481203]
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV)
SuperB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.
We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model.
arXiv Detail & Related papers (2021-05-03T17:51:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.