An Experimental Study: Assessing the Combined Framework of WavLM and
BEST-RQ for Text-to-Speech Synthesis
- URL: http://arxiv.org/abs/2312.05415v1
- Date: Fri, 8 Dec 2023 23:59:25 GMT
- Title: An Experimental Study: Assessing the Combined Framework of WavLM and
BEST-RQ for Text-to-Speech Synthesis
- Authors: Via Nielson, Steven Hillis
- Abstract summary: We propose a new model architecture specifically suited for text-to-speech (TTS) models.
We combine WavLM, a pre-trained self-supervised learning (SSL) speech model, and the BEST-RQ vector quantization framework.
Experiments on the LibriSpeech dataset with SUPERB benchmarking assert that the proposed model significantly underperforms.
- Score: 0.5076419064097734
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose a new model architecture specifically suited for text-to-speech
(TTS) models. We combine WavLM, a pre-trained self-supervised learning (SSL)
speech model, and the BEST-RQ vector quantization framework. We assess the
extent to which the more task-agnostic WavLM, coupled with the superior
suitability of the simplistic BEST-RQ framework for a wider array of downstream
tasks, yields favorable outcomes. Experiments on the LibriSpeech dataset with
SUPERB benchmarking assert that the proposed model significantly underperforms.
We speculate the underlying reason for this performance is related to the
difference between featurizing raw audio waveforms and spectrograms with a
quantizer. We discuss the limitations of this approach to better guide future
advancements in TTS.
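The abstract refers to the BEST-RQ vector quantization framework without describing it. As a rough illustration only, the sketch below shows the general idea of a random-projection quantizer of the kind BEST-RQ is known for: a frozen random projection and a frozen random codebook turn each feature frame into a discrete label by nearest-neighbour lookup. This is not the authors' implementation. The function names (`make_quantizer`, `quantize`), the dimensions, and the initialisation choices are all assumptions made for the example.

```python
import numpy as np


def make_quantizer(feat_dim, code_dim, codebook_size, seed=0):
    """Build a frozen random projection and a frozen random codebook.

    All sizes are hypothetical; neither array is ever trained.
    """
    rng = np.random.default_rng(seed)
    # Xavier-style uniform initialisation for the projection (an assumption).
    limit = np.sqrt(6.0 / (feat_dim + code_dim))
    projection = rng.uniform(-limit, limit, size=(feat_dim, code_dim))
    # Random codebook, normalised so every code vector has unit length.
    codebook = rng.normal(size=(codebook_size, code_dim))
    codebook /= np.linalg.norm(codebook, axis=1, keepdims=True)
    return projection, codebook


def quantize(features, projection, codebook):
    """Map each frame (row of `features`) to the index of its nearest code."""
    projected = features @ projection
    # Normalise the projected frames before the nearest-neighbour search.
    projected /= np.linalg.norm(projected, axis=1, keepdims=True) + 1e-8
    # Pairwise L2 distances between frames and code vectors.
    distances = np.linalg.norm(
        projected[:, None, :] - codebook[None, :, :], axis=-1
    )
    return distances.argmin(axis=1)


# Usage: 50 frames of 80-dimensional features (e.g. spectrogram frames).
rng = np.random.default_rng(1)
frames = rng.normal(size=(50, 80))
projection, codebook = make_quantizer(feat_dim=80, code_dim=16, codebook_size=256)
labels = quantize(frames, projection, codebook)
```

In this sketch the input is a matrix of frame-level features. Which representation is fed to the quantizer (spectrogram frames or features derived from raw waveforms) is exactly the difference the abstract speculates about.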
Related papers
- A Large-Scale Evaluation of Speech Foundation Models [110.95827399522204]
We establish the Speech processing Universal PERformance Benchmark (SUPERB) to study the effectiveness of the foundation model paradigm for speech.
We propose a unified multi-tasking framework to address speech processing tasks in SUPERB using a frozen foundation model followed by task-specialized, lightweight prediction heads.
arXiv Detail & Related papers (2024-04-15T00:03:16Z)
- GRASS: Unified Generation Model for Speech-to-Semantic Tasks [7.044414457214718]
We introduce a unified end-to-end (E2E) framework that generates target text conditioned on a task-related prompt for audio data.
Our proposed model achieves state-of-the-art (SOTA) results on many benchmarks covering speech named entity recognition, speech sentiment analysis, speech question answering, and more.
To facilitate future work on instruction fine-tuning for speech-to-semantic tasks, we release our instruction dataset and code.
arXiv Detail & Related papers (2023-09-06T06:44:26Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach the state-of-the-art result over the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of lower inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- LDNet: Unified Listener Dependent Modeling in MOS Prediction for Synthetic Speech [67.88748572167309]
We present LDNet, a unified framework for mean opinion score (MOS) prediction.
We propose two inference methods that provide more stable results and efficient computation.
arXiv Detail & Related papers (2021-10-18T08:52:31Z)
- Low-Latency Incremental Text-to-Speech Synthesis with Distilled Context Prediction Network [41.4599368523939]
We propose an incremental TTS method that directly predicts the unobserved future context with a lightweight model.
Experimental results show that the proposed method requires about ten times less inference time to achieve comparable synthetic speech quality.
arXiv Detail & Related papers (2021-09-22T13:29:10Z)
- SUPERB: Speech processing Universal PERformance Benchmark [78.41287216481203]
Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV).
SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks.
We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model.
arXiv Detail & Related papers (2021-05-03T17:51:09Z)
- Multimodal Semi-supervised Learning Framework for Punctuation Prediction in Conversational Speech [17.602098162338137]
We explore a multimodal semi-supervised learning approach for punctuation prediction.
We learn representations from large amounts of unlabelled audio and text data.
When trained on 1 hour of speech and text data, the proposed model achieved a 9-18% absolute improvement over the baseline model.
arXiv Detail & Related papers (2020-08-03T08:13:09Z)