Related papers: On the Effectiveness of Acoustic BPE in Decoder-Only TTS

On the Effectiveness of Acoustic BPE in Decoder-Only TTS

URL: http://arxiv.org/abs/2407.03892v1
Date: Thu, 4 Jul 2024 12:35:32 GMT
Title: On the Effectiveness of Acoustic BPE in Decoder-Only TTS
Authors: Bohan Li, Feiyu Shen, Yiwei Guo, Shuai Wang, Xie Chen, Kai Yu,
Abstract summary: Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM) To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE) has emerged in SLM that treats speech tokens from self-supervised semantic representations as characters to further compress the token sequence. We conduct a study on various settings of acoustic BPE to explore its effectiveness in decoder-only TTS models with semantic speech tokens.
Score: 16.013858075350054
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Discretizing speech into tokens and generating them by a decoder-only model have been a promising direction for text-to-speech (TTS) and spoken language modeling (SLM). To shorten the sequence length of speech tokens, acoustic byte-pair encoding (BPE) has emerged in SLM that treats speech tokens from self-supervised semantic representations as characters to further compress the token sequence. But the gain in TTS has not been fully investigated, and the proper choice of acoustic BPE remains unclear. In this work, we conduct a comprehensive study on various settings of acoustic BPE to explore its effectiveness in decoder-only TTS models with semantic speech tokens. Experiments on LibriTTS verify that acoustic BPE uniformly increases the intelligibility and diversity of synthesized speech, while showing different features across BPE settings. Hence, acoustic BPE is a favorable tool for decoder-only TTS.

Related papers

LM-SPT: LM-Aligned Semantic Distillation for Speech Tokenization [8.365515332927444]
Recent speech tokenization approaches aim to isolate semantic information from low-level acoustics to better align with language models.<n>We propose LM-SPT, a speech tokenization method that introduces a novel semantic distillation.<n>We show that LM-SPT achieves superior reconstruction fidelity compared to baselines.
arXiv Detail & Related papers (2025-06-20T04:15:14Z)
Koel-TTS: Enhancing LLM based Speech Generation with Preference Alignment and Classifier Free Guidance [9.87139502863569]
Koel-TTS is a suite of enhanced encoder-decoder Transformer TTS models. We introduce Koel-TTS, a suite of enhanced encoder-decoder Transformer TTS models.
arXiv Detail & Related papers (2025-02-07T06:47:11Z)
TTS-Transducer: End-to-End Speech Synthesis with Neural Transducer [6.1319363847980135]
TTS-Transducer is a novel architecture for text-to-speech, leveraging the strengths of audio models and neural transducers. We show that TTS-Transducer is a competitive and robust alternative to contemporary TTS systems.
arXiv Detail & Related papers (2025-01-10T19:50:32Z)
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
Improving Robustness of LLM-based Speech Synthesis by Learning Monotonic Alignment [19.48653924804823]
Large Language Model (LLM) based text-to-speech (TTS) systems have demonstrated remarkable capabilities in handling large speech datasets and generating natural speech for new speakers. However, LLM-based TTS models are not robust as the generated output can contain repeating words, missing words and mis-aligned speech. We examine these challenges in an encoder-decoder transformer model and find that certain cross-attention heads in such models implicitly learn the text and speech alignment when trained for predicting speech tokens for a given text.
arXiv Detail & Related papers (2024-06-25T22:18:52Z)
Improving Audio Codec-based Zero-Shot Text-to-Speech Synthesis with Multi-Modal Context and Large Language Model [11.62674351793]
We introduce a novel audio-based TTS model to adapt context features with multiple enhancements. Inspired by the success of Qformer, we propose a multi-modal context-enhanced Qformer. Our proposed method outperforms baselines across various context TTS scenarios.
arXiv Detail & Related papers (2024-06-06T03:06:45Z)
On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models. We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data. We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts. Recent Text-to-Speech architecture is designed for multiple code generation and monotonic alignment. We show thatRecent Text-to-Speech architecture outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model. It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech. It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs. The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference. Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z)
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large scale text labeled speech corpus. We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS) A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation. We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.