Continuous Speech Tokenizer in Text To Speech
- URL: http://arxiv.org/abs/2410.17081v1
- Date: Tue, 22 Oct 2024 15:02:37 GMT
- Title: Continuous Speech Tokenizer in Text To Speech
- Authors: Yixing Li, Ruobing Xie, Xingwu Sun, Yu Cheng, Zhanhui Kang
- Abstract summary: We propose a simple yet effective continuous speech tokenizer and a text-to-speech model based on continuous speech tokens.
Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MOS).
This enhancement is attributed to the better information preservation of the continuous speech tokenizer across both low and high frequencies in the frequency domain.
- Score: 27.057221389827735
- Abstract: The fusion of speech and language in the era of large language models has garnered significant attention. Discrete speech tokens are often used in text-to-speech tasks for speech compression and portability; they are convenient for joint training with text and offer good compression efficiency. However, we found that the discrete speech tokenizer still suffers from information loss. Therefore, we propose a simple yet effective continuous speech tokenizer and a text-to-speech model based on continuous speech tokens. Our results show that the speech language model based on the continuous speech tokenizer has better continuity and higher estimated Mean Opinion Scores (MOS). This enhancement is attributed to the better information preservation of the continuous speech tokenizer across both low and high frequencies in the frequency domain.
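For illustration only, here is a minimal sketch (not the paper's implementation) contrasting a discrete tokenizer, which snaps encoder frames to the nearest entry of a finite codebook, with a continuous tokenizer that keeps the frames unquantized. The frame dimension, codebook size, and random stand-in encoder outputs are assumptions; the point is only that quantization introduces a reconstruction error that the continuous path avoids.

```python
# Sketch: discrete (vector-quantized) vs. continuous speech tokenization.
# All sizes and the random "encoder" features are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

DIM = 80             # e.g. mel-spectrogram channels (assumption)
CODEBOOK_SIZE = 256  # finite codebook for the discrete tokenizer (assumption)

codebook = rng.normal(size=(CODEBOOK_SIZE, DIM))

def discrete_tokenize(frames: np.ndarray) -> np.ndarray:
    """Map each frame to its nearest codebook entry (lossy)."""
    # pairwise squared distances between frames (T, DIM) and codebook (K, DIM)
    dists = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    ids = dists.argmin(axis=1)
    return codebook[ids]  # reconstruction from the discrete token ids

def continuous_tokenize(frames: np.ndarray) -> np.ndarray:
    """Keep the continuous encoder output as the token sequence."""
    return frames  # no quantization step, hence no quantization loss

frames = rng.normal(size=(100, DIM))  # stand-in for encoder frame outputs

mse_discrete = np.mean((discrete_tokenize(frames) - frames) ** 2)
mse_continuous = np.mean((continuous_tokenize(frames) - frames) ** 2)
print(f"discrete tokenizer reconstruction MSE:   {mse_discrete:.4f}")
print(f"continuous tokenizer reconstruction MSE: {mse_continuous:.4f}")
```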
Related papers
- TransVIP: Speech to Speech Translation System with Voice and Isochrony Preservation [97.54885207518946]
We introduce a novel model framework TransVIP that leverages diverse datasets in a cascade fashion.
We propose two separate encoders to preserve the speaker's voice characteristics and isochrony from the source speech during the translation process.
Our experiments on the French-English language pair demonstrate that our model outperforms the current state-of-the-art speech-to-speech translation model.
arXiv Detail & Related papers (2024-05-28T04:11:37Z)
- SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models [58.996653700982556]
Existing speech tokens are not specifically designed for speech language modeling.
We propose SpeechTokenizer, a unified speech tokenizer for speech large language models.
Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark.
arXiv Detail & Related papers (2023-08-31T12:53:09Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [65.04385919645395]
token2vec is a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction.
arXiv Detail & Related papers (2022-10-30T06:38:19Z)
- A Textless Metric for Speech-to-Speech Comparison [20.658229254191266]
We introduce a new and simple method for comparing speech utterances without relying on text transcripts.
Our speech-to-speech comparison metric utilizes state-of-the-art speech2unit encoders like HuBERT to convert speech utterances into discrete acoustic units.
arXiv Detail & Related papers (2022-10-21T09:28:54Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model, which aims to improve end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)