SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language
Models
- URL: http://arxiv.org/abs/2308.16692v2
- Date: Tue, 23 Jan 2024 01:56:57 GMT
- Title: SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language
Models
- Authors: Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, Xipeng Qiu
- Abstract summary: Existing speech tokens are not specifically designed for speech language modeling.
We propose SpeechTokenizer, a unified speech tokenizer for speech large language models.
Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark.
- Score: 58.996653700982556
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Current speech large language models build upon discrete speech
representations, which can be categorized into semantic tokens and acoustic
tokens. However, existing speech tokens are not specifically designed for
speech language modeling. To assess the suitability of speech tokens for
building speech language models, we established the first benchmark,
SLMTokBench. Our results indicate that neither semantic nor acoustic tokens are
ideal for this purpose. Therefore, we propose SpeechTokenizer, a unified speech
tokenizer for speech large language models. SpeechTokenizer adopts the
Encoder-Decoder architecture with residual vector quantization (RVQ). Unifying
semantic and acoustic tokens, SpeechTokenizer disentangles different aspects of
speech information hierarchically across different RVQ layers. Furthermore, We
construct a Unified Speech Language Model (USLM) leveraging SpeechTokenizer.
Experiments show that SpeechTokenizer performs comparably to EnCodec in speech
reconstruction and demonstrates strong performance on the SLMTokBench
benchmark. Also, USLM outperforms VALL-E in zero-shot Text-to-Speech tasks.
Code and models are available at
https://github.com/ZhangXInFD/SpeechTokenizer/.
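The abstract describes an Encoder-Decoder codec built around residual vector quantization (RVQ), with earlier quantizer layers steered toward semantic content and later layers carrying the remaining acoustic detail. As a minimal sketch of the residual mechanism only, the PyTorch snippet below implements a generic RVQ stack; it is not the SpeechTokenizer implementation, and the layer count, codebook size, and feature dimension are illustrative assumptions (the training objectives that produce the semantic/acoustic split are omitted).
```python
import torch


class ResidualVQ(torch.nn.Module):
    """Minimal residual vector quantizer (illustration only, not the paper's code).
    Each layer quantizes whatever residual the previous layers left behind,
    so the first codebook carries the coarsest information and later
    codebooks add progressively finer detail."""

    def __init__(self, num_layers=8, codebook_size=1024, dim=256):
        super().__init__()
        self.codebooks = torch.nn.ModuleList(
            [torch.nn.Embedding(codebook_size, dim) for _ in range(num_layers)]
        )

    def forward(self, z):                      # z: (batch, frames, dim)
        residual = z
        quantized = torch.zeros_like(z)
        codes = []
        for codebook in self.codebooks:
            # Squared L2 distance from the current residual to every codebook entry.
            dists = (residual.unsqueeze(-2) - codebook.weight).pow(2).sum(-1)
            idx = dists.argmin(dim=-1)         # (batch, frames)
            q = codebook(idx)                  # nearest codebook vectors
            quantized = quantized + q
            residual = residual - q            # hand the leftover to the next layer
            codes.append(idx)
        return quantized, torch.stack(codes)   # codes: (num_layers, batch, frames)


# Toy usage on a random "encoder output": 8 layers, 50 frames, 256-dim features.
rvq = ResidualVQ()
z = torch.randn(1, 50, 256)
recon, codes = rvq(z)
first_layer_tokens = codes[0]   # the stream the paper steers toward semantic content
later_layer_tokens = codes[1:]  # streams left to model the acoustic residual
```
The layered structure is the point: dropping later codes degrades acoustic detail gracefully, while the first-layer tokens remain usable on their own as a compact content stream.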
Related papers
- DC-Spin: A Speaker-invariant Speech Tokenizer for Spoken Language Models [45.791472119671916]
Spoken language models (SLMs) process text and speech, enabling simultaneous speech understanding and generation.
DC-Spin aims to improve speech tokenization by bridging audio signals and SLM tokens.
We propose a chunk-wise approach to enable streamable DC-Spin without retraining or degradation.
arXiv Detail & Related papers (2024-10-31T17:43:13Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis [39.892633589217326]
Large language model (LLM)-based speech synthesis has been widely adopted in zero-shot speech synthesis.
This paper proposes HierSpeech++, a fast and strong zero-shot speech synthesizer for text-to-speech (TTS) and voice conversion (VC).
arXiv Detail & Related papers (2023-11-21T09:07:11Z)
- RepCodec: A Speech Representation Codec for Speech Tokenization [21.60885344868044]
RepCodec is a novel speech representation codec for semantic speech tokenization.
We show that RepCodec significantly outperforms the widely used k-means clustering approach in both speech understanding and generation.
arXiv Detail & Related papers (2023-08-31T23:26:10Z)
- SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z)
- SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts [108.04306136086807]
We present research that explores the application of prompt tuning to stimulate speech LMs for various generation tasks, within a unified framework called SpeechGen.
The proposed unified framework holds great promise for efficiency and effectiveness, particularly with the imminent arrival of advanced speech LMs.
arXiv Detail & Related papers (2023-06-03T22:35:27Z)
- SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z)
- UWSpeech: Speech to Speech Translation for Unwritten Languages [145.37116196042282]
We develop a translation system for unwritten languages, named UWSpeech, which converts target unwritten speech into discrete tokens with a converter.
We propose a method called XL-VAE, which enhances vector quantized variational autoencoder (VQ-VAE) with cross-lingual (XL) speech recognition.
Experiments on Fisher Spanish-English conversation translation dataset show that UWSpeech outperforms direct translation and VQ-VAE baseline by about 16 and 10 BLEU points respectively.
arXiv Detail & Related papers (2020-06-14T15:22:12Z)