Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations
- URL: http://arxiv.org/abs/2503.12115v1
- Date: Sat, 15 Mar 2025 12:50:43 GMT
- Title: Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations
- Authors: Xue Jiang, Xiulian Peng, Yuan Zhang, Yan Lu
- Abstract summary: This paper unifies two types of tokens and proposes UniCodec, a universal speech token learning method that encapsulates all semantics of speech. A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features.
- Score: 23.059241057567956
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current large speech language models are mainly based on semantic tokens from discretization of self-supervised learned representations and acoustic tokens from a neural codec, following a semantic-modeling and acoustic-synthesis paradigm. However, semantic tokens discard paralinguistic attributes of speakers that are important for natural spoken communication, while prompt-based acoustic synthesis from semantic tokens has limits in recovering paralinguistic details and suffers from robustness issues, especially when there are domain gaps between the prompt and the target. This paper unifies the two types of tokens and proposes UniCodec, a universal speech token learning method that encapsulates all semantics of speech, including linguistic and paralinguistic information, into a compact and semantically disentangled unified token. Such a unified token can not only benefit speech language models in understanding with paralinguistic hints but also help speech generation with high-quality output. A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features. Extensive evaluations on multilingual datasets demonstrate its effectiveness in generating natural, expressive, and long-term consistent output with paralinguistic attributes well preserved across several speech processing tasks.
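The abstract describes the mechanism only at a high level; the sketch below is one plausible, hypothetical reading of the stated recipe: a codec encoder with a frame-level and an utterance-level codebook, plus a distillation loss toward self-supervised (SSL) teacher features. All module names, dimensions, and the mean-pooled global token are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedTokenizerSketch(nn.Module):
    """Hypothetical reading of the described recipe: local (frame) tokens,
    one global (utterance) token, and distillation toward SSL features."""
    def __init__(self, dim=256, codebook_size=1024, ssl_dim=768):
        super().__init__()
        self.encoder = nn.Conv1d(80, dim, kernel_size=3, padding=1)  # stand-in encoder
        self.local_codebook = nn.Embedding(codebook_size, dim)       # frame-level tokens
        self.global_codebook = nn.Embedding(codebook_size, dim)      # utterance-level token
        self.to_ssl = nn.Linear(dim, ssl_dim)                        # project to teacher dim

    def quantize(self, x, codebook):
        # nearest-neighbour codebook lookup with a straight-through gradient
        flat = x.reshape(-1, x.size(-1))
        ids = torch.cdist(flat, codebook.weight).argmin(-1).reshape(x.shape[:-1])
        q = codebook(ids)
        return x + (q - x).detach(), ids

    def forward(self, mel, ssl_feats):
        h = self.encoder(mel).transpose(1, 2)                        # (B, T, dim)
        local_q, local_ids = self.quantize(h, self.local_codebook)
        _, global_id = self.quantize(h.mean(dim=1), self.global_codebook)
        # distillation: local tokens should predict the SSL teacher features
        distill = F.mse_loss(self.to_ssl(local_q), ssl_feats)
        return local_ids, global_id, distill

mel = torch.randn(2, 80, 100)    # fake log-mel batch
ssl = torch.randn(2, 100, 768)   # fake SSL teacher features, HuBERT-sized
ids, gid, loss = UnifiedTokenizerSketch()(mel, ssl)
print(ids.shape, gid.shape, loss.item())   # (2, 100), (2,), scalar
```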
Related papers
- DM-Codec: Distilling Multimodal Representations for Speech Tokenization [11.433520275513803]
DM-Codec is a language model-guided distillation method that incorporates contextual information.
It significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset.
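The summary does not spell out the distillation objective; below is a minimal, hypothetical sketch of an LM-guided distillation term, assuming codec frame features already projected and time-aligned to contextual LM embeddings (not DM-Codec's actual code).

```python
import torch
import torch.nn.functional as F

def lm_distill_loss(codec_feats, lm_feats):
    """codec_feats: (B, T, D) codec encoder output, already projected to the
    LM hidden size; lm_feats: (B, T, D) contextual LM embeddings aligned to
    the same frames. Cosine distance pulls the codec's representations
    toward the LM's contextual information."""
    return (1 - F.cosine_similarity(codec_feats, lm_feats, dim=-1)).mean()

loss = lm_distill_loss(torch.randn(2, 50, 768), torch.randn(2, 50, 768))
print(loss.item())
```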
arXiv Detail & Related papers (2024-10-19T07:14:14Z)
- Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure.
Specifically, we propose a self-supervised learning framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation.
This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling.
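The actual segmentation algorithm is not given here; the following is a generic linear-time boundary detector over frame embeddings, offered only to illustrate what a single-pass, O(T) segmentation can look like, not Sylber's method. The threshold is an arbitrary assumption.

```python
import numpy as np

def segment_frames(frames, threshold=0.7):
    """frames: (T, D) array of frame embeddings. One pass over T frames
    -> linear time. Opens a new segment whenever adjacent frames
    become dissimilar."""
    norms = frames / (np.linalg.norm(frames, axis=1, keepdims=True) + 1e-8)
    sims = (norms[:-1] * norms[1:]).sum(axis=1)   # cosine sim of adjacent frames
    boundaries = [0]
    for t, s in enumerate(sims, start=1):
        if s < threshold:                          # similarity dip => new segment
            boundaries.append(t)
    return boundaries

emb = np.random.randn(100, 64)                     # fake frame embeddings
print(segment_frames(emb)[:10])
```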
arXiv Detail & Related papers (2024-10-09T17:59:04Z)
- dMel: Speech Tokenization made Simple [19.169460770473908]
We show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel).
Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework.
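The core operation as stated is simple enough to sketch directly: quantize each mel channel independently into a fixed number of intensity bins. The bin count and bin edges below are illustrative assumptions.

```python
import numpy as np

def dmel_sketch(log_mel, n_bins=16):
    """log_mel: (T, n_mels) log mel-filterbank energies.
    Returns integer bin indices of the same shape: one discrete
    intensity token per channel per frame."""
    lo, hi = log_mel.min(), log_mel.max()
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]   # n_bins - 1 interior edges
    return np.digitize(log_mel, edges)               # values in [0, n_bins - 1]

log_mel = np.random.randn(100, 80)                   # fake log-mel spectrogram
tokens = dmel_sketch(log_mel)
print(tokens.shape, tokens.min(), tokens.max())
```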
arXiv Detail & Related papers (2024-07-22T17:51:53Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
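A hedged sketch of the quoted idea of inserting vector quantization into a recognition encoder, so that the codebook indices serve as supervised semantic tokens; the layers and sizes below are stand-ins, not CosyVoice's architecture.

```python
import torch
import torch.nn as nn

class VQBottleneckEncoder(nn.Module):
    def __init__(self, dim=256, codebook_size=4096):
        super().__init__()
        self.lower = nn.GRU(80, dim, batch_first=True)    # lower encoder layers
        self.codebook = nn.Embedding(codebook_size, dim)  # token inventory
        self.upper = nn.GRU(dim, dim, batch_first=True)   # upper layers -> ASR loss

    def forward(self, feats):
        h, _ = self.lower(feats)
        ids = torch.cdist(h, self.codebook.weight.expand(h.size(0), -1, -1)).argmin(-1)
        q = self.codebook(ids)                             # quantized features
        out, _ = self.upper(h + (q - h).detach())          # straight-through gradient
        return ids, out                                    # ids = semantic tokens

ids, out = VQBottleneckEncoder()(torch.randn(2, 100, 80))
print(ids.shape, out.shape)   # (2, 100), (2, 100, 256)
```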
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- CLARA: Multilingual Contrastive Learning for Audio Representation Acquisition [5.520654376217889]
CLARA minimizes reliance on labelled data, enhancing generalization across languages.
Our approach adeptly captures emotional nuances in speech, overcoming subjective assessment issues.
It adapts to low-resource languages, marking progress in multilingual speech representation learning.
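The summary names contrastive learning without details; the symmetric InfoNCE objective over paired audio and text embeddings shown below is the standard form such methods build on (a generic sketch, not CLARA's exact loss; the temperature is an assumption).

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (B, D) paired embeddings; matched pairs share an
    index. Each row is trained to pick out its own pair within the batch."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature
    labels = torch.arange(a.size(0))
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels)) / 2

print(contrastive_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```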
arXiv Detail & Related papers (2023-10-18T09:31:56Z)
- On decoder-only architecture for speech-to-text and large language model integration [59.49886892602309]
Speech-LLaMA is a novel approach that effectively incorporates acoustic information into text-based large language models.
We conduct experiments on multilingual speech-to-text translation tasks and demonstrate a significant improvement over strong baselines.
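A minimal sketch of the decoder-only integration pattern described: project acoustic features into the LM's embedding space and prepend them to the text tokens, so one decoder attends over both. The stand-in transformer below omits causal masking and pretrained weights; names and sizes are assumptions.

```python
import torch
import torch.nn as nn

class SpeechPrefixLM(nn.Module):
    def __init__(self, vocab=32000, dim=512, acoustic_dim=80):
        super().__init__()
        self.audio_proj = nn.Linear(acoustic_dim, dim)   # acoustic -> LM space
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, 8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=2)  # stand-in LM
        self.head = nn.Linear(dim, vocab)

    def forward(self, audio_feats, text_ids):
        prefix = self.audio_proj(audio_feats)            # (B, Ta, dim)
        tokens = self.embed(text_ids)                    # (B, Tt, dim)
        h = self.decoder(torch.cat([prefix, tokens], dim=1))
        return self.head(h[:, prefix.size(1):])          # predict text positions only

logits = SpeechPrefixLM()(torch.randn(2, 50, 80), torch.randint(0, 32000, (2, 10)))
print(logits.shape)   # (2, 10, 32000)
```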
arXiv Detail & Related papers (2023-07-08T06:47:58Z)
- Disentangled Feature Learning for Real-Time Neural Speech Coding [24.751813940000993]
In this paper, instead of blind end-to-end learning, we propose to learn disentangled features for real-time neural speech coding.
We find that the learned disentangled features show performance comparable to modern self-supervised speech representation learning models on any-to-any voice conversion.
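One common disentanglement pattern consistent with this description, shown purely as an assumption-laden illustration: treat the temporal mean of the coder's latent as the time-invariant (speaker) part and the residual as frame-level content.

```python
import torch

def split_speaker_content(latent):
    """latent: (B, T, D). Speaker vector = temporal mean; content = residual."""
    speaker = latent.mean(dim=1, keepdim=True)   # (B, 1, D), one per utterance
    content = latent - speaker                   # (B, T, D), varies per frame
    return speaker.squeeze(1), content

spk, con = split_speaker_content(torch.randn(2, 100, 256))
print(spk.shape, con.shape)   # (2, 256), (2, 100, 256)
```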
arXiv Detail & Related papers (2022-11-22T02:50:12Z)
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments were additionally performed to show that gestural scores indeed code phonological information successfully.
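The underlying convolutive factorization model can be written out directly: data V (features × time) is approximated by temporally extended templates ("gestures") W driven by a gestural score H. The sketch below implements only the reconstruction equation V ≈ Σ_t W[t] · shift(H, t); the neural/sparse learning procedure is omitted.

```python
import numpy as np

def convolutive_reconstruct(W, H):
    """W: (T_w, F, K) gesture templates over T_w time lags;
    H: (K, N) gestural score (activation of each gesture over time).
    Returns V_hat: (F, N)."""
    T_w, F_dim, K = W.shape
    assert K == H.shape[0]
    N = H.shape[1]
    V_hat = np.zeros((F_dim, N))
    for t in range(T_w):
        shifted = np.zeros_like(H)
        shifted[:, t:] = H[:, : N - t] if t else H   # delay activations by t frames
        V_hat += W[t] @ shifted
    return V_hat

V_hat = convolutive_reconstruct(np.abs(np.random.randn(5, 12, 8)),
                                np.abs(np.random.randn(8, 200)))
print(V_hat.shape)   # (12, 200)
```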
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
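A minimal LSTM language model over sub-word unit IDs (syllables or phonemes) of the kind described; the vocabulary size and dimensions are placeholders.

```python
import torch
import torch.nn as nn

class UnitLSTMLM(nn.Module):
    def __init__(self, n_units=200, dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_units, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(dim, n_units)

    def forward(self, unit_ids):
        h, _ = self.lstm(self.embed(unit_ids))
        return self.head(h)                  # next-unit logits per position

ids = torch.randint(0, 200, (2, 30))         # fake syllable/phoneme ID sequences
logits = UnitLSTMLM()(ids)
print(logits.shape)                          # (2, 30, 200)
```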
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
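A sketch of the cooperative pattern: bridge an acoustic encoder's frame representations into a text encoder's embedding space so both can be trained end to end. Stand-in modules are used here in place of actual wav2vec 2.0 and BERT weights.

```python
import torch
import torch.nn as nn

class AcousticLinguisticBridge(nn.Module):
    def __init__(self, acoustic_dim=512, text_dim=768):
        super().__init__()
        self.acoustic = nn.GRU(80, acoustic_dim, batch_first=True)  # wav2vec stand-in
        self.bridge = nn.Linear(acoustic_dim, text_dim)             # embedding adapter
        layer = nn.TransformerEncoderLayer(text_dim, 12, batch_first=True)
        self.linguistic = nn.TransformerEncoder(layer, 2)           # BERT stand-in

    def forward(self, feats):
        h, _ = self.acoustic(feats)
        return self.linguistic(self.bridge(h))   # contextualized by the "LM"

out = AcousticLinguisticBridge()(torch.randn(2, 100, 80))
print(out.shape)   # (2, 100, 768)
```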
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance [55.10864476206503]
We investigate the use of quantized vectors to model the latent linguistic embedding.
By enforcing different policies over the latent spaces during training, we are able to obtain a latent linguistic embedding.
Our experiments show that the voice cloning system built with vector quantization suffers only a small degradation in perceptual evaluations.
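The vector-quantization step such systems build on, in its generic VQ-VAE form with the usual codebook and commitment terms; this is a sketch of the standard technique, not this paper's specific latent-space policies.

```python
import torch
import torch.nn.functional as F

def vq(latents, codebook, beta=0.25):
    """latents: (B, T, D); codebook: (K, D). Returns straight-through
    quantized latents, codebook indices, and codebook + commitment loss."""
    flat = latents.reshape(-1, latents.size(-1))
    ids = torch.cdist(flat, codebook).argmin(-1)
    q = codebook[ids].view_as(latents)
    loss = F.mse_loss(q, latents.detach()) + beta * F.mse_loss(latents, q.detach())
    return latents + (q - latents).detach(), ids.view(latents.shape[:-1]), loss

codebook = torch.randn(256, 64)
q, ids, loss = vq(torch.randn(2, 50, 64), codebook)
print(q.shape, ids.shape, loss.item())   # (2, 50, 64), (2, 50), scalar
```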
arXiv Detail & Related papers (2021-06-25T07:51:35Z)
- CSTNet: Contrastive Speech Translation Network for Self-Supervised Speech Representation Learning [11.552745999302905]
More than half of the 7,000 languages in the world are in imminent danger of going extinct.
It is relatively easy to obtain textual translations corresponding to speech.
We construct a convolutional neural network audio encoder capable of extracting linguistic representations from speech.
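A small 1-D convolutional waveform encoder of the kind the description names, mapping raw audio to frame-level representation vectors; the layer sizes below are illustrative only.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(
    nn.Conv1d(1, 64, kernel_size=10, stride=5), nn.ReLU(),    # downsample waveform
    nn.Conv1d(64, 128, kernel_size=8, stride=4), nn.ReLU(),
    nn.Conv1d(128, 256, kernel_size=4, stride=2), nn.ReLU(),  # -> frame features
)

wave = torch.randn(2, 1, 16000)   # one second of fake 16 kHz audio
frames = encoder(wave)
print(frames.shape)               # (2, 256, 398)
```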
arXiv Detail & Related papers (2020-06-04T12:21:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.