Sylber: Syllabic Embedding Representation of Speech from Raw Audio
- URL: http://arxiv.org/abs/2410.07168v2
- Date: Sun, 02 Mar 2025 09:16:05 GMT
- Title: Sylber: Syllabic Embedding Representation of Speech from Raw Audio
- Authors: Cheol Jun Cho, Nicholas Lee, Akshat Gupta, Dhruv Agarwal, Ethan Chen, Alan W Black, Gopala K. Anumanchipalli
- Abstract summary: We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised learning framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling.
- Score: 25.703703711031178
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Syllables are compositional units of spoken language that efficiently structure human speech perception and production. However, current neural speech representations lack such structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised learning (SSL) framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling. Our proposed segmentation method is highly robust and generalizes to out-of-domain data and unseen languages without any tuning. By training token-to-speech generative models, fully intelligible speech can be reconstructed from Sylber tokens with a significantly lower bitrate than baseline SSL tokens. This suggests that our model effectively compresses speech into a compact sequence of tokens with minimal information loss. Lastly, we demonstrate that categorical perception, a linguistic phenomenon in speech perception, emerges naturally in Sylber, making the embedding space more categorical and sparse than previous speech features and thus supporting the high efficiency of our tokenization. Together, we present a novel SSL approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.
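The fast, linear-time segmentation described in the abstract operates over frame-level embeddings. As a hedged illustration only (the exact boundary criterion is not specified here, so the threshold rule below is an assumption), the following sketch makes a single left-to-right pass over frames, opening a new segment whenever a frame's cosine similarity to the running segment mean drops below a threshold, and averages each segment into one embedding:

```python
# Illustrative linear-time segmentation over frame embeddings (NumPy).
# The similarity-threshold rule is an assumption for illustration,
# not necessarily Sylber's exact boundary criterion.
import numpy as np

def segment_frames(frames: np.ndarray, threshold: float = 0.8):
    """frames: (T, D) frame embeddings -> list of (start, end) spans and (S, D) segment means."""
    segments, embeddings = [], []
    start, count = 0, 1
    seg_sum = frames[0].astype(np.float64).copy()
    for t in range(1, len(frames)):
        centroid = seg_sum / count
        cos = frames[t] @ centroid / (np.linalg.norm(frames[t]) * np.linalg.norm(centroid) + 1e-8)
        if cos < threshold:                      # similarity drop -> close the segment
            segments.append((start, t))
            embeddings.append(centroid)
            start, count = t, 1
            seg_sum = frames[t].astype(np.float64).copy()
        else:                                    # extend the segment via a running sum
            seg_sum += frames[t]
            count += 1
    segments.append((start, len(frames)))
    embeddings.append(seg_sum / count)
    return segments, np.stack(embeddings)

# 100 frames (~2 s at 50 Hz) of 256-dim features as dummy input.
feats = np.random.default_rng(0).normal(size=(100, 256)).astype(np.float32)
segs, embs = segment_frames(feats)
print(len(segs), embs.shape)
```

Because each frame is visited once and the segment mean is maintained as a running sum, the pass is linear in the number of frames.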
Related papers
- Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [23.059241057567956]
This paper unifies two types of tokens and proposes UniCodec, a universal speech token learning approach that encapsulates all semantics of speech.
A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features.
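One common way to realize "knowledge distilled from self-supervised features" is an auxiliary loss that pulls the codec's continuous latents toward frozen SSL features of the same frames. The snippet below is purely an illustrative assumption of that general idea, not UniCodec's actual objective:

```python
# Hypothetical distillation term: make codec latents match frozen SSL features.
import torch
import torch.nn.functional as F

def distill_loss(codec_latents: torch.Tensor, ssl_feats: torch.Tensor) -> torch.Tensor:
    """codec_latents, ssl_feats: (B, T, D) frame-aligned features; maximize cosine similarity."""
    return 1.0 - F.cosine_similarity(codec_latents, ssl_feats, dim=-1).mean()

print(float(distill_loss(torch.randn(2, 100, 768), torch.randn(2, 100, 768))))
```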
arXiv Detail & Related papers (2025-03-15T12:50:43Z) - SyllableLM: Learning Coarse Semantic Units for Speech Language Models [21.762112843104028]
We introduce a controllable self-supervised technique to merge speech representations into coarser syllable-like units.
Our method produces controllable-rate semantic units at as low as 5Hz and 60bps and achieves SotA performance in segmentation and clustering.
SyllableLM achieves significant improvements in efficiency with a 30x reduction in training compute and a 4x wall-clock inference speedup.
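A controllable unit rate like the 5 Hz figure above can be reached, for example, by agglomeratively merging the most similar adjacent units until a target units-per-second budget is met. The routine below is a hypothetical illustration of that general idea, not SyllableLM's actual merging procedure:

```python
# Hypothetical controllable-rate merging: repeatedly fuse the most similar
# adjacent pair of units until a target units-per-second rate is reached.
import numpy as np

def merge_to_rate(units: np.ndarray, duration_sec: float, target_rate_hz: float = 5.0):
    """units: (N, D) unit embeddings covering `duration_sec` seconds of audio."""
    units = [u.astype(np.float64) for u in units]
    target = max(1, int(round(duration_sec * target_rate_hz)))
    while len(units) > target:
        sims = [
            a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
            for a, b in zip(units[:-1], units[1:])
        ]
        i = int(np.argmax(sims))                              # most redundant adjacent pair
        units[i:i + 2] = [(units[i] + units[i + 1]) / 2.0]    # merge by averaging
    return np.stack(units)

coarse = merge_to_rate(np.random.randn(50, 768), duration_sec=2.0)  # -> ~10 units
print(coarse.shape)
```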
arXiv Detail & Related papers (2024-10-05T04:29:55Z) - Moshi: a speech-text foundation model for real-time dialogue [78.88479749811376]
Current systems for spoken dialogue rely on pipelines of independent components such as voice activity detection, speech recognition, and text-to-speech.
We show how Moshi can provide streaming speech recognition and text-to-speech.
Our resulting model is the first real-time full-duplex spoken large language model.
arXiv Detail & Related papers (2024-09-17T17:55:39Z) - Improving Spoken Language Modeling with Phoneme Classification: A Simple Fine-tuning Approach [14.5696754689252]
Recent progress in Spoken Language Modeling has shown that learning language directly from speech is feasible.
We show that fine-tuning speech representation models on phoneme classification leads to more context-invariant representations.
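A minimal sketch of this kind of fine-tuning, assuming frame-aligned phoneme labels and a generic SSL encoder that maps audio frames to (B, T, D) features (both are placeholders, not the paper's exact setup):

```python
# Frame-level phoneme classification head on top of a pretrained speech encoder.
import torch
import torch.nn as nn

class PhonemeFineTuner(nn.Module):
    def __init__(self, encoder: nn.Module, feat_dim: int = 768, n_phonemes: int = 40):
        super().__init__()
        self.encoder = encoder                       # pretrained SSL speech encoder (placeholder)
        self.head = nn.Linear(feat_dim, n_phonemes)  # frame-level phoneme classifier

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(frames)                 # (B, T, D) frame features
        return self.head(feats)                      # (B, T, n_phonemes) logits

# One training step with cross-entropy on frame-aligned labels.
encoder = nn.Sequential(nn.Linear(160, 768))          # stand-in for a real SSL model
model = PhonemeFineTuner(encoder)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

frames = torch.randn(2, 100, 160)                     # (B, T, frame_dim), dummy input
labels = torch.randint(0, 40, (2, 100))                # frame-aligned phoneme ids
loss = nn.functional.cross_entropy(model(frames).reshape(-1, 40), labels.reshape(-1))
loss.backward()
opt.step()
print(float(loss))
```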
arXiv Detail & Related papers (2024-09-16T10:29:15Z) - Self-Supervised Syllable Discovery Based on Speaker-Disentangled HuBERT [10.18337180909434]
Self-supervised speech representation learning has become essential for extracting meaningful features from untranscribed audio.
We propose a speech-only self-supervised fine-tuning approach that separates syllabic units from speaker information.
arXiv Detail & Related papers (2024-09-16T09:07:08Z) - dMel: Speech Tokenization made Simple [19.169460770473908]
We show that discretizing mel-filterbank channels into discrete intensity bins produces a simple representation (dMel)
Our results demonstrate the effectiveness of dMel in achieving high performance on both tasks within a unified framework.
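Reading the summary literally, dMel-style tokenization amounts to binning each log-mel value into one of K intensity levels. The bin count and value range below are illustrative choices, not the paper's settings:

```python
# Sketch: quantize log-mel values into uniform intensity bins and decode back.
import numpy as np

def dmel_tokens(log_mel: np.ndarray, n_bins: int = 16,
                lo: float = -10.0, hi: float = 2.0) -> np.ndarray:
    """log_mel: (T, n_mels) log-mel spectrogram -> (T, n_mels) integer bin ids."""
    edges = np.linspace(lo, hi, n_bins + 1)[1:-1]          # n_bins - 1 interior edges
    return np.digitize(np.clip(log_mel, lo, hi), edges)    # ids in [0, n_bins - 1]

def dmel_decode(tokens: np.ndarray, n_bins: int = 16,
                lo: float = -10.0, hi: float = 2.0) -> np.ndarray:
    """Map bin ids back to bin-center intensities (lossy inverse)."""
    centers = lo + (np.arange(n_bins) + 0.5) * (hi - lo) / n_bins
    return centers[tokens]

log_mel = np.random.uniform(-10, 2, size=(200, 80))        # dummy 80-channel log-mel
tok = dmel_tokens(log_mel)
rec = dmel_decode(tok)
print(tok.shape, tok.min(), tok.max(), np.abs(rec - log_mel).mean())
```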
arXiv Detail & Related papers (2024-07-22T17:51:53Z) - CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
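Inserting vector quantization into an encoder generally means adding a codebook lookup between encoder layers so that each frame snaps to its nearest code. The toy bottleneck below illustrates that standard mechanism; the codebook size, dimensions, and straight-through trick are generic VQ choices, not CosyVoice specifics:

```python
# Toy vector-quantization bottleneck of the kind one could insert inside an
# ASR encoder to obtain discrete "semantic" tokens.
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    def __init__(self, dim: int = 512, codebook_size: int = 4096):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, h: torch.Tensor):
        # h: (B, T, D) encoder states; snap each frame to its nearest codebook entry
        codes = self.codebook.weight.unsqueeze(0).expand(h.size(0), -1, -1)  # (B, K, D)
        ids = torch.cdist(h, codes).argmin(dim=-1)                           # (B, T) token ids
        q = self.codebook(ids)                                               # quantized states
        q = h + (q - h).detach()                                             # straight-through gradient
        return q, ids

vq = VQBottleneck()
states = torch.randn(2, 50, 512)             # hidden states from the encoder's first half
quantized, token_ids = vq(states)             # ids feed the LLM; quantized states continue onward
print(token_ids.shape, quantized.shape)
```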
arXiv Detail & Related papers (2024-07-07T15:16:19Z) - SpeechAlign: Aligning Speech Generation to Human Preferences [51.684183257809075]
We introduce SpeechAlign, an iterative self-improvement strategy that aligns speech language models to human preferences.
We show that SpeechAlign can bridge the distribution gap and facilitate continuous self-improvement of the speech language model.
arXiv Detail & Related papers (2024-04-08T15:21:17Z) - SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT [49.06057768982775]
We show that a syllabic organization emerges in learning sentence-level representation of speech.
We propose a new benchmark task, Spoken Speech ABX, for evaluating sentence-level representation of speech.
arXiv Detail & Related papers (2023-10-16T20:05:36Z) - SpeechTokenizer: Unified Speech Tokenizer for Speech Large Language Models [58.996653700982556]
Existing speech tokens are not specifically designed for speech language modeling.
We propose SpeechTokenizer, a unified speech tokenizer for speech large language models.
Experiments show that SpeechTokenizer performs comparably to EnCodec in speech reconstruction and demonstrates strong performance on the SLMTokBench benchmark.
arXiv Detail & Related papers (2023-08-31T12:53:09Z) - Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model [21.286529902957724]
We show that representations capturing syllabic units emerge when training a self-supervised speech model with a visually-grounded training objective.
We show that our model not only outperforms a state-of-the-art syllabic segmentation method on the language it was trained on (English), but also generalizes in a zero-shot fashion to Estonian.
arXiv Detail & Related papers (2023-05-19T05:19:04Z) - token2vec: A Joint Self-Supervised Pre-training Framework Using Unpaired Speech and Text [65.04385919645395]
token2vec is a novel joint pre-training framework for unpaired speech and text based on discrete representations of speech.
Experiments show that token2vec is significantly superior to various speech-only pre-training baselines, with up to 17.7% relative WER reduction.
arXiv Detail & Related papers (2022-10-30T06:38:19Z) - SpeechLM: Enhanced Speech Pre-Training with Unpaired Textual Data [100.46303484627045]
We propose a cross-modal Speech and Language Model (SpeechLM) to align speech and text pre-training with a pre-defined unified representation.
Specifically, we introduce two alternative discrete tokenizers to bridge the speech and text modalities.
We evaluate SpeechLM on various spoken language processing tasks including speech recognition, speech translation, and universal representation evaluation framework SUPERB.
arXiv Detail & Related papers (2022-09-30T09:12:10Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
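The "repeatedly masks and predicts unit choices" description matches the familiar mask-predict (CMLM-style) decoding loop: predict every target unit in parallel, keep the confident positions, re-mask the rest, and repeat. The loop below is a generic version of that procedure with a placeholder model, not the paper's exact decoder:

```python
# Generic mask-predict decoding loop; `model` stands in for the real
# non-autoregressive speech-to-unit decoder.
import torch

def mask_predict(model, src, tgt_len: int, mask_id: int, iters: int = 4):
    tokens = torch.full((1, tgt_len), mask_id, dtype=torch.long)
    scores = torch.zeros(1, tgt_len)
    for it in range(iters):
        probs = model(src, tokens).softmax(-1)               # (1, tgt_len, vocab)
        conf, pred = probs.max(-1)                           # per-position confidence and argmax unit
        masked = tokens.eq(mask_id)
        tokens[masked] = pred[masked]                        # fill the masked slots
        scores[masked] = conf[masked]
        n_mask = int(tgt_len * (1 - (it + 1) / iters))       # fewer masks each iteration
        if n_mask == 0:
            break
        remask = scores.topk(n_mask, largest=False).indices  # least confident positions
        tokens[0, remask[0]] = mask_id
    return tokens

# Dummy model returning random logits over a 1000-unit vocabulary.
dummy = lambda src, tok: torch.randn(1, tok.size(1), 1000)
units = mask_predict(dummy, src=None, tgt_len=20, mask_id=999)
print(units.shape)
```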
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z) - Tokenwise Contrastive Pretraining for Finer Speech-to-BERT Alignment in End-to-End Speech-to-Intent Systems [31.18865184576272]
This work is a step towards the same goal in a much more efficient and fine-grained manner: aligning speech embeddings and BERT embeddings on a token-by-token basis.
We introduce a simple yet novel technique that uses a cross-modal attention mechanism to extract token-level contextual embeddings from a speech encoder.
Fine-tuning such a pretrained model to perform intent recognition using speech directly yields state-of-the-art performance on two widely used SLU datasets.
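Once speech-side token embeddings have been extracted (the cross-modal attention step is abstracted away here), aligning them with BERT token embeddings token-by-token can be expressed as a symmetric InfoNCE-style loss. A hypothetical minimal form:

```python
# Token-by-token contrastive alignment between speech and BERT token embeddings.
import torch
import torch.nn.functional as F

def tokenwise_contrastive_loss(speech_tok: torch.Tensor, text_tok: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """speech_tok, text_tok: (N, D) embeddings for the same N aligned tokens."""
    s = F.normalize(speech_tok, dim=-1)
    t = F.normalize(text_tok, dim=-1)
    logits = s @ t.T / temperature                   # (N, N) similarity matrix
    targets = torch.arange(len(s))                   # i-th speech token matches i-th text token
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2

loss = tokenwise_contrastive_loss(torch.randn(32, 768), torch.randn(32, 768))
print(float(loss))
```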
arXiv Detail & Related papers (2022-04-11T15:24:25Z) - Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
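The discrete target units in this line of work are typically obtained by k-means clustering of self-supervised frame features (e.g., HuBERT), so that each frame maps to a unit id. A minimal version of that unit-extraction step, with dummy features standing in for real SSL outputs:

```python
# Derive discrete speech units by k-means clustering of SSL frame features.
import numpy as np
from sklearn.cluster import KMeans

feats = np.random.randn(5000, 768).astype(np.float32)    # dummy HuBERT-style frame features
km = KMeans(n_clusters=100, n_init=10, random_state=0).fit(feats)
units = km.predict(feats[:200])                           # discrete unit id per frame
print(units[:20])
```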
arXiv Detail & Related papers (2021-07-12T17:40:43Z) - Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z) - Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)