Benchmarking Prosody Encoding in Discrete Speech Tokens
- URL: http://arxiv.org/abs/2508.11224v1
- Date: Fri, 15 Aug 2025 05:11:16 GMT
- Title: Benchmarking Prosody Encoding in Discrete Speech Tokens
- Authors: Kentaro Onda, Satoru Fukayama, Daisuke Saito, Nobuaki Minematsu
- Abstract summary: This study analyzes how well discrete tokens encode prosody, based on their sensitivity to artificially modified prosody, aiming to provide practical guidelines for designing discrete tokens. In particular, speech language models are expected to understand and generate responses that reflect not only the semantic content but also prosodic features.
- Score: 13.60092490447892
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, discrete tokens derived from self-supervised learning (SSL) models via k-means clustering have been actively studied as pseudo-text in speech language models and as efficient intermediate representations for various tasks. However, these discrete tokens are typically learned in advance, separately from the training of language models or downstream tasks. As a result, choices related to discretization, such as the SSL model used or the number of clusters, must be made heuristically. In particular, speech language models are expected to understand and generate responses that reflect not only the semantic content but also prosodic features. Yet, there has been limited research on the ability of discrete tokens to capture prosodic information. To address this gap, this study conducts a comprehensive analysis of prosodic encoding, evaluating discrete tokens by their sensitivity to artificially modified prosody, and aims to provide practical guidelines for designing discrete tokens.
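To make the pipeline described above concrete, here is a minimal sketch (not the paper's implementation) of how discrete tokens are commonly derived from an SSL model via k-means clustering, and of one way their sensitivity to an artificial prosody edit (a pitch shift) might be probed. The model name, layer index, cluster count, file paths, and the frame-level change-rate metric are illustrative assumptions, not choices taken from the paper.

```python
# Sketch: derive discrete speech tokens from SSL features via k-means,
# then probe how much the token sequence changes under a pitch edit.
# All specific settings below (model, layer, k, paths) are assumptions.

import numpy as np
import librosa
import torch
from sklearn.cluster import KMeans
from transformers import HubertModel, Wav2Vec2FeatureExtractor

MODEL_NAME = "facebook/hubert-base-ls960"   # assumed SSL model
LAYER = 6                                   # assumed intermediate layer
N_CLUSTERS = 100                            # assumed k-means vocabulary size

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
model = HubertModel.from_pretrained(MODEL_NAME).eval()

def ssl_features(wave: np.ndarray, sr: int = 16000) -> np.ndarray:
    """Return frame-level features (T, D) from one hidden layer of the SSL model."""
    inputs = extractor(wave, sampling_rate=sr, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    return out.hidden_states[LAYER].squeeze(0).numpy()

# Fit k-means on features pooled over a (tiny, illustrative) set of utterances.
waves = [librosa.load(p, sr=16000)[0] for p in ["utt1.wav", "utt2.wav"]]  # placeholder paths
kmeans = KMeans(n_clusters=N_CLUSTERS, n_init=10, random_state=0)
kmeans.fit(np.concatenate([ssl_features(w) for w in waves], axis=0))

def tokens(wave: np.ndarray) -> np.ndarray:
    """Map an utterance to its discrete (pseudo-text) token sequence."""
    return kmeans.predict(ssl_features(wave))

# Sensitivity probe: pitch-shift one utterance and count how many frame tokens change.
orig = waves[0]
shifted = librosa.effects.pitch_shift(orig, sr=16000, n_steps=2)  # +2 semitones
t_orig, t_shift = tokens(orig), tokens(shifted)
n = min(len(t_orig), len(t_shift))
change_rate = float(np.mean(t_orig[:n] != t_shift[:n]))
print(f"fraction of frame tokens changed by the pitch edit: {change_rate:.3f}")
```

A low change rate would suggest the tokens largely discard pitch (prosody), while a high rate would suggest they encode it; the paper's actual analysis and metrics may differ from this simple frame-level comparison.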
Related papers
- ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
arXiv Detail & Related papers (2025-07-27T00:59:01Z)
- A Variational Framework for Improving Naturalness in Generative Spoken Language Models [52.673912922590866]
We propose an end-to-end variational approach that automatically learns to encode continuous speech attributes to enhance semantic tokens. Our approach eliminates the need for manual extraction and selection of paralinguistic features. It produces preferred speech continuations according to human raters.
arXiv Detail & Related papers (2025-06-17T17:58:17Z)
- Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [23.059241057567956]
This paper unifies two types of tokens and proposes UniCodec, a universal speech token learning method that encapsulates all semantics of speech. A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features.
arXiv Detail & Related papers (2025-03-15T12:50:43Z)
- Token Assorted: Mixing Latent and Text Tokens for Improved Language Model Reasoning [53.57895922042783]
Large Language Models (LLMs) excel at reasoning and planning when trained on chain-of-thought (CoT) data. We propose a hybrid representation of the reasoning process, where we partially abstract away the initial reasoning steps using latent discrete tokens.
arXiv Detail & Related papers (2025-02-05T15:33:00Z)
- Signs as Tokens: A Retrieval-Enhanced Multilingual Sign Language Generator [55.94334001112357]
We introduce a multilingual sign language model, Signs as Tokens (SOKE), which can generate 3D sign avatars autoregressively from text inputs. We propose a retrieval-enhanced SLG approach, which incorporates external sign dictionaries to provide accurate word-level signs.
arXiv Detail & Related papers (2024-11-26T18:28:09Z)
- A Comparative Study of Discrete Speech Tokens for Semantic-Related Tasks with Large Language Models [46.298114175792584]
We present a fair and thorough comparison between discrete and continuous features across a variety of semantic-related tasks.
Our findings reveal that continuous features generally outperform discrete tokens, particularly in tasks requiring fine-grained semantic understanding.
arXiv Detail & Related papers (2024-11-13T16:20:20Z)
- Do Discrete Self-Supervised Representations of Speech Capture Tone Distinctions? [13.197705351799215]
We evaluate whether discrete symbols adequately capture tone in two example languages, Mandarin and Yoruba.
We find that using discrete symbols leads to a substantial loss of tone information, even for language-specialised SSL models.
arXiv Detail & Related papers (2024-10-25T19:13:25Z)
- A Comparative Study of Continuous Sign Language Recognition Techniques [1.534667887016089]
Continuous Sign Language Recognition (CSLR) focuses on the interpretation of a sequence of sign language gestures performed continually without pauses.
In this study, we conduct an empirical evaluation of recent deep learning CSLR techniques and assess their performance across various datasets and sign languages.
arXiv Detail & Related papers (2024-06-18T07:51:44Z)
- How Should We Extract Discrete Audio Tokens from Self-Supervised Models? [15.03039528965825]
This paper explores the optimal configuration of semantic tokens across discriminative and generative tasks.
We propose a scalable solution to train a universal vocoder across multiple SSL layers.
arXiv Detail & Related papers (2024-06-15T20:43:07Z)
- Identifying and Analyzing Performance-Critical Tokens in Large Language Models [52.404072802235234]
We study how large language models learn to perform tasks from demonstrations. Our work sheds light on this learning process and deepens our understanding of the roles different types of tokens play in large language models.
arXiv Detail & Related papers (2024-01-20T20:55:21Z)
- Interpreting Language Models with Contrastive Explanations [99.7035899290924]
Language models must consider various features to predict a token, such as its part of speech, number, tense, or semantics.
Existing explanation methods conflate evidence for all these features into a single explanation, which is less interpretable for human understanding.
We show that contrastive explanations are quantifiably better than non-contrastive explanations in verifying major grammatical phenomena.
arXiv Detail & Related papers (2022-02-21T18:32:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.