Optimizing Speech Language Models for Acoustic Consistency
- URL: http://arxiv.org/abs/2509.26276v1
- Date: Tue, 30 Sep 2025 13:59:52 GMT
- Title: Optimizing Speech Language Models for Acoustic Consistency
- Authors: Morteza Rohanian, Michael Krauthammer
- Abstract summary: We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives.
- Score: 2.5864269455844484
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: We study speech language models that incorporate semantic initialization and planning losses to achieve robust and consistent generation. Our approach initializes speech tokens with self-supervised features, applies a light alignment loss, and trains with thinning and auxiliary objectives that target robustness and content planning. We train three models: a 0.7B speech-only model, a 1.0B speech-only model, and a 1.0B interleaved model with both text and speech. Acoustic studies show that the speech-only models achieve the highest consistency across speaker, gender, sentiment, room, and background factors, surpassing larger systems. Interleaving improves lexical and syntactic probes and semantic--acoustic alignment but reduces consistency. Linear probes show that our initialization biases the model toward content structure while trading off prosody detail. These results show that LM-side design and training mix control the balance between acoustic stability and semantic grounding without changes to the tokenizer or runtime architecture. A demo and model weights are available for exploration.
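To make the abstract's "semantic initialization" and "light alignment loss" more concrete, here is a minimal sketch, assuming the speech-token embedding table is seeded from self-supervised cluster centroids (e.g., k-means over HuBERT-style features) and nudged back toward them by a small cosine penalty; all names, dimensions, and the loss weight are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

# Hypothetical setup: one self-supervised feature centroid per speech token
# (e.g., k-means centroids of HuBERT-style features), here just random data.
num_speech_tokens, ssl_dim, lm_dim = 500, 768, 1024
ssl_centroids = torch.randn(num_speech_tokens, ssl_dim)

# Project the centroids into the LM embedding space and use them to
# initialize the speech-token rows of the embedding table.
proj = torch.nn.Linear(ssl_dim, lm_dim, bias=False)
speech_embed = torch.nn.Embedding(num_speech_tokens, lm_dim)
with torch.no_grad():
    speech_embed.weight.copy_(proj(ssl_centroids))

def alignment_loss(weight: float = 0.1) -> torch.Tensor:
    """Light auxiliary term keeping embeddings close to their SSL anchors."""
    anchors = proj(ssl_centroids).detach()
    cos = F.cosine_similarity(speech_embed.weight, anchors, dim=-1)
    return weight * (1.0 - cos).mean()

# During training, this term would simply be added to the speech LM's
# next-token cross-entropy loss.
```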
Related papers
- Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens [62.56027815951259]
Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale.
arXiv Detail & Related papers (2026-02-18T18:32:46Z) - ELEGANCE: Efficient LLM Guidance for Audio-Visual Target Speech Extraction [88.41471266579333]
We propose ELEGANCE, a novel framework that incorporates linguistic knowledge from large language models (LLMs) into AV-TSE models. Comprehensive experiments with RoBERTa, Qwen3-0.6B, and Qwen3-4B on two AV-TSE backbones show significant improvements.
arXiv Detail & Related papers (2025-11-09T08:50:11Z) - DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech [8.537791317883576]
Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. We introduce DiFlow-TTS, which is the first model to explore purely Discrete Flow Matching for speech synthesis.
arXiv Detail & Related papers (2025-09-11T17:16:52Z) - Tracking Articulatory Dynamics in Speech with a Fixed-Weight BiLSTM-CNN Architecture [0.0]
This paper presents a novel approach for predicting the tongue and lip articulatory features involved in given speech acoustics. The proposed network is trained on two datasets of simultaneously recorded speech and Electromagnetic Articulography (EMA) data.
arXiv Detail & Related papers (2025-04-25T05:57:22Z) - TASTE: Text-Aligned Speech Tokenization and Embedding for Spoken Language Modeling [46.60911294356232]
We introduce Text-Aligned Speech Tokenization and Embedding (TASTE) to align speech tokens with the corresponding text transcription during the tokenization stage. We conduct extensive experiments and show that TASTE can preserve essential paralinguistic information while dramatically reducing the token sequence length. Experimental results show that TASTE-based SLMs perform comparably to previous work on SALMON and StoryCloze.
arXiv Detail & Related papers (2025-04-09T17:14:33Z) - Sylber: Syllabic Embedding Representation of Speech from Raw Audio [25.703703711031178]
We propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised learning framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling.
arXiv Detail & Related papers (2024-10-09T17:59:04Z) - Leveraging the Interplay Between Syntactic and Acoustic Cues for Optimizing Korean TTS Pause Formation [6.225927189801006]
We propose a novel framework that incorporates comprehensive modeling of both syntactic and acoustic cues that are associated with pausing patterns.
Remarkably, our framework consistently generates natural speech even for considerably longer and more intricate out-of-domain (OOD) sentences.
arXiv Detail & Related papers (2024-04-03T09:17:38Z) - Integrating Self-supervised Speech Model with Pseudo Word-level Targets from Visually-grounded Speech Model [57.78191634042409]
We propose Pseudo-Word HuBERT (PW-HuBERT), a framework that integrates pseudo word-level targets into the training process.
Our experimental results on four spoken language understanding (SLU) benchmarks suggest the superiority of our model in capturing semantic information.
arXiv Detail & Related papers (2024-02-08T16:55:21Z) - Self-Supervised Models of Speech Infer Universal Articulatory Kinematics [44.27187669492598]
We show "inference of articulatory kinematics" as fundamental property of SSL models.
We also show that this abstraction is largely overlapping across the language of the data used to train the model.
We show that with simple affine transformations, Acoustic-to-Articulatory inversion (AAI) is transferrable across speakers, even across genders, languages, and dialects.
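The "simple affine transformations" claim has a direct numerical reading: fit a map y ≈ xW + b between articulatory trajectories by least squares and reuse the AAI model across speakers. Below is a minimal sketch with synthetic placeholder arrays; the data shapes and variable names are assumptions for illustration only.

```python
import numpy as np

# Placeholder data: per-frame articulatory features for the same utterances,
# `src` from an AAI model trained on one speaker, `tgt` measured (EMA) for another.
rng = np.random.default_rng(0)
src = rng.normal(size=(1000, 12))
tgt = src @ rng.normal(size=(12, 12)) + 0.3

# Fit tgt ~ src @ W + b with ordinary least squares (bias via an appended column).
X = np.hstack([src, np.ones((len(src), 1))])
coef, *_ = np.linalg.lstsq(X, tgt, rcond=None)
W, b = coef[:-1], coef[-1]

transferred = src @ W + b
rmse = np.sqrt(np.mean((transferred - tgt) ** 2))
print(f"RMSE after affine transfer: {rmse:.4f}")
```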
arXiv Detail & Related papers (2023-10-16T19:50:01Z) - Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding 9.56% and 8.24% average reductions in EER and minDCF, respectively.
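For context on the reported metric, the snippet below shows a simple, common approximation of the equal error rate (EER) computed from verification scores; the scores and labels are synthetic placeholders and the paper's exact evaluation protocol is not reproduced here.

```python
import numpy as np

def equal_error_rate(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: the operating point where false-accept and false-reject rates meet."""
    order = np.argsort(scores)[::-1]          # sweep threshold from high to low
    labels = labels[order].astype(bool)
    tp = np.cumsum(labels)                    # accepted target trials
    fp = np.cumsum(~labels)                   # accepted impostor trials
    frr = 1.0 - tp / labels.sum()             # false rejection rate
    far = fp / (~labels).sum()                # false acceptance rate
    idx = np.argmin(np.abs(far - frr))
    return float((far[idx] + frr[idx]) / 2.0)

# Synthetic example: target trials tend to score higher than impostor trials.
rng = np.random.default_rng(1)
scores = np.concatenate([rng.normal(1.0, 1.0, 500), rng.normal(-1.0, 1.0, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])
print(f"EER ~ {equal_error_rate(scores, labels):.3f}")
```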
arXiv Detail & Related papers (2023-10-02T12:02:07Z) - Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM based on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
arXiv Detail & Related papers (2021-10-31T22:48:30Z) - SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze the input acoustic signal to understand its linguistic content and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site.