Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling
- URL: http://arxiv.org/abs/2508.09350v1
- Date: Tue, 12 Aug 2025 21:25:37 GMT
- Title: Flow-SLM: Joint Learning of Linguistic and Acoustic Information for Spoken Language Modeling
- Authors: Ju-Chieh Chou, Jiawei Zhou, Karen Livescu
- Abstract summary: Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. We propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame.
- Score: 23.374370061220763
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Textless spoken language models (SLMs) are generative models of speech that do not rely on text supervision. Most textless SLMs learn to predict the next semantic token, a discrete representation of linguistic content, and rely on a separate vocoder to add acoustic information to the generated speech. Such models have no access to acoustic context and no built-in control over acoustic details. In this work, we propose to jointly model linguistic and acoustic information by generating semantic tokens and a continuous real-valued representation of the acoustic frame. We use a flow-matching objective to predict the continuous vector conditioned on the semantic tokens. We study the design space of this approach and find that predicting multiple future semantic tokens helps preserve linguistic information. Our approach achieves performance comparable to existing models on linguistic likelihood benchmarks, while providing better acoustic detail in prompted generation.
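As a rough illustration of the approach described above, here is a minimal PyTorch-style sketch of a conditional flow-matching loss for the continuous acoustic frame; the network shape, feature dimensions, and the simple linear interpolation path are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class ConditionalFlowMatcher(nn.Module):
    """Toy velocity predictor: given a noisy frame, a time step, and the
    language model's conditioning vector (e.g., the hidden state after the
    semantic token for this step), predict the velocity toward the clean frame."""

    def __init__(self, frame_dim=80, cond_dim=512, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(frame_dim + cond_dim + 1, hidden),
            nn.SiLU(),
            nn.Linear(hidden, frame_dim),
        )

    def forward(self, x_t, t, cond):
        # x_t: (B, frame_dim), t: (B, 1), cond: (B, cond_dim)
        return self.net(torch.cat([x_t, t, cond], dim=-1))

def flow_matching_loss(model, x1, cond):
    """x1: (B, frame_dim) target continuous acoustic frames."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.size(0), 1, device=x1.device)  # random time in [0, 1)
    x_t = (1 - t) * x0 + t * x1                      # linear interpolation path
    v_target = x1 - x0                               # constant velocity along that path
    return ((model(x_t, t, cond) - v_target) ** 2).mean()
```

At generation time one would integrate the learned velocity field from a noise sample to a frame vector (for example with a few Euler steps) before passing the result to a vocoder, while the semantic tokens themselves are still sampled autoregressively.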
Related papers
- Universal Speech Token Learning via Low-Bitrate Neural Codec and Pretrained Representations [32.83743219965261]
This paper unifies two types of tokens and proposes UniCodec, a universal speech token learning approach that encapsulates all semantics of speech. A low-bitrate neural codec is leveraged to learn such disentangled discrete representations at global and local scales, with knowledge distilled from self-supervised learned features.
arXiv Detail & Related papers (2025-03-15T12:50:43Z)
- Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback [50.84142264245052]
This work introduces the Align-SLM framework to enhance the semantic understanding of textless Spoken Language Models (SLMs). Our approach generates multiple speech continuations from a given prompt and uses semantic metrics to create preference data for Direct Preference Optimization (DPO). We evaluate the framework using ZeroSpeech 2021 benchmarks for lexical and syntactic modeling, the spoken version of the StoryCloze dataset for semantic coherence, and other speech generation metrics, including the GPT4-o score and human evaluation.
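A hypothetical sketch of how such preference data and the DPO objective could look; `generate` and `score` are caller-supplied stand-ins for the SLM sampler and the semantic metric, not Align-SLM's actual components.

```python
import torch.nn.functional as F

def build_dpo_pairs(generate, score, prompts, n_samples=4):
    """generate(prompt) -> speech continuation; score(prompt, cont) -> float.
    Sample several continuations per prompt and keep the best/worst pair."""
    pairs = []
    for prompt in prompts:
        candidates = [generate(prompt) for _ in range(n_samples)]
        ranked = sorted(candidates, key=lambda c: score(prompt, c))
        pairs.append({"prompt": prompt, "chosen": ranked[-1], "rejected": ranked[0]})
    return pairs

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO objective on sequence log-probabilities under the policy
    and a frozen reference model."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```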
arXiv Detail & Related papers (2024-11-04T06:07:53Z)
- dMel: Speech Tokenization made Simple [16.679015298503593]
We introduce a novel speech representation (dMel) that discretizes mel-filterbank channels into intensity bins. Our approach demonstrates superior performance in preserving audio content and robustness to out-of-domain data, and offers a training-free, natural, and streamable representation.
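The binning idea is concrete enough to sketch directly; the bin count, clipping range, and the assumption of a log-mel front end below are illustrative choices, not necessarily dMel's published settings.

```python
import torch

def discretize_mel(log_mel, num_bins=16, floor=-10.0, ceil=2.0):
    """Map each log-mel filterbank value to one of `num_bins` intensity bins.
    log_mel: (..., n_mels) tensor of log-mel energies; returns integer bin ids."""
    scaled = (log_mel.clamp(floor, ceil) - floor) / (ceil - floor)   # -> [0, 1]
    return (scaled * (num_bins - 1)).round().long()                  # -> {0..num_bins-1}

def undiscretize_mel(bin_ids, num_bins=16, floor=-10.0, ceil=2.0):
    """Approximate inverse: map bin indices back to log-mel values (bin centers)."""
    return floor + bin_ids.float() / (num_bins - 1) * (ceil - floor)
```

Because the mapping is a fixed per-channel quantizer, it needs no training and can be applied frame by frame in a streaming fashion, which is what the summary above refers to.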
arXiv Detail & Related papers (2024-07-22T17:51:53Z)
- CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
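A minimal sketch of the "vector quantization inserted into an ASR encoder" idea from this summary; the layer sizes and the placement of the bottleneck are assumptions for illustration, not CosyVoice's actual configuration.

```python
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    """Quantizes mid-encoder frames; the resulting discrete ids serve as
    supervised semantic tokens because the ASR loss trains them."""
    def __init__(self, dim=512, codebook_size=4096):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, h):                                  # h: (B, T, dim)
        dists = torch.cdist(h, self.codebook.weight.unsqueeze(0).expand(h.size(0), -1, -1))
        tokens = dists.argmin(-1)                          # (B, T) semantic token ids
        h_q = self.codebook(tokens)
        return h + (h_q - h).detach(), tokens              # straight-through estimator

class ToyASREncoder(nn.Module):
    """Lower layers -> VQ bottleneck -> upper layers; an ASR decoder (omitted)
    would be trained on the output, supervising the discrete tokens."""
    def __init__(self, dim=512):
        super().__init__()
        self.lower = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.vq = VQBottleneck(dim)
        self.upper = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)

    def forward(self, feats):                              # feats: (B, T, dim)
        h_q, tokens = self.vq(self.lower(feats))
        return self.upper(h_q), tokens
```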
arXiv Detail & Related papers (2024-07-07T15:16:19Z)
- DeSTA: Enhancing Speech Language Models through Descriptive Speech-Text Alignment [82.86363991170546]
We propose a Descriptive Speech-Text Alignment approach that leverages speech captioning to bridge the gap between speech and text modalities.
Our model demonstrates superior performance on the Dynamic-SUPERB benchmark, particularly in generalizing to unseen tasks.
These findings highlight the potential to reshape instruction-following SLMs by incorporating descriptive, rich speech captions.
arXiv Detail & Related papers (2024-06-27T03:52:35Z)
- Wave to Syntax: Probing spoken language models for syntax [16.643072915927313]
We focus on the encoding of syntax in several self-supervised and visually grounded models of spoken language.
We show that syntax is captured most prominently in the middle layers of the networks, and more explicitly within models with more parameters.
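A generic version of such a layer-wise probe (not the paper's exact protocol): fit a small classifier on frozen features from each layer and compare accuracies, so that more linearly decodable syntax shows up as higher probe accuracy.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def probe_layers(layer_reps, labels):
    """layer_reps: {layer_index: (N, D) array of frozen utterance features};
    labels: (N,) syntactic labels (e.g., grammatical vs. ungrammatical)."""
    scores = {}
    for layer, X in layer_reps.items():
        X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores[layer] = clf.score(X_te, y_te)
    return scores
```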
arXiv Detail & Related papers (2023-05-30T11:43:18Z)
- Integrating Form and Meaning: A Multi-Task Learning Model for Acoustic Word Embeddings [19.195728241989702]
We propose a multi-task learning model that incorporates top-down lexical knowledge into the training procedure of acoustic word embeddings.
We experiment with three languages and demonstrate that incorporating lexical knowledge improves the embedding space discriminability.
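A minimal sketch of what such a multi-task objective could look like, assuming a shared encoder with an embedding head and a word classifier as the source of top-down lexical knowledge; the specific losses and their weighting are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskAWE(nn.Module):
    """Shared encoder with two heads: an acoustic word embedding and a
    word-label classifier that injects lexical knowledge."""
    def __init__(self, feat_dim=40, emb_dim=128, vocab_size=5000):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, emb_dim, batch_first=True)
        self.classifier = nn.Linear(emb_dim, vocab_size)

    def forward(self, feats):                        # feats: (B, T, feat_dim)
        _, h = self.encoder(feats)                   # h: (1, B, emb_dim)
        emb = F.normalize(h.squeeze(0), dim=-1)      # acoustic word embedding
        return emb, self.classifier(h.squeeze(0))

def multitask_loss(emb_a, emb_b, logits, word_ids, alpha=0.5):
    """emb_a/emb_b: embeddings of two segments of the same word type; the first
    term pulls them together (a stand-in for the usual AWE objective), and the
    cross-entropy term adds the top-down word labels."""
    matching = 1.0 - F.cosine_similarity(emb_a, emb_b, dim=-1).mean()
    lexical = F.cross_entropy(logits, word_ids)
    return alpha * matching + (1 - alpha) * lexical
```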
arXiv Detail & Related papers (2022-09-14T13:33:04Z)
- Towards Language Modelling in the Speech Domain Using Sub-word Linguistic Units [56.52704348773307]
We propose a novel LSTM-based generative speech LM that operates on linguistic units including syllables and phonemes.
With a limited dataset, orders of magnitude smaller than that required by contemporary generative models, our model closely approximates babbling speech.
We show the effect of training with auxiliary text LMs, multitask learning objectives, and auxiliary articulatory features.
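A bare-bones sketch of an LSTM language model over sub-word linguistic units (phoneme or syllable IDs); the unit inventory and layer sizes are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubwordUnitLM(nn.Module):
    """Next-unit prediction over a small inventory of phonemes or syllables."""
    def __init__(self, num_units=200, emb_dim=128, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(num_units, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2, batch_first=True)
        self.out = nn.Linear(hidden, num_units)

    def forward(self, unit_ids):                     # unit_ids: (B, T) integer ids
        h, _ = self.lstm(self.embed(unit_ids))
        return self.out(h)                           # (B, T, num_units) logits

def next_unit_loss(logits, unit_ids):
    """Shift targets by one step for next-unit prediction."""
    return F.cross_entropy(logits[:, :-1].reshape(-1, logits.size(-1)),
                           unit_ids[:, 1:].reshape(-1))
```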
arXiv Detail & Related papers (2021-10-31T22:48:30Z)
- Wav-BERT: Cooperative Acoustic and Linguistic Representation Learning for Low-Resource Speech Recognition [159.9312272042253]
Wav-BERT is a cooperative acoustic and linguistic representation learning method.
We unify a pre-trained acoustic model (wav2vec 2.0) and a language model (BERT) into an end-to-end trainable framework.
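One plausible way to wire a pre-trained acoustic model and a pre-trained language model into a single trainable graph; this composition is an illustrative guess at the idea, not the actual Wav-BERT architecture.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

class ToyWavBert(nn.Module):
    """wav2vec 2.0 encodes the waveform; a projection maps its frames into
    BERT's embedding space; BERT then contextualizes them, trained end to end."""
    def __init__(self):
        super().__init__()
        self.acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.linguistic = BertModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.acoustic.config.hidden_size,
                              self.linguistic.config.hidden_size)

    def forward(self, waveform):                     # waveform: (B, num_samples)
        frames = self.acoustic(waveform).last_hidden_state
        # Note: long audio may need downsampling so the frame count stays within
        # BERT's 512-position limit; omitted here for brevity.
        return self.linguistic(inputs_embeds=self.proj(frames)).last_hidden_state
```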
arXiv Detail & Related papers (2021-09-19T16:39:22Z)
- SPLAT: Speech-Language Joint Pre-Training for Spoken Language Understanding [61.02342238771685]
Spoken language understanding requires a model to analyze an input acoustic signal, understand its linguistic content, and make predictions.
Various pre-training methods have been proposed to learn rich representations from large-scale unannotated speech and text.
We propose a novel semi-supervised learning framework, SPLAT, to jointly pre-train the speech and language modules.
arXiv Detail & Related papers (2020-10-05T19:29:49Z)
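As one generic way the speech and language modules mentioned in the SPLAT summary above can be tied together on paired data during joint pre-training (SPLAT's actual objectives differ; this is only an illustrative alignment term):

```python
import torch
import torch.nn.functional as F

def speech_text_alignment_loss(speech_vecs, text_vecs, temperature=0.07):
    """Pull each paired speech/text sentence vector together and push apart the
    rest of the batch (a simple contrastive stand-in for a joint objective)."""
    speech_vecs = F.normalize(speech_vecs, dim=-1)      # (B, D)
    text_vecs = F.normalize(text_vecs, dim=-1)          # (B, D)
    logits = speech_vecs @ text_vecs.t() / temperature  # (B, B) similarities
    targets = torch.arange(speech_vecs.size(0), device=speech_vecs.device)
    return F.cross_entropy(logits, targets)
```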
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.