Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
- URL: http://arxiv.org/abs/2507.09834v1
- Date: Mon, 14 Jul 2025 00:14:54 GMT
- Title: Generative Audio Language Modeling with Continuous-valued Tokens and Masked Next-Token Prediction
- Authors: Shu-wen Yang, Byeonggeun Kim, Kuan-Po Huang, Qingming Tang, Huy Phan, Bo-Ru Lu, Harsha Sundar, Shalini Ghosh, Hung-yi Lee, Chieh-Chi Kao, Chao Wang
- Abstract summary: We study audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. We propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework.
- Score: 63.26850431270348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Autoregressive next-token prediction with the Transformer decoder has become a de facto standard in large language models (LLMs), achieving remarkable success in Natural Language Processing (NLP) at scale. Extending this paradigm to audio poses unique challenges due to its inherently continuous nature. We study audio generation with a causal language model (LM) without discrete tokens. We leverage token-wise diffusion to model the continuous distribution of the next continuous-valued token. Our approach delivers significant improvements over the previous discrete-token solution, AudioGen, achieving 20% and 40% relative gains on AudioCaps in Fréchet Audio Distance (FAD) and Kullback-Leibler (KL) divergence, respectively. Additionally, we propose a novel masked next-token prediction task that incorporates masked prediction into the causal LM framework. On AudioCaps, this innovation yields 41% and 33% relative FAD improvements over the AudioGen Base (285M) and AudioGen Large (1B) models, respectively, and is on par with state-of-the-art (SOTA) diffusion models. Furthermore, we achieve these results with significantly fewer parameters: 193M for our Base and 462M for our Large model.
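To make the abstract's two ideas concrete, here is a minimal sketch of a causal Transformer trained with a token-wise diffusion loss on continuous-valued tokens, with random input masking standing in for the masked next-token prediction task. This is not the authors' code: the module sizes, the linear noising schedule, and the exact masking scheme are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class CausalLMWithDiffusionHead(nn.Module):
    def __init__(self, dim=512, depth=8, heads=8, token_dim=128):
        super().__init__()
        self.in_proj = nn.Linear(token_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, depth)
        self.mask_token = nn.Parameter(torch.zeros(token_dim))
        # Small MLP denoising head: conditioned on the LM hidden state, a noised
        # version of the next token, and the diffusion time, it predicts the noise.
        self.denoiser = nn.Sequential(
            nn.Linear(dim + token_dim + 1, dim), nn.GELU(),
            nn.Linear(dim, token_dim),
        )

    def forward(self, tokens, mask_ratio=0.3):
        # tokens: (B, T, token_dim) continuous-valued audio tokens.
        B, T, D = tokens.shape
        # Masked next-token prediction: hide a random subset of *input* tokens
        # behind a learned mask embedding; the regression targets stay clean.
        keep = torch.rand(B, T, device=tokens.device) > mask_ratio
        x = torch.where(keep.unsqueeze(-1), tokens, self.mask_token.expand(B, T, D))
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(tokens.device)
        h = self.backbone(self.in_proj(x), mask=causal)    # (B, T, dim)
        # Token-wise diffusion loss on the *next* token at every position.
        target, cond = tokens[:, 1:], h[:, :-1]
        t = torch.rand(B, T - 1, 1, device=tokens.device)  # diffusion time in [0, 1]
        eps = torch.randn_like(target)
        noisy = (1 - t) * target + t * eps                 # simple linear schedule (an assumption)
        pred_eps = self.denoiser(torch.cat([cond, noisy, t], dim=-1))
        return ((pred_eps - eps) ** 2).mean()

# loss = CausalLMWithDiffusionHead()(torch.randn(2, 250, 128))
```

The departure from discrete-token LMs such as AudioGen is the loss: instead of a cross-entropy over a codebook, a small per-token head regresses the noise injected into the next continuous token, and generation runs the denoiser from noise conditioned on the LM state.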
Related papers
- Next Tokens Denoising for Speech Synthesis [51.320443764269726]
Dragon-FM is a novel text-to-speech (TTS) design that unifies AR and flow-matching. It processes 48 kHz audio tokens in chunks at a compact rate of 12.5 tokens per second. Experiments on podcast datasets demonstrate its capability to efficiently generate high-quality zero-shot podcasts.
arXiv Detail & Related papers (2025-07-30T15:03:36Z)
- Token-based Audio Inpainting via Discrete Diffusion [14.23046540809056]
We introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio.
arXiv Detail & Related papers (2025-07-11T06:25:49Z)
- WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling [63.8735398698683]
A crucial component of language models is the tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. We introduce WavTokenizer, which offers several advantages over previous SOTA acoustic models in the audio domain. WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information.
arXiv Detail & Related papers (2024-08-29T13:43:36Z)
- Efficient Autoregressive Audio Modeling via Next-Scale Prediction [52.663934477127405]
We analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT). Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is proposed, which shifts next-token AR prediction to next-scale AR prediction.
arXiv Detail & Related papers (2024-08-16T21:48:53Z)
- Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [39.32761051774537]
We propose encoding audio as vector sequences in the continuous space $\mathbb{R}^d$ and autoregressively generating these sequences.
High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing.
arXiv Detail & Related papers (2024-06-08T18:57:13Z)
- Masked Audio Generation using a Single Non-Autoregressive Transformer [90.11646612273965]
MAGNeT is a masked generative sequence modeling method that operates directly over several streams of audio tokens.
We demonstrate the efficiency of MAGNeT for the task of text-to-music and text-to-audio generation.
We shed light on the importance of each of the components comprising MAGNeT and point to the trade-offs between autoregressive and non-autoregressive modeling (a schematic masked-decoding loop in this spirit is sketched after this list).
arXiv Detail & Related papers (2024-01-09T14:29:39Z)
- BEATs: Audio Pre-Training with Acoustic Tokenizers [77.8510930885778]
The success of self-supervised learning (SSL) has been witnessed in the language, vision, speech, and audio domains over the past few years.
We propose BEATs, an iterative audio pre-training framework to learn Bidirectional Encoder representations from Audio Transformers.
In the first iteration, we use random projection as the acoustic tokenizer to train an audio SSL model in a mask and label prediction manner.
Then, we train an acoustic tokenizer for the next iteration by distilling the semantic knowledge from the pre-trained or fine-tuned audio SSL model.
arXiv Detail & Related papers (2022-12-18T10:41:55Z)
- Audio-visual speech enhancement with a deep Kalman filter generative model [0.0]
We present an audiovisual deep Kalman filter (AV-DKF) generative model which assumes a first-order Markov chain model for the latent variables.
We develop an efficient inference methodology to estimate speech signals at test time.
arXiv Detail & Related papers (2022-11-02T09:50:08Z)
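Echoing the masked-prediction theme above, the sketch below shows the confidence-based iterative decoding loop characteristic of masked non-autoregressive generators such as MAGNeT. It is a schematic, single-stream simplification: `model` is assumed to map a token sequence to per-position logits over a discrete audio vocabulary, and the cosine unmasking schedule is one common choice, not necessarily MAGNeT's exact one.

```python
import math
import torch

@torch.no_grad()
def masked_decode(model, T, mask_id, steps=10, device="cpu"):
    # Start with every position masked; fill the grid over `steps` rounds.
    tokens = torch.full((1, T), mask_id, dtype=torch.long, device=device)
    for s in range(steps):
        logits = model(tokens)                  # (1, T, vocab_size), hypothetical model
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)              # per-position confidence and argmax token
        masked = tokens.eq(mask_id)
        # Cosine schedule: how many positions should remain masked after this round.
        keep_masked = math.floor(T * math.cos(math.pi / 2 * (s + 1) / steps))
        # Fill every masked position in parallel, then re-mask the least confident.
        tokens = torch.where(masked, pred, tokens)
        if keep_masked > 0:
            conf = conf.masked_fill(~masked, float("inf"))  # never re-mask kept tokens
            remask = conf.topk(keep_masked, largest=False).indices
            tokens.scatter_(1, remask, mask_id)
    return tokens
```

Each round predicts all remaining masked positions in parallel and re-masks the least confident ones, which is the source of the non-autoregressive speedup traded against autoregressive quality.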