Efficient Autoregressive Audio Modeling via Next-Scale Prediction
- URL: http://arxiv.org/abs/2408.09027v1
- Date: Fri, 16 Aug 2024 21:48:53 GMT
- Title: Efficient Autoregressive Audio Modeling via Next-Scale Prediction
- Authors: Kai Qiu, Xiang Li, Hao Chen, Jie Sun, Jinglu Wang, Zhe Lin, Marios Savvides, Bhiksha Raj
- Abstract summary: We analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT). Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is proposed, which shifts next-token AR prediction to next-scale AR prediction.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio generation has achieved remarkable progress with the advance of sophisticated generative models, such as diffusion models (DMs) and autoregressive (AR) models. However, due to the naturally significant sequence length of audio, the efficiency of audio generation remains an essential issue to be addressed, especially for AR models that are incorporated in large language models (LLMs). In this paper, we analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT) with improved residual quantization. Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is further proposed, which shifts next-token AR prediction to next-scale AR prediction, significantly reducing the training cost and inference time. To validate the effectiveness of the proposed approach, we comprehensively analyze design choices and demonstrate that the proposed AAR framework achieves a remarkable 35× faster inference speed and +1.33 Fréchet Audio Distance (FAD) against baselines on the AudioSet benchmark. Code: https://github.com/qiuk2/AAR.
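The core mechanism is easiest to see in code. Below is a minimal sketch, reconstructed from the abstract alone, of scale-level residual quantization: the latent sequence is quantized coarse-to-fine, with each scale encoding the residual that earlier scales missed, so an AR model can emit all tokens of the next scale in one step. The scale sizes, codebook, and interpolation scheme here are illustrative assumptions, not SAT's actual configuration (see the repository linked above for the real implementation).

```python
# Hedged sketch of scale-level residual tokenization; all sizes are toy values.
import torch
import torch.nn.functional as F

def quantize(z, codebook):
    """Nearest-neighbor lookup: map each D-dim vector in z to its closest code."""
    dists = torch.cdist(z, codebook)          # (T, K) pairwise L2 distances
    idx = dists.argmin(dim=-1)                # (T,) code index per vector
    return codebook[idx], idx

def scale_level_tokenize(z, codebook, scales=(1, 4, 16, 64)):
    """Tokenize a latent sequence coarse-to-fine: at each scale, downsample the
    residual, quantize it, upsample back, and subtract, so finer scales only
    encode what coarser scales missed."""
    T, _ = z.shape
    residual, tokens = z, []
    for s in scales:
        # Coarse view of the current residual: s frames instead of T.
        coarse = F.interpolate(residual.T.unsqueeze(0), size=s,
                               mode="linear", align_corners=False)[0].T
        q, idx = quantize(coarse, codebook)
        tokens.append(idx)
        # Upsample the quantized approximation to T frames and subtract it.
        up = F.interpolate(q.T.unsqueeze(0), size=T,
                           mode="linear", align_corners=False)[0].T
        residual = residual - up
    return tokens  # one index tensor per scale; an AR model predicts scale
                   # k+1 from scales 1..k in a single step, in parallel

z = torch.randn(256, 32)            # toy latent: 256 frames, 32 channels
codebook = torch.randn(512, 32)     # toy codebook with 512 codes
tokens = scale_level_tokenize(z, codebook)
print([t.shape for t in tokens])    # scale sizes 1, 4, 16, 64
```

Because each autoregressive step emits an entire scale rather than a single token, the number of sequential decoding steps drops from the token count to the number of scales, which is the source of the efficiency gain the abstract reports.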
Related papers
- Tailored Design of Audio-Visual Speech Recognition Models using Branchformers
We propose a novel framework for the design of parameter-efficient Audio-Visual Speech Recognition systems.
To be more precise, the proposed framework consists of two steps: first, estimating audio- and video-only systems, and then designing a tailored audio-visual unified encoder.
Results reflect how our tailored AVSR system is able to reach state-of-the-art recognition rates.
arXiv Detail & Related papers (2024-07-09T07:15:56Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- An Experimental Study: Assessing the Combined Framework of WavLM and BEST-RQ for Text-to-Speech Synthesis
We propose a new model architecture specifically suited for text-to-speech (TTS) models.
We combine WavLM, a pre-trained self-supervised learning (SSL) speech model, and the BEST-RQ vector quantization framework.
Experiments on the LibriSpeech dataset with SUPERB benchmarking show that the proposed model significantly underperforms.
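For context, the BEST-RQ component named above uses a random-projection quantizer (Chiu et al., 2022): a frozen random projection and a frozen random codebook turn speech frames into discrete pseudo-labels for masked prediction. The sketch below illustrates that idea; the dimensions and codebook size are illustrative assumptions, not this paper's settings.

```python
# Hedged sketch of a BEST-RQ-style random-projection quantizer; nothing here
# is trained, and all sizes are illustrative.
import torch
import torch.nn.functional as F

class RandomProjectionQuantizer:
    """Frozen random projection plus frozen random codebook: features are
    projected and matched to the nearest code, and the resulting indices
    serve as self-supervised prediction targets."""
    def __init__(self, feat_dim=80, code_dim=16, num_codes=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(feat_dim, code_dim, generator=g)
        codebook = torch.randn(num_codes, code_dim, generator=g)
        self.codebook = F.normalize(codebook, dim=-1)

    def __call__(self, feats):
        # feats: (T, feat_dim), e.g. log-mel frames
        z = F.normalize(feats @ self.proj, dim=-1)       # (T, code_dim)
        return torch.cdist(z, self.codebook).argmin(dim=-1)  # (T,) labels

quantizer = RandomProjectionQuantizer()
labels = quantizer(torch.randn(100, 80))  # pseudo-labels for masked prediction
print(labels.shape)                       # torch.Size([100])
```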
arXiv Detail & Related papers (2023-12-08T23:59:25Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
The non-autoregressive framework enhances controllability, and the duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Bring the Noise: Introducing Noise Robustness to Pretrained Automatic Speech Recognition
We propose a novel method to extract denoising capabilities that can be applied to any encoder-decoder architecture.
We train our pre-processor on the Noisy Speech Database (NSD) to reconstruct denoised spectrograms from noisy inputs.
We show that the Cleancoder is able to filter noise from speech and that it improves the total Word Error Rate (WER) of the downstream model in noisy conditions.
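As a rough illustration of the training setup described above, the sketch below fits a small pre-processor to map noisy spectrograms to clean ones; the toy convolutional architecture and L1 loss are placeholder assumptions, not the actual Cleancoder.

```python
# Hedged sketch of training a denoising pre-processor on (noisy, clean)
# spectrogram pairs; shapes and architecture are illustrative.
import torch
import torch.nn as nn

preprocessor = nn.Sequential(        # toy conv stack over (B, 1, freq, time)
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 1, 3, padding=1),
)
opt = torch.optim.Adam(preprocessor.parameters(), lr=1e-4)

noisy = torch.randn(8, 1, 80, 200)   # stand-in for NSD noisy spectrograms
clean = torch.randn(8, 1, 80, 200)   # stand-in for the clean references

denoised = preprocessor(noisy)
loss = nn.functional.l1_loss(denoised, clean)  # reconstruct the clean target
opt.zero_grad(); loss.backward(); opt.step()
# At inference, the frozen pre-processor sits in front of the downstream ASR
# model and feeds it denoised spectrograms.
```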
arXiv Detail & Related papers (2023-09-05T11:34:21Z)
- Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z)
- Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model
We propose DiffuSE, a diffusion probabilistic model-based speech enhancement model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
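To make the setup concrete, the sketch below shows one conditional DDPM-style training step for speech enhancement: a network learns to predict the noise injected into clean speech, conditioned on the noisy observation. The noise schedule, toy MLP, and signal length are illustrative assumptions standing in for DiffuSE's actual backbone.

```python
# Hedged sketch of one conditional diffusion training step; all values are toy.
import torch
import torch.nn as nn

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)      # cumulative noise schedule

L = 4000                                           # toy waveform length
eps_model = nn.Sequential(nn.Linear(2 * L, 1024), nn.ReLU(), nn.Linear(1024, L))

clean = torch.randn(4, L)                          # stand-in clean waveforms
noisy = torch.randn(4, L)                          # stand-in noisy observations
t = torch.randint(0, T_STEPS, (4,))                # random diffusion timestep
eps = torch.randn_like(clean)
a = alpha_bar[t].unsqueeze(-1)
x_t = a.sqrt() * clean + (1 - a).sqrt() * eps      # diffuse the clean signal
pred = eps_model(torch.cat([x_t, noisy], dim=-1))  # condition on noisy speech
loss = nn.functional.mse_loss(pred, eps)           # epsilon-prediction loss
loss.backward()
```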
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Efficiently Fusing Pretrained Acoustic and Linguistic Encoders for Low-resource Speech Recognition
In this work, we fuse a pre-trained acoustic encoder (wav2vec2.0) and a pre-trained linguistic encoder (BERT) into an end-to-end ASR model.
Our model achieves better recognition performance on the CALLHOME corpus (15 hours) than other end-to-end models.
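A minimal sketch of this kind of fusion, assuming a simple linear adapter between the two pre-trained encoders (the paper's actual fusion mechanism may differ):

```python
# Hedged sketch: acoustic features from wav2vec2.0 are projected into BERT's
# embedding space and passed through the linguistic encoder.
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model, BertModel

acoustic = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
linguistic = BertModel.from_pretrained("bert-base-uncased")
adapter = nn.Linear(acoustic.config.hidden_size, linguistic.config.hidden_size)

wave = torch.randn(1, 16000)                  # one second of 16 kHz audio
frames = acoustic(wave).last_hidden_state     # (1, T, 768) acoustic features
fused = linguistic(inputs_embeds=adapter(frames)).last_hidden_state
# `fused` would feed a decoder head trained on the low-resource corpus; both
# encoders can then be fine-tuned end-to-end.
print(fused.shape)
```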
arXiv Detail & Related papers (2021-01-17T16:12:44Z)
- Hypothesis Stitcher for End-to-End Speaker-attributed ASR on Long-form Multi-talker Recordings
An end-to-end (E2E) speaker-attributed automatic speech recognition (SA-ASR) model was proposed recently to jointly perform speaker counting, speech recognition and speaker identification.
The model achieved a low speaker-attributed word error rate (SA-WER) for monaural overlapped speech comprising an unknown number of speakers.
It has yet to be investigated whether the E2E SA-ASR model works well for recordings that are much longer than samples seen during training.
arXiv Detail & Related papers (2021-01-06T03:36:09Z)