Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding
- URL: http://arxiv.org/abs/2601.14304v1
- Date: Sun, 18 Jan 2026 07:26:23 GMT
- Title: Guided by the Plan: Enhancing Faithful Autoregressive Text-to-Audio Generation with Guided Decoding
- Authors: Juncheng Wang, Zhe Hu, Chao Xu, Siyue Ren, Yuxiang Feng, Yang Liu, Baigui Sun, Shujun Wang
- Abstract summary: Plan-Critic is a lightweight auxiliary model trained with a GAE-inspired objective to predict final instruction-following quality from partial generations. Plan-Critic achieves up to a 10-point improvement in CLAP score over the AR baseline. This work bridges the gap between causal generation and global semantic alignment, demonstrating that even strictly autoregressive models can plan ahead.
- Score: 25.824708878012753
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Autoregressive (AR) models excel at generating temporally coherent audio by producing tokens sequentially, yet they often falter in faithfully following complex textual prompts, especially those describing complex sound events. We uncover a surprising capability in AR audio generators: their early prefix tokens implicitly encode global semantic attributes of the final output, such as event count and sound-object category, revealing a form of implicit planning. Building on this insight, we propose Plan-Critic, a lightweight auxiliary model trained with a Generalized Advantage Estimation (GAE)-inspired objective to predict final instruction-following quality from partial generations. At inference time, Plan-Critic enables guided exploration: it evaluates candidate prefixes early, prunes low-fidelity trajectories, and reallocates computation to high-potential planning seeds. Our Plan-Critic-guided sampling achieves up to a 10-point improvement in CLAP score over the AR baseline, establishing a new state of the art in AR text-to-audio generation, while maintaining computational parity with standard best-of-N decoding. This work bridges the gap between causal generation and global semantic alignment, demonstrating that even strictly autoregressive models can plan ahead.
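A minimal sketch of the guided-decoding idea described in the abstract, assuming hypothetical `sample_fn` and `critic_fn` callables; the paper's actual model interfaces, prefix length, and scoring details are not specified here:

```python
def plan_critic_guided_decode(sample_fn, critic_fn, prompt,
                              num_seeds=8, keep=2,
                              prefix_len=64, total_len=1024):
    """Hedged sketch of Plan-Critic-guided sampling.

    sample_fn(prompt, prefix, max_new_tokens) -> token list   (hypothetical)
    critic_fn(prompt, prefix) -> predicted final quality score (hypothetical)
    """
    # 1) Sample several short prefixes ("planning seeds") from the AR model.
    seeds = [sample_fn(prompt, prefix=[], max_new_tokens=prefix_len)
             for _ in range(num_seeds)]

    # 2) Score each partial generation with the Plan-Critic value model,
    #    which predicts final instruction-following quality (e.g. CLAP).
    scored = sorted(((critic_fn(prompt, s), s) for s in seeds),
                    key=lambda x: x[0], reverse=True)

    # 3) Prune low-fidelity seeds; reallocate compute by completing only
    #    the top-`keep` prefixes to full length.
    completions = []
    for score, seed in scored[:keep]:
        tokens = seed + sample_fn(prompt, prefix=seed,
                                  max_new_tokens=total_len - prefix_len)
        completions.append((score, tokens))

    # Return the completion whose prefix the critic rated highest.
    return max(completions, key=lambda x: x[0])[1]
```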
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
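A toy illustration of how such a structure-texture importance score could be computed for one scale's token map; the specific high-pass filter (local-mean subtraction) and the SVD-based PCA energy term are assumptions for illustration, not the authors' exact formulation:

```python
import numpy as np
from scipy.ndimage import uniform_filter

def structure_texture_keep(features, keep_ratio=0.5, n_components=8):
    """Toy token-importance score for pruning (illustrative assumptions).

    features: (H, W, C) token feature map at one autoregressive scale.
    Texture  : high-pass energy (feature minus local mean) per token.
    Structure: energy of each token's projection onto the top principal
               components of the scale's features.
    """
    H, W, C = features.shape
    flat = features.reshape(-1, C)

    # High-pass step: subtract a local (3x3) average from each channel.
    lowpass = uniform_filter(features, size=(3, 3, 1))
    texture = np.abs(features - lowpass).mean(axis=-1).reshape(-1)

    # PCA via SVD: how strongly each token loads on the top components.
    centered = flat - flat.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:n_components].T
    structure = (proj ** 2).sum(axis=-1)

    # Combine the normalized scores and keep the highest-ranked tokens.
    def norm(x):
        return (x - x.min()) / (np.ptp(x) + 1e-8)

    score = 0.5 * norm(texture) + 0.5 * norm(structure)
    return np.argsort(-score)[: int(keep_ratio * H * W)]  # kept token indices
```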
arXiv Detail & Related papers (2026-03-02T11:35:05Z) - HiGFA: Hierarchical Guidance for Fine-grained Data Augmentation with Diffusion Models [82.10385962490051]
Generative diffusion models show promise for data augmentation, but applying them to fine-grained tasks presents a significant challenge. HiGFA is a hierarchical, confidence-driven orchestration that generates diverse yet faithful synthetic images.
arXiv Detail & Related papers (2025-11-16T10:46:16Z) - Speech Recognition Model Improves Text-to-Speech Synthesis using Fine-Grained Reward [4.375679183191095]
We introduce Word-level TTS Alignment by ASR-driven Attentive Reward (W3AR). W3AR uses attention from a pre-trained ASR model to drive finer-grained alignment and optimization of sequences predicted by a TTS model. Experiments show that W3AR improves the quality of existing TTS systems and strengthens zero-shot robustness on unseen speakers.
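As a rough illustration of how ASR attention might be turned into a word-level signal, the toy function below scores each word by the sharpness (low entropy) of its attention over audio frames; the attention source, span definition, and reward form are assumptions, not W3AR's actual reward:

```python
import torch

def attention_sharpness_rewards(asr_attn, word_spans):
    """Toy word-level reward from ASR cross-attention (assumed formulation).

    asr_attn  : (num_text_tokens, num_audio_frames) attention of a
                pre-trained ASR model decoding the synthesized audio;
                each row is assumed to sum to 1.
    word_spans: list of (start, end) text-token index ranges, one per word.
    """
    rewards = []
    for start, end in word_spans:
        probs = asr_attn[start:end]                             # (tokens, frames)
        entropy = -(probs * (probs + 1e-8).log()).sum(dim=-1)   # per-token entropy
        rewards.append(-entropy.mean())  # sharper attention -> higher reward
    return torch.stack(rewards)
```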
arXiv Detail & Related papers (2025-11-12T17:30:13Z) - Quantize More, Lose Less: Autoregressive Generation from Residually Quantized Speech Representations [26.938560887095658]
Existing autoregressive approaches often rely on single-codebook representations, which suffer from significant information loss. We propose QTTS, a novel TTS framework built upon QDAC, our new audio codec. Our experiments demonstrate that the proposed framework achieves higher synthesis quality and better preserves expressive content compared to baselines.
arXiv Detail & Related papers (2025-07-16T12:47:09Z) - CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment [76.32508013503653]
We propose CAV-MAE Sync as a simple yet effective extension of the original CAV-MAE framework for self-supervised audio-visual learning. We tackle the mismatch between modalities by treating audio as a temporal sequence aligned with video frames, rather than using global representations. We improve spatial localization by introducing learnable register tokens that reduce semantic load on patch tokens.
arXiv Detail & Related papers (2025-05-02T12:59:58Z) - Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis [64.12708207721276]
We introduce a novel pseudo-autoregressive (PAR) language modeling approach that unifies AR and NAR modeling. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data.
arXiv Detail & Related papers (2025-04-14T16:03:21Z) - Efficient Autoregressive Audio Modeling via Next-Scale Prediction [52.663934477127405]
We analyze the token length of audio tokenization and propose a novel Scale-level Audio Tokenizer (SAT). Based on SAT, a scale-level Acoustic AutoRegressive (AAR) modeling framework is proposed, which shifts the next-token AR prediction to next-scale AR prediction.
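A minimal sketch of the next-scale prediction loop implied by the summary, with a hypothetical `model` call signature; the real AAR architecture, scale schedule, and conditioning are not reproduced here:

```python
import torch

def next_scale_generate(model, condition, scale_lengths=(1, 4, 16, 64, 256)):
    """Toy next-scale autoregressive generation (hypothetical interface).

    Instead of emitting one token per step, each step predicts the whole
    token map of the next (finer) scale, conditioned on all coarser scales.
    """
    scales = []  # generated token maps, coarse -> fine
    for length in scale_lengths:
        # One forward pass yields logits for every token at this scale.
        logits = model(condition=condition, prev_scales=scales,
                       target_len=length)              # (length, vocab_size)
        tokens = torch.distributions.Categorical(logits=logits).sample()
        scales.append(tokens)
    return scales  # passed to the SAT decoder to reconstruct the waveform
```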
arXiv Detail & Related papers (2024-08-16T21:48:53Z) - VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z) - A-JEPA: Joint-Embedding Predictive Architecture Can Listen [35.308323314848735]
We introduce Audio-based Joint-Embedding Predictive Architecture (A-JEPA), a simple extension method for self-supervised learning from the audio spectrum.
A-JEPA encodes visible audio spectrogram patches with a curriculum masking strategy via context encoder, and predicts the representations of regions sampled at well-designed locations.
arXiv Detail & Related papers (2023-11-27T13:53:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.