GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis
- URL: http://arxiv.org/abs/2511.22293v1
- Date: Thu, 27 Nov 2025 10:18:56 GMT
- Title: GLA-Grad++: An Improved Griffin-Lim Guided Diffusion Model for Speech Synthesis
- Authors: Teysir Baoueb, Xiaoyu Bie, Mathieu Fontaine, Gaël Richard
- Abstract summary: We propose a phase-aware extension to the WaveGrad vocoder to reduce inconsistencies between generated signals and the conditioning mel spectrogram. We compute the correction term only once, with a single application of GLA, to accelerate the generation process.
- Score: 26.232361901331927
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in diffusion models have positioned them as powerful generative frameworks for speech synthesis, demonstrating substantial improvements in audio quality and stability. Nevertheless, their effectiveness in vocoders conditioned on mel spectrograms remains constrained, particularly when the conditioning diverges from the training distribution. The recently proposed GLA-Grad model introduced a phase-aware extension to the WaveGrad vocoder that integrated the Griffin-Lim algorithm (GLA) into the reverse process to reduce inconsistencies between generated signals and the conditioning mel spectrogram. In this paper, we further improve GLA-Grad through a change in how the correction is applied: we compute the correction term only once, with a single application of GLA, which accelerates the generation process. Experimental results demonstrate that our method consistently outperforms the baseline models, particularly in out-of-domain scenarios.
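For intuition, here is a minimal sketch of the single-shot correction idea, assuming torchaudio's `InverseMelScale` and `GriffinLim` transforms for the mel-to-waveform projection. The vocoder `model`, the `scheduler`, the STFT/mel settings, and the `guidance_weight` blending are hypothetical stand-ins, not the authors' implementation or their exact correction term.

```python
# Sketch: single-shot Griffin-Lim correction in a WaveGrad-style reverse process.
# `model` and `scheduler` are hypothetical stand-ins, not the paper's code.
import torch
import torchaudio

N_FFT, HOP, N_MELS, SR = 1024, 256, 80, 22050  # assumed STFT/mel settings

inv_mel = torchaudio.transforms.InverseMelScale(
    n_stft=N_FFT // 2 + 1, n_mels=N_MELS, sample_rate=SR
)
gla = torchaudio.transforms.GriffinLim(n_fft=N_FFT, hop_length=HOP, n_iter=32)


def gla_reference(mel: torch.Tensor) -> torch.Tensor:
    """Project the conditioning mel spectrogram to a waveform with a single
    application of the Griffin-Lim algorithm (GLA)."""
    mag = inv_mel(mel)  # approximate linear-frequency magnitudes from mel
    return gla(mag)     # waveform whose spectrogram magnitude matches `mag`


def reverse_process(model, scheduler, mel, guidance_weight=0.1):
    # GLA-Grad recomputed a GLA correction at every reverse step; the ++ variant
    # computes the GLA-based reference once, before the loop, and reuses it.
    x_ref = gla_reference(mel)
    x = torch.randn_like(x_ref)  # start the reverse process from Gaussian noise
    for t in scheduler.timesteps:
        x = scheduler.step(model(x, mel, t), t, x)  # one denoising update
        x = x + guidance_weight * (x_ref - x)       # nudge toward GLA reference
    return x
```

With `guidance_weight=0` this reduces to a plain WaveGrad loop; the single up-front `gla_reference` call is what removes the per-step GLA cost of the original GLA-Grad.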
Related papers
- Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation [45.717539734334906]
Inference-time scaling offers a versatile paradigm for aligning visual generative models with downstream objectives without parameter updates. We show that existing approaches that optimize the high-dimensional initial noise suffer from severe inefficiency, as many search directions exert negligible influence on the final generation. We propose Spectral Evolution Search (SES), a plug-and-play framework for initial noise optimization that executes gradient-free evolutionary search within a low-frequency subspace.
arXiv Detail & Related papers (2026-02-03T07:19:39Z) - Guiding Visual Autoregressive Models through Spectrum Weakening [44.26047250249648]
We propose a spectrum-weakening framework for visual autoregressive (AR) models. It achieves this by constructing a controllable weak model in the spectral domain. Our method enables high-quality unconditional generation while maintaining strong prompt alignment for conditional generation.
arXiv Detail & Related papers (2025-11-28T08:52:50Z) - Quantum Reinforcement Learning-Guided Diffusion Model for Image Synthesis via Hybrid Quantum-Classical Generative Model Architectures [2.005299372367689]
We introduce a quantum reinforcement learning (QRL) controller that dynamically adjusts CFG at each denoising step. The controller adopts a hybrid quantum-classical actor-critic architecture. Experiments on CIFAR-10 demonstrate that our QRL policy improves perceptual quality.
arXiv Detail & Related papers (2025-09-17T16:47:04Z) - Inference-Time Alignment Control for Diffusion Models with Reinforcement Learning Guidance [46.06527859746679]
We introduce Reinforcement Learning Guidance (RLG), an inference-time method that adapts Classifier-Free Guidance (CFG); a generic per-step CFG sketch appears after this list. RLG consistently improves the performance of RL fine-tuned models across various architectures, RL algorithms, and downstream tasks, including human preferences, compositional control, compressibility, and text rendering. Our approach provides a practical and theoretically sound solution for enhancing and controlling diffusion model alignment at inference.
arXiv Detail & Related papers (2025-08-28T17:18:31Z) - Inference-Time Scaling of Diffusion Language Models with Particle Gibbs Sampling [70.8832906871441]
We study how to steer generation toward desired rewards without retraining the models. Prior methods typically resample or filter within a single denoising trajectory, optimizing rewards step-by-step without trajectory-level refinement. We introduce particle Gibbs sampling for diffusion language models (PG-DLM), a novel inference-time algorithm enabling trajectory-level refinement while preserving generation perplexity.
arXiv Detail & Related papers (2025-07-11T08:00:47Z) - Gating is Weighting: Understanding Gated Linear Attention through In-context Learning [48.90556054777393]
Gated Linear Attention (GLA) architectures include competitive models such as Mamba and RWKV. We show that a multilayer GLA can implement a general class of Weighted Preconditioned Gradient Descent (WPGD) algorithms. Under mild conditions, we establish the existence and uniqueness (up to scaling) of a global minimum, corresponding to a unique WPGD solution.
arXiv Detail & Related papers (2025-04-06T00:37:36Z) - WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching [1.6385815610837167]
WaveFM is a flow matching model for mel-spectrogram conditioned speech synthesis. Our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders.
arXiv Detail & Related papers (2025-03-20T20:17:17Z) - Reward-Guided Iterative Refinement in Diffusion Models at Test-Time with Applications to Protein and DNA Design [87.58981407469977]
We propose a novel framework for inference-time reward optimization with diffusion models inspired by evolutionary algorithms. Our approach employs an iterative refinement process consisting of two steps in each iteration: noising and reward-guided denoising.
arXiv Detail & Related papers (2025-02-20T17:48:45Z) - GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model [0.0]
We propose GLA-Grad, which introduces a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process.
We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
arXiv Detail & Related papers (2024-02-09T12:12:52Z) - Butterfly Effects of SGD Noise: Error Amplification in Behavior Cloning and Autoregression [70.78523583702209]
We study training instabilities of behavior cloning with deep neural networks.
We observe that minibatch SGD updates to the policy network during training result in sharp oscillations in long-horizon rewards.
arXiv Detail & Related papers (2023-10-17T17:39:40Z) - CATfOOD: Counterfactual Augmented Training for Improving Out-of-Domain Performance and Calibration [59.48235003469116]
We show that data augmentation consistently enhances OOD performance.
We also show that CF-augmented models that are easier to calibrate also exhibit much lower entropy when assigning importance.
arXiv Detail & Related papers (2023-09-14T16:16:40Z) - End-to-End Diffusion Latent Optimization Improves Classifier Guidance [81.27364542975235]
Direct Optimization of Diffusion Latents (DOODL) is a novel guidance method.
It enables plug-and-play guidance by optimizing diffusion latents.
It outperforms one-step classifier guidance on computational and human evaluation metrics.
arXiv Detail & Related papers (2023-03-23T22:43:52Z) - A weighted-variance variational autoencoder model for speech enhancement [0.0]
We propose a weighted variance generative model, where the contribution of each spectrogram time-frame in parameter learning is weighted.
We develop efficient training and speech enhancement algorithms based on the proposed generative model.
arXiv Detail & Related papers (2022-11-02T09:51:15Z)
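Two of the entries above (the QRL controller and RLG) center on adapting the classifier-free guidance scale at each denoising step rather than fixing it. The sketch below shows where such a per-step scale enters the standard CFG combination; `model` and `scale_policy` are hypothetical illustrations, not any of the papers' APIs.

```python
# Sketch: classifier-free guidance (CFG) with a per-step scale; names hypothetical.
import torch


def cfg_denoise_step(model, x_t, t, cond, scale_policy):
    """One denoising step where the CFG scale is chosen per step by a policy
    (e.g., an RL controller) instead of being a fixed constant."""
    eps_cond = model(x_t, t, cond)    # conditional noise prediction
    eps_uncond = model(x_t, t, None)  # unconditional noise prediction
    w = scale_policy(x_t, t)          # per-step guidance scale
    # Standard CFG: extrapolate the conditional prediction away from the
    # unconditional one; a constant w recovers ordinary CFG.
    return eps_uncond + w * (eps_cond - eps_uncond)
```

The papers differ mainly in how the policy choosing `w` is obtained: a hybrid quantum-classical actor-critic in the QRL work, and reinforcement-learning-derived guidance in RLG.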