Inference-time Scaling for Diffusion-based Audio Super-resolution
- URL: http://arxiv.org/abs/2508.02391v1
- Date: Mon, 04 Aug 2025 13:17:49 GMT
- Title: Inference-time Scaling for Diffusion-based Audio Super-resolution
- Authors: Yizhu Jin, Zhen Ye, Zeyue Tian, Haohe Liu, Qiuqiang Kong, Yike Guo, Wei Xue
- Abstract summary: Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). Here, we propose a different paradigm through inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. By actively guiding the exploration of the high-dimensional solution space through verifier-algorithm combinations, we enable more robust and higher-quality outputs.
- Score: 27.246435209069865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR). In many applications like movie post-production and album mastering, substantial computational budgets are available for achieving superior audio quality. However, while existing diffusion approaches typically increase sampling steps to improve quality, the performance remains fundamentally limited by the stochastic nature of the sampling process, leading to high-variance and quality-limited outputs. Here, rather than simply increasing the number of sampling steps, we propose a different paradigm through inference-time scaling for SR, which explores multiple solution trajectories during the sampling process. Different task-specific verifiers are developed, and two search algorithms, random search and zero-order search, are introduced for SR. By actively guiding the exploration of the high-dimensional solution space through verifier-algorithm combinations, we enable more robust and higher-quality outputs. Through extensive validation across diverse audio domains (speech, music, sound effects) and frequency ranges, we demonstrate consistent performance gains, achieving improvements of up to 9.70% in aesthetics, 5.88% in speaker similarity, 15.20% in word error rate, and 46.98% in spectral distance for speech SR from 4kHz to 24kHz, showcasing the effectiveness of our approach. Audio samples are available at: https://racerk.github.io/tt-scale-audiosr/.
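The abstract describes the search procedures only at a high level. As a rough illustration, the sketch below shows how verifier-guided random search and zero-order search over initial sampling noise could be organized; the callables `sample_sr` and `verifier_score`, and all hyperparameter values, are hypothetical placeholders and not the authors' implementation.

```python
# Minimal sketch of verifier-guided inference-time scaling for diffusion SR.
# Assumptions: `sample_sr(lowres, noise)` runs a full diffusion sampler from
# the given initial noise, and `verifier_score(candidate, lowres)` returns a
# higher-is-better task-specific score (e.g. aesthetics or ASR-based metrics).
import torch


def random_search(lowres, sample_sr, verifier_score, n_candidates=8):
    """Sample several independent trajectories and keep the best-scoring one."""
    best, best_score = None, float("-inf")
    for _ in range(n_candidates):
        noise = torch.randn_like(lowres)           # independent trajectory seed
        candidate = sample_sr(lowres, noise)       # full diffusion sampling run
        score = verifier_score(candidate, lowres)  # task-specific verifier
        if score > best_score:
            best, best_score = candidate, score
    return best


def zero_order_search(lowres, sample_sr, verifier_score,
                      n_rounds=4, n_neighbors=4, step=0.1):
    """Perturb the current best noise locally and move to improving neighbors."""
    pivot = torch.randn_like(lowres)
    best = sample_sr(lowres, pivot)
    best_score = verifier_score(best, lowres)
    for _ in range(n_rounds):
        for _ in range(n_neighbors):
            neighbor = pivot + step * torch.randn_like(pivot)
            candidate = sample_sr(lowres, neighbor)
            score = verifier_score(candidate, lowres)
            if score > best_score:
                pivot, best, best_score = neighbor, candidate, score
    return best
```

Both routines trade extra compute for lower-variance, higher-scoring outputs: random search explores the noise space globally, while zero-order search refines around the current best candidate without requiring gradients of the verifier.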
Related papers
- Token-based Audio Inpainting via Discrete Diffusion [14.23046540809056]
We introduce a novel inpainting method based on discrete diffusion modeling, which operates over tokenized audio representations. Our approach models the generative process directly in the discrete latent space, enabling stable and semantically coherent reconstruction of missing audio.
arXiv Detail & Related papers (2025-07-11T06:25:49Z) - Learning to Upsample and Upmix Audio in the Latent Domain [13.82572699087732]
Neural audio autoencoders create compact latent representations that preserve perceptually important information. We propose a framework that performs audio processing operations entirely within an autoencoder's latent space. We demonstrate computational efficiency gains of up to 100x while maintaining quality comparable to post-processing on raw audio.
arXiv Detail & Related papers (2025-05-31T19:27:22Z) - Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization [83.65278205301576]
We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots. This is achieved through an optimization consistency training protocol, which minimizes the difference among samples. Experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency.
arXiv Detail & Related papers (2025-02-05T07:13:43Z) - FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching [29.12032530972612]
FLowHigh is a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates.
arXiv Detail & Related papers (2025-01-09T02:30:26Z) - PTQ4ADM: Post-Training Quantization for Efficient Text Conditional Audio Diffusion Models [8.99127212785609]
This work introduces PTQ4ADM, a novel framework for quantizing audio diffusion models (ADMs).
Our key contributions include (1) a coverage-driven prompt augmentation method and (2) an activation-aware calibration set generation algorithm for text-conditional ADMs.
Extensive experiments demonstrate PTQ4ADM's capability to reduce the model size by up to 70% while achieving synthesis quality metrics comparable to full-precision models.
arXiv Detail & Related papers (2024-09-20T20:52:56Z) - Autoregressive Diffusion Transformer for Text-to-Speech Synthesis [39.32761051774537]
We propose encoding audio as vector sequences in continuous space $\mathbb{R}^d$ and autoregressively generating these sequences.
High-bitrate continuous speech representation enables almost flawless reconstruction, allowing our model to achieve nearly perfect speech editing.
arXiv Detail & Related papers (2024-06-08T18:57:13Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video. We propose Frieren, a V2A model based on rectified flow matching. Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to the groundtruth versions.
arXiv Detail & Related papers (2023-08-14T17:29:41Z) - ProDiff: Progressive Fast Diffusion Model For High-Quality Text-to-Speech [63.780196620966905]
We propose ProDiff, a progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data, avoiding the pronounced quality degradation that comes with accelerated sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff achieves a sampling speed 24x faster than real-time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z) - NU-GAN: High resolution neural upsampling with GAN [60.02736450639215]
NU-GAN is a new method for resampling audio from lower to higher sampling rates (upsampling).
Such applications use audio at a resolution of 44.1 kHz or 48 kHz, whereas current speech synthesis methods are equipped to handle a maximum of 24 kHz resolution.
ABX preference tests indicate that our NU-GAN resampler produces 44.1 kHz audio upsampled from 22 kHz that listeners distinguish from the original only 7.4% above random chance on a single-speaker dataset, and 10.8% above chance on a multi-speaker dataset.
arXiv Detail & Related papers (2020-10-22T01:00:23Z)