Audio Super-Resolution with Latent Bridge Models
- URL: http://arxiv.org/abs/2509.17609v2
- Date: Fri, 26 Sep 2025 05:39:37 GMT
- Title: Audio Super-Resolution with Latent Bridge Models
- Authors: Chang Li, Zehua Chen, Liyuan Wang, Jun Zhu,
- Abstract summary: We present a new system with latent bridge models (LBMs)<n>We compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process.<n>To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs.
- Score: 43.773399487946755
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, while previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-toHR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock the audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, Song-Describer benchmark datasets and two internal testsets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192kHz audio SR. Demo at https://AudioLBM.github.io/.
Related papers
- Inference-time Scaling for Diffusion-based Audio Super-resolution [27.246435209069865]
Diffusion models have demonstrated remarkable success in generative tasks, including audio super-resolution (SR)<n>Here, we propose a different paradigm through inference-time scaling for SR, which explores multiple solution trajectories during the sampling process.<n>By actively guiding the exploration of the high-dimensional solution space through verifier-algorithm combinations, we enable more robust and higher-quality outputs.
arXiv Detail & Related papers (2025-08-04T13:17:49Z) - SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution [55.14432034345353]
We study key design principles for latter cascaded video super-resolution models, which are underexplored currently.<n>First, we propose two strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator.<n>Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies, (2) noise augmentation effects on low-resolution (LR) inputs.
arXiv Detail & Related papers (2025-06-24T17:57:26Z) - FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching [29.12032530972612]
FLowHigh is a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution.<n>The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates.
arXiv Detail & Related papers (2025-01-09T02:30:26Z) - Arbitrary-steps Image Super-resolution via Diffusion Inversion [68.78628844966019]
This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance.<n>We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point.<n>Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result.
arXiv Detail & Related papers (2024-12-12T07:24:13Z) - Wave-U-Mamba: An End-To-End Framework For High-Quality And Efficient Speech Super Resolution [4.495657539150699]
Speech Super-Resolution (SSR) is a task of enhancing low-resolution speech signals by restoring missing high-frequency components.<n>Conventional approaches typically reconstruct log-mel features, followed by a vocoder that generates high-resolution speech in the waveform domain.<n>We propose a method, referred to as Wave-U-Mamba, that directly performs SSR in time domain.
arXiv Detail & Related papers (2024-09-14T06:52:00Z) - SinSR: Diffusion-Based Image Super-Resolution in a Single Step [119.18813219518042]
Super-resolution (SR) methods based on diffusion models exhibit promising results.
But their practical application is hindered by the substantial number of required inference steps.
We propose a simple yet effective method for achieving single-step SR generation, named SinSR.
arXiv Detail & Related papers (2023-11-23T16:21:29Z) - ProDiff: Progressive Fast Diffusion Model For High-Quality
Text-to-Speech [63.780196620966905]
We propose ProDiff, on progressive fast diffusion model for high-quality text-to-speech.
ProDiff parameterizes the denoising model by directly predicting clean data to avoid distinct quality degradation in accelerating sampling.
Our evaluation demonstrates that ProDiff needs only 2 iterations to synthesize high-fidelity mel-spectrograms.
ProDiff enables a sampling speed of 24x faster than real-time on a single NVIDIA 2080Ti GPU.
arXiv Detail & Related papers (2022-07-13T17:45:43Z) - Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolution and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z) - RAVE: A variational autoencoder for fast and high-quality neural audio
synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.