FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching
- URL: http://arxiv.org/abs/2501.04926v1
- Date: Thu, 09 Jan 2025 02:30:26 GMT
- Title: FLowHigh: Towards Efficient and High-Quality Audio Super-Resolution with Single-Step Flow Matching
- Authors: Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee
- Abstract summary: FLowHigh is a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution.
The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates.
- Score: 29.12032530972612
- License:
- Abstract: Audio super-resolution is challenging owing to its ill-posed nature. Recently, the application of diffusion models in audio super-resolution has shown promising results in alleviating this challenge. However, diffusion-based models have limitations, primarily the necessity for numerous sampling steps, which causes significantly increased latency when synthesizing high-quality audio samples. In this paper, we propose FLowHigh, a novel approach that integrates flow matching, a highly efficient generative model, into audio super-resolution. We also explore probability paths specially tailored for audio super-resolution, which effectively capture high-resolution audio distributions, thereby enhancing reconstruction quality. The proposed method generates high-fidelity, high-resolution audio through a single-step sampling process across various input sampling rates. The experimental results on the VCTK benchmark dataset demonstrate that FLowHigh achieves state-of-the-art performance in audio super-resolution, as evaluated by log-spectral distance and ViSQOL, while maintaining computational efficiency with only a single-step sampling process.
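As a rough illustration of the single-step idea described in the abstract, the sketch below shows a generic conditional flow-matching objective on a linear probability path together with a one-Euler-step sampler. This is not the authors' implementation; every name (VectorField, flow_matching_loss, single_step_sample) and the toy 256-dimensional conditioning features are hypothetical placeholders, assuming only the standard flow-matching recipe.

```python
# Minimal sketch: conditional flow matching with single-step sampling.
# All architecture and variable names are hypothetical, not FLowHigh's code.
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Tiny stand-in for the vector-field estimator (hypothetical architecture)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim * 2 + 1, 512), nn.SiLU(), nn.Linear(512, dim)
        )

    def forward(self, x_t, t, cond):
        # x_t: point on the path, t: time in [0, 1], cond: upsampled low-res features
        t = t.expand(x_t.size(0), 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def flow_matching_loss(model, x_hi, cond):
    """Conditional flow-matching objective on a straight (linear) probability path."""
    t = torch.rand(x_hi.size(0), 1)
    x0 = torch.randn_like(x_hi)           # sample from the noise prior
    x_t = (1 - t) * x0 + t * x_hi         # linear interpolation path
    target_v = x_hi - x0                  # constant velocity along that path
    pred_v = model(x_t, t, cond)
    return ((pred_v - target_v) ** 2).mean()

@torch.no_grad()
def single_step_sample(model, cond):
    """One Euler step from t=0 to t=1, i.e. the single-step sampling regime."""
    x0 = torch.randn_like(cond)
    v = model(x0, torch.zeros(1, 1), cond)
    return x0 + v                         # x1 ~= x0 + 1 * predicted velocity

# Toy usage with random tensors standing in for spectrogram frames.
model = VectorField(dim=256)
x_hi = torch.randn(8, 256)                # "high-resolution" targets
cond = torch.randn(8, 256)                # "upsampled low-resolution" conditioning
loss = flow_matching_loss(model, x_hi, cond)
output_frames = single_step_sample(model, cond)
```

Because the linear path has (approximately) constant velocity, a single Euler step can already land near the target distribution, which is the efficiency argument behind one-step sampling.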
Related papers
- FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation [61.61415607972597]
DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale.
High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs).
We propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality.
arXiv Detail & Related papers (2025-02-07T18:59:59Z) - Fast T2T: Optimization Consistency Speeds Up Diffusion-Based Training-to-Testing Solving for Combinatorial Optimization [83.65278205301576]
We propose to learn direct mappings from different noise levels to the optimal solution for a given instance, facilitating high-quality generation with minimal shots.
This is achieved through an optimization consistency training protocol, which minimizes the difference among samples mapped from different noise levels for the same instance.
Experiments on two popular tasks, the Traveling Salesman Problem (TSP) and Maximal Independent Set (MIS), demonstrate the superiority of Fast T2T regarding both solution quality and efficiency.
arXiv Detail & Related papers (2025-02-05T07:13:43Z) - Arbitrary-steps Image Super-resolution via Diffusion Inversion [68.78628844966019]
This study presents a new image super-resolution (SR) technique based on diffusion inversion, aiming at harnessing the rich image priors encapsulated in large pre-trained diffusion models to improve SR performance.
We design a Partial noise Prediction strategy to construct an intermediate state of the diffusion model, which serves as the starting sampling point.
Once trained, this noise predictor can be used to initialize the sampling process partially along the diffusion trajectory, generating the desirable high-resolution result.
arXiv Detail & Related papers (2024-12-12T07:24:13Z) - PTQ4ADM: Post-Training Quantization for Efficient Text Conditional Audio Diffusion Models [8.99127212785609]
This work introduces PTQ4ADM, a novel framework for quantizing audio diffusion models (ADMs).
Our key contributions include (1) a coverage-driven prompt augmentation method and (2) an activation-aware calibration set generation algorithm for text-conditional ADMs.
Extensive experiments demonstrate PTQ4ADM's capability to reduce the model size by up to 70% while achieving synthesis quality metrics comparable to full-precision models.
arXiv Detail & Related papers (2024-09-20T20:52:56Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - Frequency-Domain Refinement with Multiscale Diffusion for Super Resolution [7.29314801047906]
We propose a novel Frequency Domain-guided multiscale Diffusion model (FDDiff).
FDDiff decomposes the high-frequency information complementing process into finer-grained steps.
We show that FDDiff outperforms prior generative methods with higher-fidelity super-resolution results.
arXiv Detail & Related papers (2024-05-16T11:58:52Z) - ACDMSR: Accelerated Conditional Diffusion Models for Single Image Super-Resolution [84.73658185158222]
We propose a diffusion model-based super-resolution method called ACDMSR.
Our method adapts the standard diffusion model to perform super-resolution through a deterministic iterative denoising process.
Our approach generates more visually realistic counterparts for low-resolution images, emphasizing its effectiveness in practical scenarios.
arXiv Detail & Related papers (2023-07-03T06:49:04Z) - Nonparallel High-Quality Audio Super Resolution with Domain Adaptation and Resampling CycleGANs [9.593925140084846]
We propose a high-quality audio super-resolution method that can utilize unpaired data, based on two connected cycle-consistent generative adversarial networks (CycleGANs).
Our method decomposes the super-resolution task into domain adaptation and resampling processes to handle the acoustic mismatch in the unpaired low- and high-resolution signals.
Experimental results verify that the proposed method significantly outperforms conventional methods when paired data are not available.
arXiv Detail & Related papers (2022-10-28T04:32:59Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - CRASH: Raw Audio Score-based Generative Modeling for Controllable High-resolution Drum Sound Synthesis [0.0]
We propose a novel score-based generative model for unconditional raw audio synthesis.
Our proposed method closes the gap with GAN-based methods on raw audio, while offering more flexible generation capabilities.
arXiv Detail & Related papers (2021-06-14T13:48:03Z)