UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
- URL: http://arxiv.org/abs/2510.00771v1
- Date: Wed, 01 Oct 2025 11:04:53 GMT
- Title: UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching
- Authors: Woongjib Choi, Sangmin Lee, Hyungseob Lim, Hong-Goo Kang
- Abstract summary: We present a framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors.
- Score: 20.92242470770289
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we present a vocoder-free framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Unlike conventional two-stage diffusion-based approaches that predict a mel-spectrogram and then rely on a pre-trained neural vocoder to synthesize waveforms, our method directly reconstructs waveforms via the inverse Short-Time Fourier Transform (iSTFT), thereby eliminating the dependence on a separate vocoder. This design not only simplifies end-to-end optimization but also overcomes a critical bottleneck of two-stage pipelines, where the final audio quality is fundamentally constrained by vocoder performance. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors, achieving state-of-the-art performance on both speech and general audio datasets.
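The abstract combines two ideas: training a flow matching model on complex-valued STFT coefficients, and reconstructing the waveform directly with the inverse STFT instead of a neural vocoder. The sketch below is an illustrative assumption, not the paper's code: `cfm_target` and `istft` are hypothetical names, and the linear (optimal-transport) probability path is one common flow matching choice, not necessarily the authors' exact formulation.

```python
import numpy as np

def cfm_target(x1, rng):
    """Sample one conditional flow matching training example.
    x1: target complex STFT coefficients, shape (F, T).
    Returns (x_t, t, v): a point on the linear path from noise to data,
    the sampled time, and the constant velocity the network regresses.
    """
    t = rng.uniform()                                   # time in [0, 1]
    x0 = rng.standard_normal(x1.shape) \
        + 1j * rng.standard_normal(x1.shape)            # complex Gaussian prior
    xt = (1 - t) * x0 + t * x1                          # linear interpolation path
    v = x1 - x0                                         # optimal-transport velocity
    return xt, t, v

def istft(spec, n_fft=8, hop=4):
    """Minimal inverse STFT (Hann window, overlap-add) that turns predicted
    complex coefficients back into a waveform with no vocoder in the loop."""
    win = np.hanning(n_fft)
    frames = spec.shape[1]
    out = np.zeros((frames - 1) * hop + n_fft)
    norm = np.zeros_like(out)
    for m in range(frames):
        frame = np.fft.irfft(spec[:, m], n=n_fft) * win  # synthesis window
        out[m * hop:m * hop + n_fft] += frame
        norm[m * hop:m * hop + n_fft] += win ** 2        # per-sample window energy
    return out / np.maximum(norm, 1e-8)                  # WOLA normalization
```

At inference, one would integrate the learned velocity field from noise to a complex spectrogram and then call `istft` once, which is what makes the pipeline end-to-end optimizable without a separately trained vocoder.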
Related papers
- Prosody-Guided Harmonic Attention for Phase-Coherent Neural Vocoding in the Complex Spectrum [1.3066182802188198]
We introduce prosody-guided harmonic attention to enhance voiced segment encoding and directly predict complex spectral components for waveform synthesis via inverse STFT.
Experiments on benchmark datasets demonstrate consistent gains over HiFi-GAN and AutoVocoder: F0 RMSE reduced by 22 percent, voiced/unvoiced error lowered by 18 percent, and MOS scores improved by 0.15.
These results show that prosody-guided attention combined with direct complex spectrum modeling yields more natural, pitch-accurate, and robust synthetic speech, setting a strong foundation for expressive neural vocoding.
arXiv Detail & Related papers (2026-01-20T20:53:24Z) - WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching [1.6385815610837167]
WaveFM is a flow matching model for mel-spectrogram conditioned speech synthesis.
Our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders.
arXiv Detail & Related papers (2025-03-20T20:17:17Z) - Training Universal Vocoders with Feature Smoothing-Based Augmentation Methods for High-Quality TTS Systems [6.998597120755703]
We present a novel augmentation technique for training universal vocoders.
Our training scheme randomly applies linear smoothing filters to input acoustic features.
It significantly mitigates the training-inference mismatch, enhancing the naturalness of synthetic output.
arXiv Detail & Related papers (2024-09-04T08:25:54Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z) - NeuralDPS: Neural Deterministic Plus Stochastic Model with Multiband Excitation for Noise-Controllable Waveform Generation [67.96138567288197]
We propose a novel neural vocoder named NeuralDPS, which retains high speech quality while achieving high synthesis efficiency and noise controllability.
It generates waveforms at least 280 times faster than the WaveNet vocoder.
It is also 28% faster than WaveGAN in synthesis efficiency on a single CPU core.
arXiv Detail & Related papers (2022-03-05T08:15:29Z) - RAVE: A variational autoencoder for fast and high-quality neural audio synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48 kHz audio signals while running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z) - Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.