AudioSR: Versatile Audio Super-resolution at Scale
- URL: http://arxiv.org/abs/2309.07314v1
- Date: Wed, 13 Sep 2023 21:00:09 GMT
- Title: AudioSR: Versatile Audio Super-resolution at Scale
- Authors: Haohe Liu, Ke Chen, Qiao Tian, Wenwu Wang, Mark D. Plumbley
- Abstract summary: We introduce a diffusion-based generative model, AudioSR, that is capable of performing robust audio super-resolution on versatile audio types.
Specifically, AudioSR can upsample any input audio signal within the bandwidth range of 2kHz to 16kHz to a high-resolution audio signal at 24kHz bandwidth.
- Score: 32.36683443201372
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio super-resolution is a fundamental task that predicts high-frequency
components for low-resolution audio, enhancing audio quality in digital
applications. Previous methods have limitations such as the limited scope of
audio types (e.g., music, speech) and specific bandwidth settings they can
handle (e.g., 4kHz to 8kHz). In this paper, we introduce a diffusion-based
generative model, AudioSR, that is capable of performing robust audio
super-resolution on versatile audio types, including sound effects, music, and
speech. Specifically, AudioSR can upsample any input audio signal within the
bandwidth range of 2kHz to 16kHz to a high-resolution audio signal at 24kHz
bandwidth with a sampling rate of 48kHz. Extensive objective evaluation on
various audio super-resolution benchmarks demonstrates the strong result
achieved by the proposed model. In addition, our subjective evaluation shows
that AudioSR can act as a plug-and-play module to enhance the generation
quality of a wide range of audio generative models, including AudioLDM,
Fastspeech2, and MusicGen. Our code and demo are available at
https://audioldm.github.io/audiosr.
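To make the bandwidth figures concrete: a 48 kHz sampling rate can represent content up to its 24 kHz Nyquist limit, while the input's effective bandwidth (2 kHz to 16 kHz) is the highest frequency actually present. Below is a minimal sketch of preparing such a band-limited test input, assuming only numpy and scipy; it simulates the degraded signal AudioSR takes as input, not the AudioSR model itself (which is available at the URL above).
```python
# Simulate the band-limited inputs AudioSR is designed to restore: a 48 kHz
# waveform whose effective bandwidth is capped well below the 24 kHz Nyquist limit.
import numpy as np
from scipy.signal import butter, sosfiltfilt

def simulate_band_limited(waveform: np.ndarray, sr: int, cutoff_hz: float) -> np.ndarray:
    """Low-pass `waveform` (sampled at `sr` Hz) to an effective bandwidth of `cutoff_hz`."""
    # 8th-order Butterworth low-pass in second-order sections, applied
    # forward-backward (zero phase distortion).
    sos = butter(8, cutoff_hz, btype="low", fs=sr, output="sos")
    return sosfiltfilt(sos, waveform)

sr = 48000                                   # full bandwidth = sr / 2 = 24 kHz (Nyquist)
x = np.random.randn(2 * sr)                  # 2 s of white noise at 48 kHz
x_lr = simulate_band_limited(x, sr, 4000.0)  # 4 kHz bandwidth, inside AudioSR's 2-16 kHz input range
```
A super-resolution model would then predict the missing content between the 4 kHz cutoff and the 24 kHz Nyquist limit.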
Related papers
- Audio Mamba: Pretrained Audio State Space Model For Audio Tagging [1.2123876307427102]
We propose Audio Mamba, a self-attention-free approach that captures long-range audio spectrogram dependencies with state space models.
Our experimental results on two audio-tagging datasets demonstrate the parameter efficiency of Audio Mamba: it achieves results comparable to SOTA audio spectrogram transformers with one third of the parameters.
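As a rough illustration of what a state-space sequence layer computes, here is a toy linear recurrence in numpy; it is not Audio Mamba's actual (learned, selective) architecture, and all names are illustrative.
```python
# Toy linear state-space recurrence: x_t = A x_{t-1} + B u_t, y_t = C x_t.
import numpy as np

def ssm_scan(u: np.ndarray, A: np.ndarray, B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """Run the recurrence over a 1-D input sequence `u`, one output per step."""
    x = np.zeros(A.shape[0])
    ys = np.empty(len(u))
    for t, u_t in enumerate(u):
        x = A @ x + B * u_t   # state update: constant cost per step
        ys[t] = C @ x         # readout
    return ys

rng = np.random.default_rng(0)
n = 4                         # state dimension
A = 0.9 * np.eye(n)           # stable dynamics (spectral radius < 1)
B, C = rng.normal(size=n), rng.normal(size=n)
y = ssm_scan(rng.normal(size=1000), A, B, C)
```
Because the state update is constant-cost per step, such layers scale linearly with sequence length, unlike quadratic self-attention.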
arXiv Detail & Related papers (2024-05-22T13:35:56Z) - Fast Timing-Conditioned Latent Audio Diffusion [8.774733281142021]
Stable Audio is capable of rendering stereo signals of up to 95 sec at 44.1kHz in 8 sec on an A100 GPU.
It is one of the best in two public text-to-music and -audio benchmarks and, differently from state-of-the-art models, can generate music with structure and stereo sounds.
arXiv Detail & Related papers (2024-02-07T13:23:25Z) - Retrieval-Augmented Text-to-Audio Generation [36.328134891428085]
We show that state-of-the-art models, such as AudioLDM, are biased in their generation performance.
We propose a simple retrieval-augmented approach for TTA models.
We show that Re-AudioLDM can generate realistic audio for complex scenes, rare audio classes, and even unseen audio types.
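A hedged sketch of the retrieval step such an approach relies on: embed the text query, fetch the nearest cached text-audio pairs, and feed them to the generator as extra conditioning. The datastore and embeddings below are random stand-ins, not Re-AudioLDM's actual components.
```python
# Toy nearest-neighbour retrieval over cached caption embeddings.
# A real TTA system would embed captions with a pretrained encoder.
import numpy as np

def retrieve(query_emb: np.ndarray, datastore: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k most cosine-similar datastore entries."""
    q = query_emb / np.linalg.norm(query_emb)
    d = datastore / np.linalg.norm(datastore, axis=1, keepdims=True)
    return np.argsort(d @ q)[::-1][:k]

rng = np.random.default_rng(1)
datastore = rng.normal(size=(1000, 512))   # 1000 cached caption embeddings
idx = retrieve(rng.normal(size=512), datastore)
# The retrieved pairs would then serve as extra conditioning for the diffusion model.
```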
arXiv Detail & Related papers (2023-09-14T22:35:39Z) - AdVerb: Visually Guided Audio Dereverberation [49.958724234969445]
We present AdVerb, a novel audio-visual dereverberation framework.
It uses visual cues in addition to the reverberant sound to estimate clean audio.
arXiv Detail & Related papers (2023-08-23T18:20:59Z) - Separate Anything You Describe [55.0784713558149]
Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA).
AudioSep is a foundation model for open-domain audio source separation with natural language queries.
arXiv Detail & Related papers (2023-08-09T16:09:44Z) - Make-An-Audio: Text-To-Audio Generation with Prompt-Enhanced Diffusion Models [65.18102159618631]
Multimodal generative modeling has created milestones in text-to-image and text-to-video generation.
Its application to audio still lags behind for two main reasons: the lack of large-scale datasets with high-quality text-audio pairs, and the complexity of modeling long continuous audio data.
We propose Make-An-Audio with a prompt-enhanced diffusion model that addresses these gaps.
arXiv Detail & Related papers (2023-01-30T04:44:34Z) - AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z) - SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio system that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
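The residual vector quantizer is what makes the bitrate scalable: each codebook quantizes the residual the previous stage left behind. A minimal numpy sketch with random, untrained codebooks follows (SoundStream learns its codebooks jointly with the encoder and decoder).
```python
# Minimal residual vector quantization (RVQ): each codebook quantizes the
# residual left by the previous one, coarse to fine.
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list) -> list:
    residual = x.copy()
    codes = []
    for cb in codebooks:    # cb: (num_codes, dim)
        idx = int(np.argmin(np.sum((cb - residual) ** 2, axis=1)))
        codes.append(idx)
        residual -= cb[idx]  # hand the remainder to the next stage
    return codes

def rvq_decode(codes: list, codebooks: list) -> np.ndarray:
    return sum(cb[i] for i, cb in zip(codes, codebooks))

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(256, 8)) for _ in range(4)]  # 4 stages, 256 codes each
x = rng.normal(size=8)
x_hat = rvq_decode(rvq_encode(x, codebooks), codebooks)    # reconstruction improves with more stages
```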
arXiv Detail & Related papers (2021-07-07T15:45:42Z) - NU-GAN: High resolution neural upsampling with GAN [60.02736450639215]
NU-GAN is a new method for resampling audio from lower to higher sampling rates (upsampling).
High-fidelity applications use audio at a resolution of 44.1 kHz or 48 kHz, whereas current speech synthesis methods handle at most 24 kHz.
ABX preference tests indicate that the NU-GAN resampler can upsample 22 kHz audio to 44.1 kHz such that listeners distinguish it from the original only 7.4% above random chance on a single-speaker dataset and 10.8% above chance on a multi-speaker dataset; a classical resampling baseline is sketched below for contrast.
arXiv Detail & Related papers (2020-10-22T01:00:23Z)
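For contrast, the classical baseline is plain polyphase resampling, which raises the sampling rate without adding any energy above the original Nyquist frequency; synthesizing that missing band is precisely what a generative upsampler like NU-GAN is for. A sketch using scipy (an illustration, not NU-GAN's code):
```python
# Classical 22.05 kHz -> 44.1 kHz polyphase resampling. This doubles the
# sampling rate but leaves the band above the original 11.025 kHz Nyquist
# limit empty; a generative upsampler must synthesize that content.
import numpy as np
from scipy.signal import resample_poly

sr_in, sr_out = 22050, 44100
x = np.random.randn(sr_in)                   # 1 s of audio at 22.05 kHz
y = resample_poly(x, up=sr_out, down=sr_in)  # effective ratio 2:1
assert len(y) == sr_out
```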