Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder
- URL: http://arxiv.org/abs/2008.06867v1
- Date: Sun, 16 Aug 2020 09:37:18 GMT
- Title: Audio Dequantization for High Fidelity Audio Generation in Flow-based Neural Vocoder
- Authors: Hyun-Wook Yoon, Sang-Hoon Lee, Hyeong-Rae Noh, Seong-Whan Lee
- Abstract summary: Flow-based neural vocoders have shown significant improvement in the real-time speech generation task.
We propose audio dequantization methods in a flow-based neural vocoder for high fidelity audio generation.
- Score: 29.63675159839434
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent works, flow-based neural vocoders have shown significant
improvement in the real-time speech generation task. A sequence of invertible
flow operations allows the model to convert samples from a simple distribution
into audio samples. However, training a continuous density model on discrete
audio data can degrade model performance due to the topological difference
between the latent and actual distributions. To resolve this problem, we
propose audio dequantization methods in a flow-based neural vocoder for high
fidelity audio generation. Data dequantization is a well-known method in image
generation but has not yet been studied in the audio domain. For this reason,
we implement various audio dequantization methods in a flow-based neural
vocoder and investigate their effect on the generated audio. We conduct
various objective performance assessments and a subjective evaluation to show
that audio dequantization can improve audio generation quality. In our
experiments, audio dequantization produces waveform audio with better harmonic
structure and fewer digital artifacts.
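The core trick behind dequantization is easy to sketch: spread each discrete
amplitude level over a continuous unit-width bin by adding noise, so the flow
model never trains on a degenerate discrete distribution. The abstract does
not name the specific methods compared, so the sketch below shows plain
uniform dequantization of 16-bit PCM in NumPy; the function name, bit depth,
and constants are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def uniform_dequantize(pcm: np.ndarray, bits: int = 16) -> np.ndarray:
    """Map discrete PCM samples to continuous values by adding one draw of
    uniform noise u ~ U[0, 1) per sample at the quantization-bin scale.
    Training a continuous-density flow on such samples maximizes a lower
    bound on the discrete log-likelihood of the original audio.
    """
    levels = 2 ** bits
    # Shift signed integers to non-negative bin indices in [0, levels).
    bins = pcm.astype(np.float64) + levels / 2
    # Spread each bin over a unit interval, then rescale to [-1, 1).
    continuous = (bins + np.random.uniform(0.0, 1.0, size=pcm.shape)) / levels
    return 2.0 * continuous - 1.0

# Toy usage: one second of random 16-bit audio at 22.05 kHz.
pcm = np.random.randint(-32768, 32768, size=22050, dtype=np.int16)
x = uniform_dequantize(pcm)
assert x.min() >= -1.0 and x.max() < 1.0
```

Variational dequantization replaces the fixed uniform noise with a learned
noise distribution; either way the vocoder itself is unchanged and only the
training targets become continuous.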
Related papers
- Learning Source Disentanglement in Neural Audio Codec [20.335701584949526]
We introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation.
By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, i.e., sets of discrete representations.
Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space.
arXiv Detail & Related papers (2024-09-17T14:21:02Z)
- Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.
We propose Frieren, a V2A model based on rectified flow matching.
Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to producing audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- HiddenSinger: High-Quality Singing Voice Synthesis via Neural Audio Codec and Latent Diffusion Models [25.966328901566815]
We propose HiddenSinger, a high-quality singing voice synthesis system using a neural audio codec and latent diffusion models.
In addition, the proposed model is extended to HiddenSinger-U, an unsupervised singing voice learning framework.
Experimental results demonstrate that our model outperforms previous models in terms of audio quality.
arXiv Detail & Related papers (2023-06-12T01:21:41Z)
- Audio-Visual Speech Enhancement with Score-Based Generative Models [22.559617939136505]
This paper introduces an audio-visual speech enhancement system that leverages score-based generative models.
We exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading.
Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality.
arXiv Detail & Related papers (2023-06-02T10:43:42Z)
- High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space, trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z)
- Hierarchical Diffusion Models for Singing Voice Neural Vocoder [21.118585353100634]
We propose a hierarchical diffusion model for singing voice neural vocoders.
Experimental results show that the proposed method produces high-quality singing voices for multiple singers.
arXiv Detail & Related papers (2022-10-14T04:30:09Z)
- AudioGen: Textually Guided Audio Generation [116.57006301417306]
We tackle the problem of generating audio samples conditioned on descriptive text captions.
In this work, we propose AudioGen, an auto-regressive model that generates audio samples conditioned on text inputs.
arXiv Detail & Related papers (2022-09-30T10:17:05Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The noise shaping is performed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- SoundStream: An End-to-End Neural Audio Codec [78.94923131038682]
We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio.
SoundStream relies on a fully convolutional encoder/decoder network and a residual vector quantizer (sketched after this entry), which are trained jointly end-to-end.
We are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency.
arXiv Detail & Related papers (2021-07-07T15:45:42Z)
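As an illustration of the residual vector quantizer mentioned in the
SoundStream summary above, here is a minimal NumPy sketch under assumed
shapes; the stage count, codebook size, and latent dimension are invented for
the example, and SoundStream learns its codebooks jointly with the
encoder/decoder rather than fixing them as done here.

```python
import numpy as np

def rvq_encode(x: np.ndarray, codebooks: list[np.ndarray]) -> list[np.ndarray]:
    """Residual vector quantization: each stage quantizes the residual
    left over by the previous stage against its own codebook."""
    residual = x
    indices = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        # Nearest codeword for every frame (squared L2 distance).
        d = ((residual[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1)
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices: list[np.ndarray], codebooks: list[np.ndarray]) -> np.ndarray:
    """Reconstruction is the sum of the selected codewords across stages."""
    return sum(cb[idx] for cb, idx in zip(codebooks, indices))

# Toy usage: 8 quantizer stages, 1024-entry codebooks, 128-dim latents.
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 128)) for _ in range(8)]
latents = rng.normal(size=(50, 128))  # 50 encoder output frames
codes = rvq_encode(latents, codebooks)
recon = rvq_decode(codes, codebooks)
```

Each stage only has to encode what the previous stages missed, which is why a
stack of small codebooks can approach the rate of one enormous codebook at a
fraction of the search cost.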