Nonparallel High-Quality Audio Super Resolution with Domain Adaptation
and Resampling CycleGANs
- URL: http://arxiv.org/abs/2210.15887v1
- Date: Fri, 28 Oct 2022 04:32:59 GMT
- Title: Nonparallel High-Quality Audio Super Resolution with Domain Adaptation
and Resampling CycleGANs
- Authors: Reo Yoneyama, Ryuichi Yamamoto, Kentaro Tachibana
- Abstract summary: We propose a high-quality audio super-resolution method that can utilize unpaired data, based on two connected cycle-consistent generative adversarial networks (CycleGANs).
Our method decomposes super-resolution into domain adaptation and resampling processes to handle the acoustic mismatch between the unpaired low- and high-resolution signals.
Experimental results verify that the proposed method significantly outperforms conventional methods when paired data are not available.
- Score: 9.593925140084846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural audio super-resolution models are typically trained on low- and
high-resolution audio signal pairs. Although these methods achieve highly
accurate super-resolution if the acoustic characteristics of the input data are
similar to those of the training data, challenges remain: the models suffer
from quality degradation for out-of-domain data, and paired data are required
for training. To address these problems, we propose Dual-CycleGAN, a
high-quality audio super-resolution method that can utilize unpaired data based
on two connected cycle-consistent generative adversarial networks (CycleGANs).
Our method decomposes super-resolution into domain adaptation and resampling
processes to handle the acoustic mismatch between the unpaired low- and
high-resolution signals. The two processes are then jointly optimized within
the CycleGAN framework. Experimental results verify that the proposed method
significantly outperforms conventional methods when paired data are not
available. Code and audio samples are available from
https://chomeyama.github.io/DualCycleGAN-Demo/.
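The abstract does not spell out the training objective, but the cycle-consistency idea that both of Dual-CycleGAN's constituent CycleGANs rely on can be sketched as follows. The generator callables, L1 distance, and loss weight below are generic CycleGAN conventions, not the paper's exact formulation.

```python
import numpy as np

def cycle_consistency_loss(x_a, x_b, g_ab, g_ba, lam=10.0):
    """Generic CycleGAN cycle-consistency term (L1 distance).

    g_ab maps domain A -> B and g_ba maps B -> A; training pushes the
    A -> B -> A and B -> A -> B round trips back to the inputs, which is
    what makes unpaired data usable. Dual-CycleGAN couples two such
    cycles (domain adaptation and resampling); see the paper for the
    full objective.
    """
    cycle_a = np.abs(g_ba(g_ab(x_a)) - x_a).mean()  # A -> B -> A
    cycle_b = np.abs(g_ab(g_ba(x_b)) - x_b).mean()  # B -> A -> B
    return lam * (cycle_a + cycle_b)

# Toy check with identity "generators": a perfect round trip gives zero loss.
x_a = np.zeros((1, 160))  # stand-in for a low-resolution-domain signal
x_b = np.ones((1, 160))   # stand-in for a high-resolution-domain signal
identity = lambda x: x
print(cycle_consistency_loss(x_a, x_b, identity, identity))  # 0.0
```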
Related papers
- Model and Deep learning based Dynamic Range Compression Inversion [12.002024727237837]
Inverting DRC can help restore the original dynamics, enabling new mixes and improving the overall quality of the audio signal.
We propose a model-based approach with neural networks for DRC inversion.
Our results show the effectiveness and robustness of the proposed method in comparison to several state-of-the-art methods.
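For intuition, DRC with a static gain curve is invertible in closed form; the real difficulty the paper addresses with neural networks is the time-varying case. A toy static sketch (threshold and ratio values are illustrative, not from the paper):

```python
import numpy as np

def drc_gain_db(level_db, threshold_db=-20.0, ratio=4.0):
    """Attenuation (in dB) applied by a static compressor: above the
    threshold, the output level rises at only 1/ratio the input rate."""
    over = np.maximum(level_db - threshold_db, 0.0)
    return -over * (1.0 - 1.0 / ratio)

def invert_static_drc(out_db, threshold_db=-20.0, ratio=4.0):
    """Closed-form inverse of the static curve: recover the
    pre-compression level from the compressed output level."""
    over_out = np.maximum(out_db - threshold_db, 0.0)
    return out_db + over_out * (ratio - 1.0)

compressed = -8.0 + drc_gain_db(-8.0)  # -8 dB input -> -17 dB output
print(invert_static_drc(compressed))   # -8.0 (original level restored)
```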
arXiv Detail & Related papers (2024-11-07T00:33:07Z)
- A Bilevel Optimization Framework for Imbalanced Data Classification [1.6385815610837167]
We propose a new undersampling approach that avoids the pitfalls of noise and overlap caused by synthetic data.
Instead of undersampling majority data randomly, our method undersamples datapoints based on their ability to improve model loss.
Using improved model loss as a proxy measurement for classification performance, our technique assesses a datapoint's impact on loss and rejects those unable to improve it.
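The selection criterion can be sketched as below; this collapses the paper's bilevel optimization into a single per-sample loss ranking, so treat it as an illustration of the idea only (the function name and keep fraction are invented for the example).

```python
import numpy as np

def loss_guided_undersample(majority_losses, keep_fraction=0.5):
    """Keep the majority-class samples with the highest per-sample loss
    (a crude proxy for 'ability to improve model loss') and reject the
    rest; returns the sorted indices of the retained samples."""
    k = max(1, int(len(majority_losses) * keep_fraction))
    order = np.argsort(majority_losses)[::-1]  # highest loss first
    return np.sort(order[:k])

losses = np.array([0.1, 0.9, 0.05, 0.7, 0.3])  # toy per-sample losses
print(loss_guided_undersample(losses, keep_fraction=0.4))  # [1 3]
```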
arXiv Detail & Related papers (2024-10-15T01:17:23Z)
- Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data [69.7174072745851]
We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data.
To align the generations of the T2A model with the small-scale dataset, we use preference optimization.
To increase the diversity of the synthetic data, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models.
arXiv Detail & Related papers (2024-10-02T22:05:36Z)
- DiffSED: Sound Event Detection with Denoising Diffusion [70.18051526555512]
We reformulate the SED problem by taking a generative learning perspective.
Specifically, we aim to generate sound temporal boundaries from noisy proposals in a denoising diffusion process.
During training, our model learns to reverse the noising process by converting noisy latent queries to their ground-truth versions.
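The forward (noising) half of that process follows the standard DDPM recipe applied to boundary coordinates; the sketch below shows only this generic step, not DiffSED's denoiser or noise schedule.

```python
import numpy as np

def noise_boundaries(x0, alpha_bar_t, rng):
    """Standard DDPM forward step, x_t = sqrt(abar)*x0 + sqrt(1-abar)*eps,
    applied to normalized (onset, offset) event boundaries. A denoising
    network is then trained to map x_t back toward x0."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps

rng = np.random.default_rng(0)
boundaries = np.array([[0.10, 0.45],   # event 1: onset, offset (normalized)
                       [0.60, 0.92]])  # event 2
noisy = noise_boundaries(boundaries, alpha_bar_t=0.5, rng=rng)
print(noisy.shape)  # (2, 2)
```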
arXiv Detail & Related papers (2023-08-14T17:29:41Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
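The patch partitioning mentioned in the summary can be sketched as follows (the patch length and zero-padding policy are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def to_patches(signal, patch_len):
    """Split a 1-D signal into non-overlapping, fixed-length patches,
    zero-padding the tail so the last patch is full length. Processing
    small patches instead of the whole signal is what reduces the
    computational cost."""
    pad = (-len(signal)) % patch_len
    padded = np.pad(signal, (0, pad))
    return padded.reshape(-1, patch_len)

x = np.arange(10, dtype=float)
patches = to_patches(x, patch_len=4)
print(patches.shape)  # (3, 4)
```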
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- DuDGAN: Improving Class-Conditional GANs via Dual-Diffusion [2.458437232470188]
Class-conditional image generation using generative adversarial networks (GANs) has been investigated through various techniques.
We propose a novel approach for class-conditional image generation using GANs called DuDGAN, which incorporates a dual diffusion-based noise injection process.
Our method outperforms state-of-the-art conditional GAN models for image generation.
arXiv Detail & Related papers (2023-05-24T07:59:44Z)
- Conditional Denoising Diffusion for Sequential Recommendation [62.127862728308045]
Two prominent generative models, Generative Adversarial Networks (GANs) and Variational AutoEncoders (VAEs), have well-known drawbacks: GANs suffer from unstable optimization, while VAEs are prone to posterior collapse and over-smoothed generations.
We present a conditional denoising diffusion model, which includes a sequence encoder, a cross-attentive denoising decoder, and a step-wise diffuser.
arXiv Detail & Related papers (2023-04-22T15:32:59Z)
- Learning Phone Recognition from Unpaired Audio and Phone Sequences Based on Generative Adversarial Network [58.82343017711883]
This paper investigates how to learn directly from unpaired phone sequences and speech utterances.
GAN training is adopted in the first stage to find the mapping relationship between unpaired speech and phone sequences.
In the second stage, another HMM model is introduced to train from the generator's output, which boosts the performance.
arXiv Detail & Related papers (2022-07-29T09:29:28Z)
- SPI-GAN: Denoising Diffusion GANs with Straight-Path Interpolations [27.487728842037935]
We present an enhanced GAN-based denoising method, called SPI-GAN, using our proposed straight-path interpolation.
SPI-GAN is among the best-balanced models in terms of sampling quality, diversity, and sampling time on CIFAR-10 and CelebA-HQ-256.
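The straight-path interpolation itself is simply a linear blend between a data sample and Gaussian noise, in contrast to the curved trajectory of standard diffusion. A minimal sketch (variable names are mine, not the paper's):

```python
import numpy as np

def straight_path(x0, noise, t):
    """Point at time t on the straight path from data x0 (t = 0) to
    noise (t = 1): x_t = (1 - t) * x0 + t * noise."""
    return (1.0 - t) * x0 + t * noise

x0 = np.array([1.0, -2.0, 0.5])
eps = np.zeros(3)
midpoint = straight_path(x0, eps, 0.5)  # halfway along the path: x0 / 2
```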
arXiv Detail & Related papers (2022-06-29T08:40:55Z)
- RefineGAN: Universally Generating Waveform Better than Ground Truth with Highly Accurate Pitch and Intensity Responses [15.599745604729842]
We propose RefineGAN, a high-fidelity neural vocoder with faster-than-real-time generation capability.
We employ a pitch-guided refine architecture with a multi-scale spectrogram-based loss function to help stabilize the training process.
We show that fidelity is even improved during waveform reconstruction by eliminating defects produced by the speaker.
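A multi-scale spectrogram loss of the kind mentioned in the summary is commonly built by comparing STFT magnitudes at several FFT resolutions; the sketch below uses an L1 magnitude distance and illustrative FFT sizes, and is not RefineGAN's exact loss.

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    """Magnitude STFT via Hann-windowed, framed real FFT.
    Assumes len(x) >= n_fft."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_scale_spec_loss(x, y, fft_sizes=(512, 1024, 2048)):
    """Average L1 distance between spectrogram magnitudes of the
    generated (x) and reference (y) waveforms across several scales."""
    loss = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        loss += np.abs(stft_mag(x, n_fft, hop) - stft_mag(y, n_fft, hop)).mean()
    return loss / len(fft_sizes)

rng = np.random.default_rng(0)
wav = rng.standard_normal(4096)
print(multi_scale_spec_loss(wav, wav))  # 0.0
```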
arXiv Detail & Related papers (2021-11-01T14:12:54Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.