Axial Residual Networks for CycleGAN-based Voice Conversion
- URL: http://arxiv.org/abs/2102.08075v1
- Date: Tue, 16 Feb 2021 10:55:35 GMT
- Title: Axial Residual Networks for CycleGAN-based Voice Conversion
- Authors: Jaeseong You, Gyuhyeon Nam, Dalhyun Kim, Gyeongsu Chae
- Abstract summary: We propose a novel architecture and improved training objectives for non-parallel voice conversion.
Our proposed CycleGAN-based model performs a shape-preserving transformation directly on a high frequency-resolution magnitude spectrogram.
We demonstrate via experiments that our proposed model outperforms Scyclone and shows a comparable or better performance to that of CycleGAN-VC2 even without employing a neural vocoder.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel architecture and improved training objectives for
non-parallel voice conversion. Our proposed CycleGAN-based model performs a
shape-preserving transformation directly on a high frequency-resolution
magnitude spectrogram, converting its style (i.e. speaker identity) while
preserving the speech content. Throughout the entire conversion process, the
model does not resort to compressed intermediate representations of any sort
(e.g. mel spectrogram, low resolution spectrogram, decomposed network feature).
We propose an efficient axial residual block architecture to support this
expensive procedure and various modifications to the CycleGAN losses to
stabilize the training process. We demonstrate via experiments that our
proposed model outperforms Scyclone and shows a comparable or better
performance to that of CycleGAN-VC2 even without employing a neural vocoder.
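The paper itself ships no code, but its central component, an axial residual block that factorizes a 2D operation over the spectrogram into separate 1D convolutions along the frequency and time axes, can be sketched in a few lines. The PyTorch block below is a minimal illustration under stated assumptions: the kernel sizes, instance normalization, and LeakyReLU activation are guesses, not the authors' exact design.

```python
import torch
import torch.nn as nn

class AxialResidualBlock(nn.Module):
    """Shape-preserving residual block with axially factorized convolutions."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # Convolve along the frequency axis only (kernel k x 1) ...
        self.freq_conv = nn.Conv2d(channels, channels, (kernel_size, 1), padding=(pad, 0))
        # ... then along the time axis only (kernel 1 x k).
        self.time_conv = nn.Conv2d(channels, channels, (1, kernel_size), padding=(0, pad))
        self.norm1 = nn.InstanceNorm2d(channels)
        self.norm2 = nn.InstanceNorm2d(channels)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time_frames); shape is preserved throughout.
        h = self.act(self.norm1(self.freq_conv(x)))
        h = self.act(self.norm2(self.time_conv(h)))
        return x + h  # residual connection keeps the transform shape-preserving

# Example: a batch shaped like a high frequency-resolution magnitude spectrogram.
spec = torch.randn(1, 64, 513, 128)          # 513 frequency bins, 128 frames
out = AxialResidualBlock(64)(spec)
assert out.shape == spec.shape
```

Splitting the two axes keeps the per-layer cost roughly linear rather than quadratic in kernel size, which is what makes operating directly on a full-resolution spectrogram affordable.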
Related papers
- Corner-to-Center Long-range Context Model for Efficient Learned Image Compression [70.0411436929495]
In the framework of learned image compression, the context model plays a pivotal role in capturing the dependencies among latent representations.
We propose the Corner-to-Center transformer-based Context Model (C$^3$M), designed to enhance context and latent predictions.
In addition, to enlarge the receptive field in the analysis and synthesis transformation, we use the Long-range Crossing Attention Module (LCAM) in the encoder/decoder.
arXiv Detail & Related papers (2023-11-29T21:40:28Z)
- Effective Invertible Arbitrary Image Rescaling [77.46732646918936]
Invertible Neural Networks (INN) are able to increase upscaling accuracy significantly by optimizing the downscaling and upscaling cycle jointly.
In this work, a simple and effective invertible arbitrary rescaling network (IARN) is proposed that achieves arbitrary image rescaling by training only one model.
It is shown to achieve state-of-the-art (SOTA) performance in bidirectional arbitrary rescaling without compromising the perceptual quality of low-resolution (LR) outputs.
arXiv Detail & Related papers (2022-09-26T22:22:30Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- CycleTransGAN-EVC: A CycleGAN-based Emotional Voice Conversion Model with Transformer [11.543807097834785]
We propose a CycleGAN-based model with the transformer and investigate its ability in the emotional voice conversion task.
In the training procedure, we adopt curriculum learning to gradually increase the frame length, so that the model first sees short segments and eventually the entire speech; one hypothetical schedule of this kind is sketched below.
The results show that our proposed model is able to convert emotion with higher strength and quality.
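The abstract gives no concrete schedule; the sketch below shows one hypothetical way to realize such a curriculum, growing the training segment length in stages until whole utterances are used. The step boundaries and lengths are illustrative, not from the paper.

```python
def segment_length(step: int, full_length: int = 1024) -> int:
    """Return the training segment length (in frames) for a given step."""
    # Hypothetical stage boundaries: short clips first, full utterances last.
    schedule = [(10_000, 128), (20_000, 256), (30_000, 512)]
    for boundary, length in schedule:
        if step < boundary:
            return length
    return full_length  # final stage: the entire speech

for step in (0, 15_000, 40_000):
    print(step, segment_length(step))  # 128, 256, 1024
```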
arXiv Detail & Related papers (2021-11-30T06:33:57Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
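DiffuSE's exact parameterization and conditioning are not reproduced here; the sketch below is only the generic DDPM reverse update that diffusion-based enhancement models build on, with `eps_model` standing in as a hypothetical noise-prediction network and the conditioning on the noisy input omitted for brevity.

```python
import torch

def reverse_step(x_t: torch.Tensor, t: int, eps_model, betas: torch.Tensor) -> torch.Tensor:
    """One ancestral sampling step x_t -> x_{t-1} of a DDPM."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar_t = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = eps_model(x_t, t)  # predicted noise at step t
    mean = (x_t - beta_t / torch.sqrt(1.0 - alpha_bar_t) * eps) / torch.sqrt(alpha_t)
    if t == 0:
        return mean  # the final step adds no fresh noise
    return mean + torch.sqrt(beta_t) * torch.randn_like(x_t)
```

Iterating this step from t = T - 1 down to 0 turns Gaussian noise into a clean signal estimate.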
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction [38.828260316517536]
This paper presents a low-latency real-time (LLRT) non-parallel voice conversion framework based on a cyclic variational autoencoder (CycleVAE) and multiband WaveRNN with data-driven linear prediction (MWDLP).
The proposed framework achieves high-performance VC while allowing LLRT usage on a single core of a $2.1$--$2.7$ GHz CPU, with a real-time factor of $0.87$--$0.95$ (including input/output and feature extraction) at a frame shift of $10$ ms and a window length of $27.5$ ms.
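For context, the real-time factor quoted above is simply processing time divided by audio duration, so values below 1.0 mean the system keeps up with the incoming signal. The arithmetic is trivial but worth making explicit:

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF < 1.0 means faster than real time."""
    return processing_seconds / audio_seconds

print(real_time_factor(8.7, 10.0))  # 0.87: 10 s of audio converted in 8.7 s
```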
arXiv Detail & Related papers (2021-05-20T16:06:11Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training.
We force the generator to get closer to the departure-from-normality function of real samples, computed in the spectral domain via Schur decomposition.
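How the penalty enters the GAN objective is not reproduced here, but the underlying quantity, Henrici's departure from normality read off the Schur form $A = QTQ^H$, is straightforward to compute. The NumPy/SciPy sketch below shows just that quantity.

```python
import numpy as np
from scipy.linalg import schur

def departure_from_normality(a: np.ndarray) -> float:
    """Frobenius norm of the strictly upper-triangular part of the Schur factor T."""
    t, _ = schur(a, output="complex")  # A = Q T Q^H with T upper triangular
    off = t - np.diag(np.diag(t))      # zero exactly when A is normal
    return float(np.linalg.norm(off, "fro"))

print(departure_from_normality(np.array([[1.0, 2.0], [0.0, 3.0]])))  # 2.0
print(departure_from_normality(np.eye(3)))                           # 0.0 (normal)
```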
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
- Non-parallel Emotion Conversion using a Deep-Generative Hybrid Network and an Adversarial Pair Discriminator [16.18921154013272]
We introduce a novel method for emotion conversion in speech that does not require parallel training data.
Unlike the conventional CycleGAN, our discriminator classifies whether a pair of real and generated samples corresponds to the desired emotion conversion.
We show that our model generalizes to new speakers by modifying speech produced by WaveNet.
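The paper's network is not reproduced here; the sketch below only illustrates the pairing idea, a discriminator that scores a (real, generated) pair rather than a single sample. The channel-wise concatenation and layer sizes are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class PairDiscriminator(nn.Module):
    """Scores whether a (real, generated) pair matches the desired conversion."""

    def __init__(self, feat_dim: int = 80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2 * feat_dim, 128, kernel_size=5, padding=2),
            nn.LeakyReLU(0.2),
            nn.Conv1d(128, 1, kernel_size=5, padding=2),
        )

    def forward(self, real: torch.Tensor, generated: torch.Tensor) -> torch.Tensor:
        # real, generated: (batch, feat_dim, frames)
        pair = torch.cat([real, generated], dim=1)  # judge the pair jointly
        return self.net(pair).mean(dim=(1, 2))      # one score per pair

d = PairDiscriminator()
scores = d(torch.randn(4, 80, 100), torch.randn(4, 80, 100))  # shape (4,)
```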
arXiv Detail & Related papers (2020-07-25T13:50:00Z)