Generative Modeling for Low Dimensional Speech Attributes with Neural
Spline Flows
- URL: http://arxiv.org/abs/2203.01786v1
- Date: Thu, 3 Mar 2022 15:58:08 GMT
- Title: Generative Modeling for Low Dimensional Speech Attributes with Neural
Spline Flows
- Authors: Kevin J. Shih, Rafael Valle, Rohan Badlani, João Felipe Santos,
Bryan Catanzaro
- Abstract summary: Pitch information is not only low-dimensional, but also discontinuous, making it particularly difficult to model in a generative setting.
We find this problem to be very well suited to Neural Spline Flows, which are a highly expressive alternative to the more common affine-coupling mechanism in Normalizing Flows.
- Score: 22.78165635389179
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Despite recent advances in generative modeling for text-to-speech
synthesis, these models do not yet have the same fine-grained adjustability as
pitch-conditioned deterministic models such as FastPitch and FastSpeech2.
Pitch information is not only low-dimensional but also discontinuous, making
it particularly difficult to model in a generative setting. Our work explores
several techniques for handling the aforementioned issues in the context of
Normalizing Flow models. We also find this problem to be very well suited to
Neural Spline Flows, which are a highly expressive alternative to the more
common affine-coupling mechanism in Normalizing Flows.
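As a rough illustration of the coupling-layer contrast the abstract draws, below is a minimal PyTorch sketch of an affine coupling layer next to a monotonic piecewise-linear spline coupling. This is a sketch under stated assumptions, not the paper's implementation: the paper uses monotonic rational-quadratic splines, for which the simpler piecewise-linear variant stands in here, and the class names and small conditioner networks are illustrative.

    # Minimal sketch contrasting the two coupling types (PyTorch assumed).
    import torch
    import torch.nn as nn

    class AffineCoupling(nn.Module):
        """y2 = x2 * exp(s(x1)) + t(x1); the first half x1 passes through."""
        def __init__(self, dim, hidden=64):
            super().__init__()
            self.d = dim // 2
            self.net = nn.Sequential(
                nn.Linear(self.d, hidden), nn.ReLU(),
                nn.Linear(hidden, 2 * (dim - self.d)))

        def forward(self, x):
            x1, x2 = x[:, :self.d], x[:, self.d:]
            s, t = self.net(x1).chunk(2, dim=-1)
            s = torch.tanh(s)                        # keep scales well-behaved
            y2 = x2 * torch.exp(s) + t
            return torch.cat([x1, y2], -1), s.sum(-1)   # output, log|det J|

    class PiecewiseLinearCoupling(nn.Module):
        """Stand-in for a spline coupling: x2 (assumed in [0, 1)) is pushed
        through a monotone piecewise-linear CDF whose bin masses depend on x1."""
        def __init__(self, dim, bins=8, hidden=64):
            super().__init__()
            self.d, self.bins = dim // 2, bins
            self.net = nn.Sequential(
                nn.Linear(self.d, hidden), nn.ReLU(),
                nn.Linear(hidden, (dim - self.d) * bins))

        def forward(self, x):
            x1, x2 = x[:, :self.d], x[:, self.d:]
            mass = torch.softmax(
                self.net(x1).view(x2.shape[0], -1, self.bins), -1)
            cdf = torch.cat([torch.zeros_like(mass[..., :1]),
                             mass.cumsum(-1)], -1)   # CDF values at bin edges
            pos = x2.clamp(0, 1 - 1e-6) * self.bins
            idx = pos.long()
            frac = pos - idx.float()                 # position within the bin
            m = mass.gather(-1, idx.unsqueeze(-1)).squeeze(-1)
            y2 = cdf.gather(-1, idx.unsqueeze(-1)).squeeze(-1) + frac * m
            logdet = (m * self.bins).log().sum(-1)   # bin density = mass * K
            return torch.cat([x1, y2], -1), logdet

The structural difference is the point: given one half of the input, the affine layer can only scale and shift each remaining dimension, while the spline layer can reshape the conditional distribution bin by bin, which is better matched to sharp, discontinuous targets such as pitch.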
Related papers
- SMILE: Zero-Shot Sparse Mixture of Low-Rank Experts Construction From Pre-Trained Foundation Models [85.67096251281191]
We present an innovative approach to model fusion called zero-shot Sparse MIxture of Low-rank Experts (SMILE) construction.
SMILE allows for the upscaling of source models into an MoE model without extra data or further training.
We conduct extensive experiments across diverse scenarios, such as image classification and text generation tasks, using full fine-tuning and LoRA fine-tuning.
arXiv Detail & Related papers (2024-08-19T17:32:15Z) - Guided Flows for Generative Modeling and Decision Making [55.42634941614435]
We show that Guided Flows significantly improves the sample quality in conditional image generation and zero-shot text-to-speech synthesis.
Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, achieving a speedup in computation compared to diffusion models.
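Guided Flows applies classifier-free-style guidance to a flow's velocity field; a minimal sketch follows, assuming a model with an (x, t, cond) signature that returns a velocity and accepts cond=None for the unconditional case.

    # Guided velocity sampling sketch; `model` and its signature are assumptions.
    import torch

    def guided_velocity(model, x, t, cond, w=2.0):
        """Blend conditional and unconditional velocities with guidance weight w."""
        v_cond = model(x, t, cond)
        v_uncond = model(x, t, None)
        return v_uncond + w * (v_cond - v_uncond)

    def sample(model, x0, cond, steps=50, w=2.0):
        """Integrate dx/dt = v(x, t) with forward Euler from t=0 to t=1."""
        x, dt = x0, 1.0 / steps
        for i in range(steps):
            t = torch.full((x.shape[0],), i * dt, device=x.device)
            x = x + dt * guided_velocity(model, x, t, cond, w)
        return x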
arXiv Detail & Related papers (2023-11-22T15:07:59Z) - Generative Pre-training for Speech with Flow Matching [81.59952572752248]
We pre-trained a generative model, named SpeechFlow, on 60k hours of untranscribed speech with Flow Matching and masked conditions.
Experiment results show the pre-trained generative model can be fine-tuned with task-specific data to match or surpass existing expert models on speech enhancement, separation, and synthesis.
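For context, a generic conditional flow-matching training step looks like the sketch below; it is a minimal illustration, not SpeechFlow's exact recipe (its masked spectrogram conditioning is omitted), and the model(x_t, t) velocity-prediction signature is an assumption.

    # Minimal conditional flow-matching step.
    import torch

    def flow_matching_loss(model, x1):
        """Regress the velocity field onto the straight-line target x1 - x0."""
        b = x1.shape[0]
        x0 = torch.randn_like(x1)                    # noise endpoint of the path
        t = torch.rand(b, device=x1.device)
        t_ = t.view(b, *([1] * (x1.dim() - 1)))      # broadcast over feature dims
        x_t = (1 - t_) * x0 + t_ * x1                # point on the linear path
        target = x1 - x0                             # its (constant) velocity
        return ((model(x_t, t) - target) ** 2).mean()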
arXiv Detail & Related papers (2023-10-25T03:40:50Z) - Kernelised Normalising Flows [10.31916245015817]
Normalising Flows are non-parametric statistical models characterised by their dual capabilities of density estimation and generation.
We present Ferumal flow, a novel kernelised normalising flow paradigm that integrates kernels into the framework.
arXiv Detail & Related papers (2023-07-27T13:18:52Z) - Conditional Generation from Unconditional Diffusion Models using
Denoiser Representations [94.04631421741986]
We propose adapting pre-trained unconditional diffusion models to new conditions using the learned internal representations of the denoiser network.
We show that augmenting the Tiny ImageNet training set with synthetic images generated by our approach improves the classification accuracy of ResNet baselines by up to 8%.
arXiv Detail & Related papers (2023-06-02T20:09:57Z) - DiffusER: Discrete Diffusion via Edit-based Reconstruction [88.62707047517914]
DiffusER is an edit-based generative model for text based on denoising diffusion models.
It can rival autoregressive models on several tasks spanning machine translation, summarization, and style transfer.
It can also perform other varieties of generation that standard autoregressive models are not well-suited for.
arXiv Detail & Related papers (2022-10-30T16:55:23Z) - Distilling the Knowledge from Normalizing Flows [22.578033953780697]
Normalizing flows are a powerful class of generative models demonstrating strong performance in several speech and vision problems.
We propose a simple distillation approach and demonstrate its effectiveness on state-of-the-art conditional flow-based models for image super-resolution and speech synthesis.
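The summary does not spell out the objective, but a generic way to distill a flow-based teacher into a cheap feed-forward student is to match outputs on shared latents; the sketch below is a hypothetical illustration with made-up names and sizes, not necessarily the paper's method.

    # Hypothetical distillation step: the student mimics the teacher's
    # latent-to-sample mapping. All names and dimensions are illustrative.
    import torch

    def distill_step(teacher, student, cond, opt, batch=16, z_dim=80):
        z = torch.randn(batch, z_dim)           # latent shared by both models
        with torch.no_grad():
            y_teacher = teacher(z, cond)        # run the flow in the z -> x direction
        loss = torch.mean((student(z, cond) - y_teacher) ** 2)
        opt.zero_grad()
        loss.backward()
        opt.step()
        return loss.item()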
arXiv Detail & Related papers (2021-06-24T00:10:22Z) - Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z) - WaveNODE: A Continuous Normalizing Flow for Speech Synthesis [15.051929807285847]
We propose a novel generative model called WaveNODE which exploits a continuous normalizing flow for speech synthesis.
WaveNODE places no constraint on the function used for flow operation, thus allowing the usage of more flexible and complex functions.
We experimentally show that WaveNODE achieves performance comparable to conventional flow-based vocoders while using fewer parameters.
arXiv Detail & Related papers (2020-06-08T13:49:36Z)
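A continuous normalizing flow such as WaveNODE computes densities through the instantaneous change-of-variables formula, d log p / dt = -tr(df/dz). The sketch below is a generic illustration of that computation, not WaveNODE's architecture: forward Euler stands in for a real ODE solver, and the exact per-dimension trace loop is affordable only in low dimension and assumes f acts on batch elements independently.

    # Generic CNF log-density sketch via the instantaneous change of variables.
    import torch

    def cnf_log_prob(f, x, steps=20):
        """log p(x) for dz/dt = f(z, t): integrate from t=1 (data) to t=0
        (base), accumulating the Jacobian-trace term along the way."""
        z = x.clone()
        trace_integral = torch.zeros(x.shape[0], device=x.device)
        dt = 1.0 / steps
        for i in range(steps):
            t = 1.0 - i * dt
            with torch.enable_grad():
                z = z.detach().requires_grad_(True)
                dz = f(z, t)
                # exact tr(df/dz), one dimension at a time (fine in low dim)
                tr = sum(
                    torch.autograd.grad(dz[:, j].sum(), z,
                                        retain_graph=True)[0][:, j]
                    for j in range(z.shape[1]))
            z = (z - dt * dz).detach()             # Euler step backward in t
            trace_integral = trace_integral + dt * tr
        base = torch.distributions.Normal(0.0, 1.0)
        # log p(x) = log p_base(z(0)) - integral of tr(df/dz) dt
        return base.log_prob(z).sum(-1) - trace_integral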
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.