Differentiable Wavetable Synthesis
- URL: http://arxiv.org/abs/2111.10003v2
- Date: Tue, 23 Nov 2021 17:10:38 GMT
- Title: Differentiable Wavetable Synthesis
- Authors: Siyuan Shan, Lamtharn Hantrakul, Jitong Chen, Matt Avent, David Trevelyan
- Abstract summary: Differentiable Wavetable Synthesis (DWTS) is a technique for neural audio synthesis which learns a dictionary of one-period waveforms.
We achieve high-fidelity audio synthesis with as few as 10 to 20 wavetables.
We show audio manipulations, such as high quality pitch-shifting, using only a few seconds of input audio.
- Score: 7.585969077788285
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Differentiable Wavetable Synthesis (DWTS) is a technique for neural audio
synthesis which learns a dictionary of one-period waveforms, i.e. wavetables,
through end-to-end training. We achieve high-fidelity audio synthesis with as
few as 10 to 20 wavetables and demonstrate how a data-driven dictionary of
waveforms opens up unprecedented one-shot learning paradigms on short audio
clips. Notably, we show audio manipulations, such as high quality
pitch-shifting, using only a few seconds of input audio. Lastly, we investigate
performance gains from using learned wavetables for realtime and interactive
audio synthesis.
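As a rough illustration of the core mechanism, the sketch below implements a differentiable wavetable oscillator in PyTorch: the dictionary of one-period waveforms is an ordinary learnable parameter, the oscillator reads it by accumulating phase at the target fundamental frequency with linear interpolation, and per-sample weights mix the tables into the output waveform. This is a minimal sketch under assumed choices (20 tables of 512 samples, a 16 kHz sample rate, softmax mixing weights), not the authors' implementation.

```python
import torch
import torch.nn as nn

class WavetableSynth(nn.Module):
    """Minimal differentiable wavetable oscillator: a learned dictionary of
    one-period waveforms mixed by per-sample weights (illustrative sketch)."""

    def __init__(self, n_tables: int = 20, table_len: int = 512, sample_rate: int = 16000):
        super().__init__()
        # The wavetable dictionary is an ordinary parameter, so gradients from
        # any loss on the output audio flow back into the waveforms themselves.
        self.wavetables = nn.Parameter(0.01 * torch.randn(n_tables, table_len))
        self.table_len = table_len
        self.sample_rate = sample_rate

    def forward(self, f0: torch.Tensor, weights: torch.Tensor) -> torch.Tensor:
        # f0:      (batch, time)           fundamental frequency in Hz
        # weights: (batch, time, n_tables) mixing weights, e.g. softmax outputs of an encoder
        phase = torch.cumsum(f0 / self.sample_rate, dim=-1) % 1.0   # phase in [0, 1)
        pos = phase * self.table_len
        idx0 = pos.long() % self.table_len
        idx1 = (idx0 + 1) % self.table_len
        frac = (pos - pos.floor()).unsqueeze(-1)                    # (batch, time, 1)
        # Read every wavetable with linear interpolation, then mix with the weights.
        w0 = self.wavetables[:, idx0].permute(1, 2, 0)              # (batch, time, n_tables)
        w1 = self.wavetables[:, idx1].permute(1, 2, 0)
        samples = (1.0 - frac) * w0 + frac * w1
        return (weights * samples).sum(dim=-1)                      # (batch, time) waveform


# Hypothetical usage: one second of a 220 Hz tone mixed from 20 tables.
synth = WavetableSynth()
f0 = torch.full((1, 16000), 220.0)
weights = torch.softmax(torch.randn(1, 16000, 20), dim=-1)
audio = synth(f0, weights)   # shape (1, 16000)
```

Because the wavetables are parameters, any loss on the synthesized audio backpropagates into the dictionary, which is what makes the waveforms learnable end-to-end; pitch-shifting then amounts to rerunning the oscillator with a scaled f0.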
Related papers
- Audio-visual video-to-speech synthesis with synthesized input audio [64.86087257004883]
We investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference.
In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech.
arXiv Detail & Related papers (2023-07-31T11:39:05Z) - Large-scale unsupervised audio pre-training for video-to-speech
synthesis [64.86087257004883]
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker.
In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz.
We then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task.
arXiv Detail & Related papers (2023-06-27T13:31:33Z) - RAVE: A variational autoencoder for fast and high-quality neural audio
synthesis [2.28438857884398]
We introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis.
We show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU.
arXiv Detail & Related papers (2021-11-09T09:07:30Z) - WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis [80.60577805727624]
WaveGrad 2 is a non-autoregressive generative model for text-to-speech synthesis.
It can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system.
arXiv Detail & Related papers (2021-06-17T17:09:21Z) - Continuous Wavelet Vocoder-based Decomposition of Parametric Speech
Waveform Synthesis [2.6572330982240935]
Speech technology systems have adopted the vocoder approach to synthesizing speech waveforms.
WaveNet is among the best such models, producing speech that closely resembles the human voice.
arXiv Detail & Related papers (2021-06-12T20:55:44Z) - End-to-End Video-To-Speech Synthesis using Generative Adversarial
Networks [54.43697805589634]
We propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs).
Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech.
We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID.
arXiv Detail & Related papers (2021-04-27T17:12:30Z) - Pretraining Strategies, Waveform Model Choice, and Acoustic
Configurations for Multi-Speaker End-to-End Speech Synthesis [47.30453049606897]
We find that fine-tuning a multi-speaker model from found audiobook data can improve naturalness and similarity to unseen target speakers of synthetic speech.
We also find that listeners can discern between a 16kHz and 24kHz sampling rate, and that WaveRNN produces output waveforms of a comparable quality to WaveNet.
arXiv Detail & Related papers (2020-11-10T00:19:04Z) - VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)