CAESynth: Real-Time Timbre Interpolation and Pitch Control with
Conditional Autoencoders
- URL: http://arxiv.org/abs/2111.05174v1
- Date: Tue, 9 Nov 2021 14:36:31 GMT
- Title: CAESynth: Real-Time Timbre Interpolation and Pitch Control with
Conditional Autoencoders
- Authors: Aaron Valero Puche and Sukhan Lee
- Abstract summary: CAESynth synthesizes timbre in real-time by interpolating the reference sounds in their shared latent feature space.
We show that training a conditional autoencoder based on accuracy in timbre classification, together with adversarial regularization of pitch content, allows the timbre distribution in the latent space to be more effective.
- Score: 3.0991538386316666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we present a novel audio synthesizer, CAESynth, based on a
conditional autoencoder. CAESynth synthesizes timbre in real-time by
interpolating the reference sounds in their shared latent feature space, while
controlling pitch independently. We show that training a conditional
autoencoder based on accuracy in timbre classification together with
adversarial regularization of pitch content allows the timbre distribution in
the latent space to be more effective and stable for timbre interpolation and pitch
conditioning. The proposed method is applicable not only to creation of musical
cues but also to exploration of audio affordance in mixed reality based on
novel timbre mixtures with environmental sounds. We demonstrate by experiments
that CAESynth achieves smooth and high-fidelity audio synthesis in real-time
through timbre interpolation and independent yet accurate pitch control for
musical cues as well as for audio affordance with environmental sound. A Python
implementation and some generated samples are shared online.
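To make the approach above concrete, here is a minimal sketch of a conditional autoencoder with a timbre-classification head, an adversarial pitch head, and latent interpolation for synthesis. It is not the authors' released implementation; the layer sizes, mel-spectrogram input format, and names (ConditionalAE, interpolate_timbre, N_MELS, N_PITCHES) are assumptions.

```python
# Minimal sketch (not the authors' code) of the idea above: an encoder yields a
# timbre latent, the decoder is conditioned on pitch, an auxiliary classifier
# keeps the latent timbre-discriminative, and an adversarial head is trained to
# squeeze pitch information out of the latent.
import torch
import torch.nn as nn

N_MELS, LATENT_DIM, N_PITCHES, N_TIMBRE_CLASSES = 128, 64, 88, 16


class ConditionalAE(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(N_MELS, 256), nn.ReLU(), nn.Linear(256, LATENT_DIM))
        # The decoder receives the timbre latent plus a one-hot pitch condition.
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM + N_PITCHES, 256), nn.ReLU(),
            nn.Linear(256, N_MELS))
        self.timbre_clf = nn.Linear(LATENT_DIM, N_TIMBRE_CLASSES)
        self.pitch_adv = nn.Linear(LATENT_DIM, N_PITCHES)

    def forward(self, mel, pitch_onehot):
        z = self.encoder(mel)
        recon = self.decoder(torch.cat([z, pitch_onehot], dim=-1))
        return recon, self.timbre_clf(z), self.pitch_adv(z)


def interpolate_timbre(model, mel_a, mel_b, alpha, pitch_onehot):
    """Blend two reference timbres in latent space at an arbitrary pitch."""
    with torch.no_grad():
        z = (1.0 - alpha) * model.encoder(mel_a) + alpha * model.encoder(mel_b)
        return model.decoder(torch.cat([z, pitch_onehot], dim=-1))
```

At training time the reconstruction, timbre-classification, and (negated) adversarial pitch losses would be combined; at synthesis time only the encode-interpolate-decode path runs, which is consistent with the real-time claim above.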
Related papers
- Wavetable Synthesis Using CVAE for Timbre Control Based on Semantic Label [2.0124254762298794]
This research introduces a method of timbre control in wavetable synthesis that is intuitive and sensible.
Using a conditional variational autoencoder (CVAE), users can select a wavetable and define the timbre with labels such as bright, warm, and rich.
arXiv Detail & Related papers (2024-10-24T10:37:54Z)
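As a rough illustration of the label-conditioned decoding idea in the wavetable entry above, the sketch below feeds a latent code plus a soft semantic-label vector into a decoder that emits a single wavetable cycle; the label set, wavetable length, and all names are assumptions, not the paper's specification.

```python
# Hedged sketch of label-conditioned wavetable decoding (a CVAE decoder half).
import torch
import torch.nn as nn

LABELS = ["bright", "warm", "rich"]   # assumed semantic conditions
WAVETABLE_LEN, LATENT_DIM = 2048, 32


class WavetableDecoder(nn.Module):
    """Maps a latent code plus a soft semantic-label vector to one wavetable cycle."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + len(LABELS), 512), nn.ReLU(),
            nn.Linear(512, WAVETABLE_LEN), nn.Tanh())

    def forward(self, z, label_weights):
        return self.net(torch.cat([z, label_weights], dim=-1))


# A user could ask for "70% bright, 30% warm" as a soft condition vector:
decoder = WavetableDecoder()
wavetable = decoder(torch.randn(1, LATENT_DIM), torch.tensor([[0.7, 0.3, 0.0]]))
```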
- wav2pos: Sound Source Localization using Masked Autoencoders [12.306126455995603]
We present a novel approach to the 3D sound source localization task for distributed ad-hoc microphone arrays by formulating it as a set-to-set regression problem.
We show that such a formulation allows for accurate localization of the sound source, by reconstructing coordinates masked in the input.
arXiv Detail & Related papers (2024-08-28T13:09:20Z)
- Bass Accompaniment Generation via Latent Diffusion [0.0]
We present a controllable system for generating single stems to accompany musical mixes of arbitrary length.
At the core of our method are audio autoencoders that efficiently compress audio waveform samples into invertible latent representations.
Our controllable conditional audio generation framework represents a significant step forward in creating generative AI tools to assist musicians in music production.
arXiv Detail & Related papers (2024-02-02T13:44:47Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker singing voice synthesis system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z)
- Synthesizer Preset Interpolation using Transformer Auto-Encoders [4.213427823201119]
We introduce a bimodal auto-encoder neural network, which simultaneously processes presets using multi-head attention blocks, and audio using convolutions.
This model has been tested on a popular frequency modulation synthesizer with more than one hundred parameters.
After training, the proposed model can be integrated into commercial synthesizers for live or sound design tasks.
arXiv Detail & Related papers (2022-10-27T15:20:18Z)
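A toy, self-contained version of the latent-space preset morphing described in the entry above is sketched below; the tiny MLP auto-encoder, the 144-parameter preset size, and the names are placeholders, whereas the paper's actual model is a bimodal transformer/convolutional auto-encoder.

```python
# Toy sketch of preset morphing via straight-line interpolation in latent space.
import torch
import torch.nn as nn

N_PARAMS, LATENT = 144, 16  # assumed synthesizer parameter count / latent size

encoder = nn.Sequential(nn.Linear(N_PARAMS, 64), nn.ReLU(), nn.Linear(64, LATENT))
decoder = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, N_PARAMS))


def morph(preset_a, preset_b, steps=9):
    """Return presets along a straight line in latent space, from A to B."""
    za, zb = encoder(preset_a), encoder(preset_b)
    t = torch.linspace(0.0, 1.0, steps).unsqueeze(-1)
    return decoder((1.0 - t) * za + t * zb)


with torch.no_grad():
    path = morph(torch.rand(N_PARAMS), torch.rand(N_PARAMS))  # (9, N_PARAMS)
```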
- Using multiple reference audios and style embedding constraints for speech synthesis [68.62945852651383]
The proposed model can improve the speech naturalness and content quality with multiple reference audios.
The model can also outperform the baseline model in ABX preference tests of style similarity.
arXiv Detail & Related papers (2021-10-09T04:24:29Z)
- Neural Waveshaping Synthesis [0.0]
We present a novel, lightweight, fully causal approach to neural audio synthesis.
The Neural Waveshaping Unit (NEWT) operates directly in the waveform domain.
It produces complex timbral evolutions by simple affine transformations of its input and output signals.
arXiv Detail & Related papers (2021-07-11T13:50:59Z)
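A rough sketch of a waveshaping unit in the spirit of the entry above: a small learned memoryless nonlinearity wrapped in time-varying input and output affine transforms. The control network that would produce the affine coefficients is omitted, and every size and name here is an assumption rather than the paper's exact design.

```python
# Sketch of a learned waveshaper with input/output affine transforms.
import math
import torch
import torch.nn as nn


class WaveshapingUnit(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        # Small MLP acting as a static per-sample waveshaper.
        self.shaper = nn.Sequential(
            nn.Linear(1, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, exciter, a_in, b_in, a_out, b_out):
        # exciter: (batch, samples) excitation; a_*, b_*: same-shaped affines.
        x = (a_in * exciter + b_in).unsqueeze(-1)
        return self.shaper(x).squeeze(-1) * a_out + b_out


t = torch.arange(16000) / 16000.0
exciter = torch.sin(2 * math.pi * 220.0 * t).unsqueeze(0)
ones = torch.ones_like(exciter)
out = WaveshapingUnit()(exciter, 2.0 * ones, 0.0 * ones, 0.5 * ones, 0.0 * ones)
```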
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
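The discrete-latent bottleneck in the entry above can be illustrated with the short vector-quantization sketch below, where continuous encoder outputs snap to their nearest codebook vectors; the loudness disentanglement step (e.g. normalizing frame energy before encoding) is left out, and codebook size, dimensions, and names are assumptions.

```python
# Compact sketch of a vector-quantized latent bottleneck.
import torch
import torch.nn as nn


class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=256, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)

    def forward(self, z):
        # z: (batch, dim) continuous latents -> nearest codebook entries.
        dists = torch.cdist(z, self.codebook.weight)   # (batch, n_codes)
        idx = dists.argmin(dim=-1)
        zq = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder.
        return z + (zq - z).detach(), idx


vq = VectorQuantizer()
z_quantized, codes = vq(torch.randn(4, 64))
```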
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences.