Vector-Quantized Timbre Representation
- URL: http://arxiv.org/abs/2007.06349v1
- Date: Mon, 13 Jul 2020 12:35:45 GMT
- Title: Vector-Quantized Timbre Representation
- Authors: Adrien Bitton, Philippe Esling, Tatsuya Harada
- Abstract summary: This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
- Score: 53.828476137089325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Timbre is a set of perceptual attributes that identifies different types of
sound sources. Although its definition is usually elusive, it can be seen from
a signal processing viewpoint as all the spectral features that are perceived
independently from pitch and loudness. Some works have studied high-level
timbre synthesis by analyzing the feature relationships of different
instruments, but acoustic properties remain entangled and generation bound to
individual sounds. This paper targets a more flexible synthesis of an
individual timbre by learning an approximate decomposition of its spectral
properties with a set of generative features. We introduce an auto-encoder with
a discrete latent space that is disentangled from loudness in order to learn a
quantized representation of a given timbre distribution. Timbre transfer can be
performed by encoding any variable-length input signals into the quantized
latent features that are decoded according to the learned timbre. We detail
results for translating audio between orchestral instruments and singing voice,
as well as transfers from vocal imitations to instruments as an intuitive
modality to drive sound synthesis. Furthermore, we can map the discrete latent
space to acoustic descriptors and directly perform descriptor-based synthesis.
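The quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' model: the learned encoder, decoder, and loudness disentanglement are omitted, and the function names `quantize` and `decode`, the two-dimensional frames, and the three-entry codebook are all hypothetical values chosen for illustration. It shows only the general idea of mapping each frame of a variable-length sequence to its nearest entry in a discrete codebook.

```python
import math


def quantize(frames, codebook):
    """Return, for each frame, the index of the nearest codebook vector.

    Nearest is measured by Euclidean distance. The input may have any
    number of frames; the output has one index per frame.
    """
    indices = []
    for frame in frames:
        distances = [math.dist(frame, code) for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices


def decode(indices, codebook):
    """Look up the codebook vector for each index (a stand-in for a decoder input)."""
    return [codebook[i] for i in indices]


# Hypothetical codebook and encoder outputs, for illustration only.
codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [0.2, 1.7], [0.8, 0.9]]

indices = quantize(frames, codebook)
quantized = decode(indices, codebook)
print(indices)    # [0, 1, 2, 1]
print(quantized)  # [[0.0, 0.0], [1.0, 1.0], [0.0, 2.0], [1.0, 1.0]]
```

Because every input frame is replaced by a codebook entry, frames from any source end up expressed in the same discrete vocabulary, which is the property the abstract relies on for timbre transfer.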
Related papers
- Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation [52.0893266767733]
We propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features.
To enhance the model's robustness to different synthesizer characteristics, we propose a synthesizer feature augmentation strategy.
arXiv Detail & Related papers (2024-11-14T03:57:21Z) - Real-time Timbre Remapping with Differentiable DSP [1.3803836644947054]
Timbre is a primary mode of expression in diverse musical contexts.
Our approach draws on the concept of timbre analogies.
We demonstrate real-time timbre remapping from acoustic snare drums to a differentiable synthesizer modeled after the Roland TR-808.
arXiv Detail & Related papers (2024-07-05T14:32:52Z) - Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z) - MIDI-DDSP: Detailed Control of Musical Performance via Hierarchical Modeling [6.256118777336895]
Musical expression requires control of both what notes are played, and how they are performed.
We introduce MIDI-DDSP, a hierarchical model of musical instruments that enables both realistic neural audio synthesis and detailed user control.
We demonstrate that this hierarchy can reconstruct high-fidelity audio, accurately predict performance attributes for a note sequence, independently manipulate the attributes of a given performance, and as a complete system, generate realistic audio from a novel note sequence.
arXiv Detail & Related papers (2021-12-17T04:15:42Z) - A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis [13.263771543118994]
We propose a unified model for three inter-related tasks: 1) to separate individual sound sources from a mixed music audio, 2) to transcribe each sound source to MIDI notes, and 3) to synthesize new pieces based on the timbre of separated sources.
The model is inspired by the fact that when humans listen to music, our minds can not only separate the sounds of different instruments, but also at the same time perceive high-level representations such as score and timbre.
arXiv Detail & Related papers (2021-08-07T14:28:21Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - HpRNet : Incorporating Residual Noise Modeling for Violin in a Variational Parametric Synthesizer [11.4219428942199]
We introduce a dataset of Carnatic Violin Recordings where bow noise is an integral part of the playing style of higher pitched notes.
We obtain insights about each of the harmonic and residual components of the signal, as well as their interdependence.
arXiv Detail & Related papers (2020-08-19T12:48:32Z) - Neural Granular Sound Synthesis [53.828476137089325]
Granular sound synthesis is a popular audio generation technique based on rearranging sequences of small waveform windows.
We show that generative neural networks can implement granular synthesis while alleviating most of its shortcomings.
arXiv Detail & Related papers (2020-08-04T08:08:00Z) - Timbre latent space: exploration and creative aspects [1.3764085113103222]
Recent studies show the ability of unsupervised models to learn invertible audio representations using Auto-Encoders.
New possibilities for timbre manipulations are enabled with generative neural networks.
arXiv Detail & Related papers (2020-08-04T07:08:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.