Mel Spectrogram Inversion with Stable Pitch
- URL: http://arxiv.org/abs/2208.12782v1
- Date: Fri, 26 Aug 2022 17:01:57 GMT
- Title: Mel Spectrogram Inversion with Stable Pitch
- Authors: Bruno Di Giorgi, Mark Levy, Richard Sharp
- Abstract summary: Vocoders are models capable of transforming a low-dimensional spectral representation of an audio signal, typically the mel spectrogram, to a waveform.
Recent vocoder models developed for speech achieve a high degree of realism.
Compared to speech, the structure of the musical sound texture offers new challenges.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vocoders are models capable of transforming a low-dimensional spectral
representation of an audio signal, typically the mel spectrogram, to a
waveform. Modern speech generation pipelines use a vocoder as their final
component. Recent vocoder models developed for speech achieve a high degree of
realism, such that it is natural to wonder how they would perform on music
signals. Compared to speech, the heterogeneity and structure of the musical
sound texture offer new challenges. In this work we focus on one specific
artifact that some vocoder models designed for speech tend to exhibit when
applied to music: the perceived instability of pitch when synthesizing
sustained notes. We argue that the characteristic sound of this artifact is due
to the lack of horizontal phase coherence, which is often the result of using a
time-domain target space with a model that is invariant to time-shifts, such as
a convolutional neural network. We propose a new vocoder model that is
specifically designed for music. Key to improving the pitch stability is the
choice of a shift-invariant target space that consists of the magnitude
spectrum and the phase gradient. We discuss the reasons that inspired us to
re-formulate the vocoder task, outline a working example, and evaluate it on
musical signals. Evaluated with a novel harmonic error metric, our method
improves the reconstruction of sustained notes and chords by 60% and 10%,
respectively, relative to existing models.
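To make the target space concrete, here is a minimal sketch (not the paper's implementation; the file name and STFT parameters are placeholder assumptions) of computing the magnitude spectrum and the time-direction component of the phase gradient from an STFT:

```python
import numpy as np
import librosa

# Load a short clip of a sustained note (the file name is a placeholder).
y, sr = librosa.load("sustained_note.wav", sr=22050, mono=True)

n_fft, hop = 1024, 256
stft = librosa.stft(y, n_fft=n_fft, hop_length=hop)

# First component of the target space: the magnitude spectrum.
magnitude = np.abs(stft)

# Second component: the phase gradient along time. Unwrap the phase per
# frequency bin, then take frame-to-frame differences; this approximates
# the time derivative of phase, i.e. the instantaneous frequency per bin.
phase = np.unwrap(np.angle(stft), axis=1)
phase_grad_time = np.diff(phase, axis=1) / (hop / sr)  # radians per second
```

For quasi-stationary sinusoidal components, a global time shift of the waveform adds a per-bin phase offset that is constant across frames, so it cancels in the frame-to-frame difference; this is the sense in which such targets are shift-invariant.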
Related papers
- PerTok: Expressive Encoding and Modeling of Symbolic Musical Ideas and Variations [0.3683202928838613]
Cadenza is a new multi-stage generative framework for predicting expressive variations of symbolic musical ideas.
The proposed framework comprises two sequential stages: 1) Composer and 2) Performer.
Our framework is designed, researched and implemented with the objective of providing inspiration for musicians.
arXiv Detail & Related papers (2024-10-02T22:11:31Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controllability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction [36.45127093978295]
We propose an input feature modification and a training objective modification based on two assumptions.
To enhance the model's sensitivity to the trailing harmonics, we modify the Combined Frequency and Periodicity representation using the discrete z-transform.
We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, modified from a piano transcription network.
arXiv Detail & Related papers (2023-08-04T21:59:40Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in a zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform synthesis (see the sketch below).
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
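For intuition on the periodic nonlinearity, here is a minimal sketch of a snake-style activation (an illustration, not BigVGAN's full generator; the anti-aliased up/downsampling around the activation is omitted):

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Periodic activation f(x) = x + (1/alpha) * sin^2(alpha * x)
    with a learnable per-channel frequency alpha."""

    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, channels, time).
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)

# Usage: act = Snake(64); y = act(torch.randn(2, 64, 100))
```

The sine term gives the generator a built-in bias toward periodic (harmonic) structure while the identity term preserves gradient flow.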
arXiv Detail & Related papers (2022-06-09T17:56:10Z)
- Deep Performer: Score-to-Audio Music Performance Synthesis [30.95307878579825]
Deep Performer is a novel system for score-to-audio music performance synthesis.
Unlike speech, music often contains polyphony and long notes.
We show that our proposed model can synthesize music with clear polyphony and harmonic structures.
arXiv Detail & Related papers (2022-02-12T10:36:52Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
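To unpack "parameterized Markov chain": at inference, each reverse step refines a noisy mel-spectrogram estimate. A generic DDPM denoising step looks roughly like this (standard DDPM math as a sketch, not DiffSinger's exact shallow-diffusion variant):

```python
import torch

def ddpm_reverse_step(x_t, t, eps_pred, alphas, alphas_cumprod):
    """One generic DDPM denoising step. `eps_pred` is the network's
    noise estimate for x_t, conditioned on the music score."""
    a_t, ac_t = alphas[t], alphas_cumprod[t]
    # Posterior mean of x_{t-1} given x_t and the predicted noise.
    mean = (x_t - (1.0 - a_t) / torch.sqrt(1.0 - ac_t) * eps_pred) / torch.sqrt(a_t)
    if t == 0:
        return mean  # the final step is deterministic
    # One common variance choice: sigma_t^2 = beta_t = 1 - alpha_t.
    return mean + torch.sqrt(1.0 - a_t) * torch.randn_like(x_t)
```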
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
- Autoencoding Neural Networks as Musical Audio Synthesizers [0.0]
A method for musical audio synthesis using autoencoding neural networks is proposed.
The autoencoder is trained to compress and reconstruct magnitude short-time Fourier transform frames.
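For intuition, a minimal sketch of such a frame autoencoder (layer sizes, activations and the loss are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Compress one magnitude-STFT frame to a small latent code and
# reconstruct it, training with an MSE reconstruction loss.
frame_dim, latent_dim = 513, 8
model = nn.Sequential(
    nn.Linear(frame_dim, 128), nn.ReLU(),
    nn.Linear(128, latent_dim), nn.ReLU(),   # encoder -> latent code
    nn.Linear(latent_dim, 128), nn.ReLU(),
    nn.Linear(128, frame_dim), nn.ReLU(),    # magnitudes are non-negative
)
frames = torch.rand(32, frame_dim)           # a batch of magnitude frames
loss = F.mse_loss(model(frames), frames)
loss.backward()
```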
arXiv Detail & Related papers (2020-04-27T20:58:03Z)