An investigation of the reconstruction capacity of stacked convolutional
autoencoders for log-mel-spectrograms
- URL: http://arxiv.org/abs/2301.07665v1
- Date: Wed, 18 Jan 2023 17:19:04 GMT
- Title: An investigation of the reconstruction capacity of stacked convolutional
autoencoders for log-mel-spectrograms
- Authors: Anastasia Natsiou, Luca Longo, Sean O'Leary
- Abstract summary: In audio processing applications, there is high demand for the generation of expressive sounds from high-level representations.
Modern algorithms, such as neural networks, have inspired the development of expressive synthesizers based on musical instrument timbre compression.
This study investigates the use of stacked convolutional autoencoders for the compression of time-frequency audio representations of a variety of instruments at a single pitch.
- Score: 2.3204178451683264
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In audio processing applications, the generation of expressive sounds based on high-level representations is in high demand. These representations
can be used to manipulate the timbre and influence the synthesis of creative
instrumental notes. Modern algorithms, such as neural networks, have inspired
the development of expressive synthesizers based on musical instrument timbre
compression. Unsupervised deep learning methods can achieve audio compression
by training the network to learn a mapping from waveforms or spectrograms to
low-dimensional representations. This study investigates the use of stacked
convolutional autoencoders for the compression of time-frequency audio
representations of a variety of instruments at a single pitch. Hyper-parameters and regularization techniques are then explored to enhance the performance of the initial design. In an unsupervised manner,
the network is able to reconstruct a monophonic and harmonic sound based on
latent representations. In addition, we introduce an evaluation metric to
measure the similarity between the original and reconstructed samples.
Evaluating a deep generative model for the synthesis of sound is a challenging
task. Our approach is based on the accuracy of the generated frequencies, since frequency accuracy is a significant cue for the perception of harmonic sounds. This work
is expected to accelerate future experiments on audio compression using neural
autoencoders.
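
As a concrete illustration of the compression described above, the following is a minimal sketch of a stacked convolutional autoencoder reconstructing a log-mel spectrogram. It assumes PyTorch and librosa; the file name `note.wav`, the layer sizes, and the training hyper-parameters are illustrative placeholders rather than the configuration used in the paper.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

# Log-mel spectrogram of a single monophonic note (placeholder file name).
y, sr = librosa.load("note.wav", sr=16000, mono=True)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=128)
log_mel = librosa.power_to_db(mel, ref=np.max)            # shape (128, T)

class StackedConvAE(nn.Module):
    """Three stacked Conv2d stages halve the time-frequency resolution;
    the decoder mirrors them with transposed convolutions."""
    def __init__(self, latent_channels=8):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_channels, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(latent_channels, 32, 3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 3, stride=2,
                               padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2,
                               padding=1, output_padding=1),
        )

    def forward(self, x):
        z = self.encoder(x)        # low-dimensional latent representation
        return self.decoder(z), z

# Crop the time axis to a multiple of 8 so the strided stages invert exactly,
# then normalise so the MSE loss is well scaled.
frames = (log_mel.shape[1] // 8) * 8
x = torch.from_numpy(log_mel[:, :frames]).float()[None, None]
x = (x - x.mean()) / (x.std() + 1e-5)

model = StackedConvAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(200):            # toy single-sample reconstruction loop
    reconstruction, _ = model(x)
    loss = nn.functional.mse_loss(reconstruction, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Each strided convolution halves the time and frequency resolution, so three stacked stages compress the spectrogram by a factor of 8 per axis before the mirrored decoder reconstructs it.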
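The frequency-based evaluation metric is described only at a high level in the abstract. The sketch below is one hedged interpretation: it scores how often the strongest mel bins, a proxy for the generated frequencies, coincide between the original and reconstructed spectrograms. The function name `peak_frequency_agreement` and the top-k peak count are assumptions made for illustration, not the paper's exact formulation.

```python
import numpy as np

def peak_frequency_agreement(original, reconstructed, k=5):
    """Fraction of dominant mel bins shared per frame between two
    log-mel arrays of shape (n_mels, n_frames); an assumed stand-in
    for the paper's frequency-accuracy metric."""
    score = 0.0
    for t in range(original.shape[1]):
        top_orig = set(np.argsort(original[:, t])[-k:])
        top_rec = set(np.argsort(reconstructed[:, t])[-k:])
        score += len(top_orig & top_rec) / k   # overlap of strongest bins
    return score / original.shape[1]

# Example usage after training the autoencoder above:
# agreement = peak_frequency_agreement(
#     x.numpy()[0, 0], reconstruction.detach().numpy()[0, 0])
```
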
Related papers
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z) - High Fidelity Neural Audio Compression [92.4812002532009]
We introduce a state-of-the-art real-time, high-fidelity audio codec leveraging neural networks.
It consists of a streaming encoder-decoder architecture with a quantized latent space trained in an end-to-end fashion.
We simplify and speed up the training by using a single multiscale spectrogram adversary.
arXiv Detail & Related papers (2022-10-24T17:52:02Z) - Multi-instrument Music Synthesis with Spectrogram Diffusion [19.81982315173444]
We focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in real time.
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter.
We find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
arXiv Detail & Related papers (2022-06-11T03:26:15Z) - SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with
Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
The noise shaping is processed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z) - Neural Waveshaping Synthesis [0.0]
We present a novel, lightweight, fully causal approach to neural audio synthesis.
The Neural Waveshaping Unit (NEWT) operates directly in the waveform domain.
It produces complex timbral evolutions by simple affine transformations of its input and output signals.
arXiv Detail & Related papers (2021-07-11T13:50:59Z) - Training a Deep Neural Network via Policy Gradients for Blind Source
Separation in Polyphonic Music Recordings [1.933681537640272]
We propose a method for the blind separation of sounds of musical instruments in audio signals.
We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics.
Our algorithm yields high-quality results with particularly low interference on a variety of different audio samples.
arXiv Detail & Related papers (2021-07-09T06:17:04Z) - Deep Convolutional and Recurrent Networks for Polyphonic Instrument
Classification from Monophonic Raw Audio Waveforms [30.3491261167433]
Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals such as spectrograms.
Deep neural networks acting as efficient feature extractors have enabled the direct use of audio signals for classification purposes.
We attempt to recognize musical instruments in polyphonic audio by only feeding their raw waveforms into deep learning models.
arXiv Detail & Related papers (2021-02-13T13:44:46Z) - Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z) - Timbre latent space: exploration and creative aspects [1.3764085113103222]
Recent studies show the ability of unsupervised models to learn invertible audio representations using Auto-Encoders.
Generative neural networks enable new possibilities for timbre manipulation.
arXiv Detail & Related papers (2020-08-04T07:08:04Z) - Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z) - VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)