Spectrogram Inpainting for Interactive Generation of Instrument Sounds
- URL: http://arxiv.org/abs/2104.07519v1
- Date: Thu, 15 Apr 2021 15:17:31 GMT
- Title: Spectrogram Inpainting for Interactive Generation of Instrument Sounds
- Authors: Théis Bazin and Gaëtan Hadjeres and Philippe Esling and Mikhail Malt
- Abstract summary: We cast the generation of individual instrumental notes as an inpainting-based task, introducing novel and unique ways to iteratively shape sounds.
Most crucially, we open-source an interactive web interface to transform sounds by inpainting, for artists and practitioners alike, opening up new, creative uses.
- Score: 1.7205106391379026
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern approaches to sound synthesis using deep neural networks are hard to
control, especially when fine-grained conditioning information is not
available, hindering their adoption by musicians.
In this paper, we cast the generation of individual instrumental notes as an
inpainting-based task, introducing novel and unique ways to iteratively shape
sounds. To this end, we propose a two-step approach: first, we adapt the
VQ-VAE-2 image generation architecture to spectrograms in order to convert
real-valued spectrograms into compact discrete codemaps; we then implement
token-masked Transformers for the inpainting-based generation of these
codemaps.
We apply the proposed architecture to the NSynth dataset on masked resampling
tasks. Most crucially, we open-source an interactive web interface to transform
sounds by inpainting, for artists and practitioners alike, opening up new,
creative uses.
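As a rough illustration of the two-step approach above, here is a minimal, hypothetical PyTorch sketch: a convolutional encoder vector-quantizes a spectrogram into a discrete codemap, and a token-masked Transformer predicts codes at user-masked positions. All module names, shapes, and hyperparameters are illustrative assumptions, not the authors' released code.

```python
# Hypothetical sketch of the paper's two-step pipeline: (1) vector-quantize a
# spectrogram into a discrete codemap, (2) inpaint masked codemap tokens with
# a Transformer. Shapes and sizes are illustrative assumptions.
import torch
import torch.nn as nn

class VQEncoder(nn.Module):
    """Downsample a spectrogram and snap features to a discrete codebook."""
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.conv = nn.Sequential(              # 1 x F x T -> dim x F/4 x T/4
            nn.Conv2d(1, dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(dim, dim, 4, stride=2, padding=1),
        )
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, spec):                    # spec: (B, 1, F, T)
        z = self.conv(spec)                     # (B, dim, F', T')
        flat = z.permute(0, 2, 3, 1)            # (B, F', T', dim)
        # Nearest-codeword lookup: the vector-quantization step.
        d = torch.cdist(flat.reshape(-1, z.shape[1]), self.codebook.weight)
        return d.argmin(dim=1).reshape(flat.shape[:3])    # integer codemap

class MaskedCodeTransformer(nn.Module):
    """Predict code tokens at masked positions (inpainting)."""
    def __init__(self, codebook_size=512, dim=64):
        super().__init__()
        self.mask_id = codebook_size            # extra "[MASK]" token id
        self.embed = nn.Embedding(codebook_size + 1, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, codebook_size)

    def forward(self, codes, mask):             # codes: (B, N), mask: (B, N) bool
        tokens = codes.masked_fill(mask, self.mask_id)
        return self.head(self.body(self.embed(tokens)))   # (B, N, codebook_size)

# Toy usage: encode a fake spectrogram, mask a region, resample it.
enc, inpainter = VQEncoder(), MaskedCodeTransformer()
codes = enc(torch.randn(1, 1, 64, 64)).flatten(1)         # (1, N)
mask = torch.zeros_like(codes, dtype=torch.bool)
mask[:, 50:80] = True                                      # user-selected region
codes[mask] = inpainter(codes, mask).argmax(-1)[mask]      # fill masked tokens
```

In the actual system, the predicted codes would be decoded back to a spectrogram by the VQ-VAE-2 decoder and then to audio; positional embeddings and training details are omitted here.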
Related papers
- Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting [0.0]
This study explores a user-friendly graphical interface enabling the drawing of masked regions for inpainting by an Hourglass Diffusion Transformer (HDiT) model trained on MIDI piano roll images.
We demonstrate that, in addition to inpainting of melodies, accompaniment, and continuations, the use of repainting can help increase note density, yielding musical structures that closely match user specifications.
arXiv Detail & Related papers (2024-07-01T17:43:45Z)
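A minimal sketch of the masked-region idea from the Pictures Of MIDI entry above, assuming a binary piano-roll array and a hypothetical rectangle-selection helper; the HDiT model itself is not shown.

```python
# Hypothetical sketch: a user-drawn rectangle on a piano-roll image marks
# where a diffusion model should inpaint notes. Array layout is an assumption.
import numpy as np

def rectangle_mask(roll, pitch_lo, pitch_hi, t_lo, t_hi):
    """Return a boolean mask that is True inside the user-drawn rectangle."""
    mask = np.zeros_like(roll, dtype=bool)
    mask[pitch_lo:pitch_hi, t_lo:t_hi] = True
    return mask

roll = np.random.rand(128, 256) > 0.97      # toy piano roll: 128 pitches x 256 steps
mask = rectangle_mask(roll, 60, 72, 100, 160)
masked = roll & ~mask                       # clear the region to be inpainted
# `masked` and `mask` would then be passed to the diffusion model, which
# regenerates note content only inside the masked rectangle.
```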
- Mesh Denoising Transformer [104.5404564075393]
Mesh denoising is aimed at removing noise from input meshes while preserving their feature structures.
SurfaceFormer is a pioneering Transformer-based mesh denoising framework.
A new representation known as the Local Surface Descriptor captures local geometric intricacies.
A Denoising Transformer module receives the multimodal information and achieves efficient global feature aggregation.
arXiv Detail & Related papers (2024-05-10T15:27:43Z)
- Adaptive Human Matting for Dynamic Videos [62.026375402656754]
Adaptive Matting for Dynamic Videos, termed AdaM, is a framework for simultaneously differentiating foregrounds from backgrounds.
Two interconnected network designs are employed to achieve this goal.
We benchmark and study our methods on recently introduced datasets, showing that our matting achieves new best-in-class generalizability.
arXiv Detail & Related papers (2023-04-12T17:55:59Z)
- TimeMAE: Self-Supervised Representations of Time Series with Decoupled Masked Autoencoders [55.00904795497786]
We propose TimeMAE, a novel self-supervised paradigm for learning transferable time series representations based on transformer networks.
The TimeMAE learns enriched contextual representations of time series with a bidirectional encoding scheme.
To solve the discrepancy issue incurred by newly injected masked embeddings, we design a decoupled autoencoder architecture.
arXiv Detail & Related papers (2023-03-01T08:33:16Z)
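A minimal sketch of the window-masking step suggested by the TimeMAE summary above; the window size, mask ratio, and zero-fill convention are assumptions for illustration.

```python
# Hypothetical sketch: slice a time series into windows ("tokens"), hide a
# random subset, and train a model to reconstruct the hidden ones.
import torch

def mask_windows(series, window=16, mask_ratio=0.6):
    """Split (B, T) series into (B, N, window) tokens and mask a random subset."""
    B, T = series.shape
    tokens = series.reshape(B, T // window, window)       # non-overlapping windows
    mask = torch.rand(B, tokens.shape[1]) < mask_ratio    # True = hidden token
    visible = tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    return tokens, visible, mask

series = torch.randn(8, 256)                              # toy batch of series
targets, visible, mask = mask_windows(series)
# A decoupled design, as the summary describes, would encode `visible` with
# one network and use a separate decoder plus learned mask embeddings to
# predict `targets` at the positions where `mask` is True.
```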
- An investigation of the reconstruction capacity of stacked convolutional autoencoders for log-mel-spectrograms [2.3204178451683264]
In audio processing applications, the generation of expressive sounds from high-level representations is in high demand.
Modern algorithms, such as neural networks, have inspired the development of expressive synthesizers based on musical instrument compression.
This study investigates the use of stacked convolutional autoencoders for the compression of time-frequency audio representations for a variety of instruments for a single pitch.
arXiv Detail & Related papers (2023-01-18T17:19:04Z)
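A minimal sketch of a stacked convolutional autoencoder of the kind investigated in the entry above, compressing and reconstructing log-mel spectrogram patches; depth, channel widths, and the 128x128 input size are assumptions.

```python
# Hypothetical sketch of a stacked convolutional autoencoder over log-mel
# spectrograms; architecture details are illustrative assumptions.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(           # 1x128x128 -> 32x32x32
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(           # 32x32x32 -> 1x128x128
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
logmel = torch.randn(4, 1, 128, 128)            # toy batch of log-mel patches
recon = model(logmel)
loss = nn.functional.mse_loss(recon, logmel)    # reconstruction objective
```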
- Multi-instrument Music Synthesis with Spectrogram Diffusion [19.81982315173444]
We focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in real time.
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter.
We find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
arXiv Detail & Related papers (2022-06-11T03:26:15Z)
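A minimal sketch of the two-stage pipeline described in the entry above, with hypothetical stand-in modules: stage 1 maps MIDI tokens to spectrogram frames, stage 2 inverts spectrograms to audio. Neither module reflects the paper's actual architectures or training.

```python
# Hypothetical two-stage pipeline: MIDI tokens -> spectrogram -> waveform.
import torch
import torch.nn as nn

class MidiToSpectrogram(nn.Module):
    """Stand-in for the encoder-decoder Transformer (stage 1)."""
    def __init__(self, vocab=512, n_mels=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, 256)
        self.model = nn.Transformer(d_model=256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, midi_tokens):                 # (B, L) token ids
        x = self.embed(midi_tokens)
        y = self.model(x, x)                        # toy: decode against itself
        return self.out(y)                          # (B, L, n_mels) spectrogram

class SpectrogramInverter(nn.Module):
    """Stand-in for the GAN spectrogram inverter (stage 2)."""
    def __init__(self, n_mels=128, hop=256):
        super().__init__()
        self.up = nn.Linear(n_mels, hop)            # one hop of samples per frame

    def forward(self, spec):                        # (B, L, n_mels)
        return self.up(spec).flatten(1)             # (B, L * hop) waveform

midi = torch.randint(0, 512, (1, 64))
audio = SpectrogramInverter()(MidiToSpectrogram()(midi))   # (1, 16384)
```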
- Symphony Generation with Permutation Invariant Language Model [57.75739773758614]
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as the backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that the proposed approach can generate coherent, novel, complex, and harmonious symphonies comparable to human compositions.
arXiv Detail & Related papers (2022-05-10T13:08:49Z)
- Multitask AET with Orthogonal Tangent Regularity for Dark Object Detection [84.52197307286681]
We propose a novel multitask auto-encoding transformation (MAET) model to enhance object detection in a dark environment.
In a self-supervised manner, the MAET learns the intrinsic visual structure by encoding and decoding the realistic illumination-degrading transformation.
We achieve state-of-the-art performance on both synthetic and real-world datasets.
arXiv Detail & Related papers (2022-05-06T16:27:14Z)
- Content-aware Warping for View Synthesis [110.54435867693203]
We propose content-aware warping, which adaptively learns the weights for pixels of a relatively large neighborhood from their contextual information via a lightweight neural network.
Based on this learnable warping module, we propose a new end-to-end learning-based framework for novel view synthesis from two source views.
Experimental results on structured light field datasets with wide baselines and unstructured multi-view datasets show that the proposed method significantly outperforms state-of-the-art methods both quantitatively and visually.
arXiv Detail & Related papers (2022-01-22T11:35:05Z)
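A minimal sketch of the content-aware warping idea in the entry above: a lightweight network predicts softmax weights over a candidate neighborhood of source pixels, which are then blended per target pixel. Neighborhood size and input features are assumptions.

```python
# Hypothetical sketch: learn adaptive blending weights over a pixel neighborhood.
import torch
import torch.nn as nn

K = 9                                              # 3x3 candidate neighborhood
weight_net = nn.Sequential(                        # the "lightweight network"
    nn.Linear(K * 3, 32), nn.ReLU(), nn.Linear(32, K),
)

neigh = torch.rand(1024, K, 3)                     # RGB of K candidate source pixels
ctx = neigh.flatten(1)                             # contextual information (toy)
w = weight_net(ctx).softmax(dim=-1)                # adaptive per-pixel weights
warped = (w.unsqueeze(-1) * neigh).sum(dim=1)      # (1024, 3) blended colors
```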
- Sequence-to-Sequence Piano Transcription with Transformers [6.177271244427368]
We show that equivalent performance can be achieved using a generic encoder-decoder Transformer with standard decoding methods.
We demonstrate that the model can learn to translate spectrogram inputs directly to MIDI-like output events for several transcription tasks.
arXiv Detail & Related papers (2021-07-19T20:33:09Z)
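A minimal sketch of the sequence-to-sequence setup in the entry above: spectrogram frames feed a generic encoder-decoder Transformer that greedily emits MIDI-like event tokens. The vocabulary, dimensions, and decoding loop are illustrative assumptions.

```python
# Hypothetical sketch: spectrogram frames in, MIDI-like event tokens out.
import torch
import torch.nn as nn

N_MELS, VOCAB, SOS = 128, 1024, 0                  # hypothetical event vocabulary

proj = nn.Linear(N_MELS, 256)                      # spectrogram frames -> d_model
embed = nn.Embedding(VOCAB, 256)
model = nn.Transformer(d_model=256, batch_first=True)
head = nn.Linear(256, VOCAB)

spec = torch.randn(1, 100, N_MELS)                 # toy 100-frame spectrogram
tokens = torch.tensor([[SOS]])                     # start-of-sequence token
with torch.no_grad():
    for _ in range(20):                            # greedy, token by token
        out = model(proj(spec), embed(tokens))
        next_tok = head(out[:, -1]).argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
# `tokens` now holds MIDI-like events (note-on, note-off, time shifts, ...)
```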