Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting
- URL: http://arxiv.org/abs/2407.01499v1
- Date: Mon, 1 Jul 2024 17:43:45 GMT
- Title: Pictures Of MIDI: Controlled Music Generation via Graphical Prompts for Image-Based Diffusion Inpainting
- Authors: Scott H. Hawley
- Abstract summary: This study explores a user-friendly graphical interface enabling the drawing of masked regions for inpainting by an Hourglass Diffusion Transformer (HDiT) model trained on MIDI piano roll images.
We demonstrate that, in addition to inpainting of melodies, accompaniment, and continuations, the use of repainting can help increase note density, yielding musical structures that closely match user specifications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed significant progress in generative models for music, featuring diverse architectures that balance output quality, diversity, speed, and user control. This study explores a user-friendly graphical interface enabling the drawing of masked regions for inpainting by an Hourglass Diffusion Transformer (HDiT) model trained on MIDI piano roll images. To enhance note generation in specified areas, masked regions can be "repainted" with extra noise. The non-latent HDiT's linear scaling with pixel count allows efficient generation in pixel space, providing intuitive and interpretable controls such as masking throughout the network, and removing the need to operate in compressed latent spaces such as those provided by pretrained autoencoders. We demonstrate that, in addition to inpainting of melodies, accompaniment, and continuations, the use of repainting can help increase note density, yielding musical structures that closely match user specifications such as rising, falling, or diverging melody and/or accompaniment, even when these lie outside the typical training data distribution. We achieve performance on par with prior results while operating at longer context windows and with no autoencoder, and we enable complex geometries for inpainting masks, increasing the options for machine-assisted composers to control the generated music.
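The mechanics the abstract describes (drawing a mask, compositing the model's output with a re-noised copy of the known pixels, and "repainting" masked regions with extra noise) can be illustrated with a minimal sketch. The denoiser below is a stub, and names such as denoise_step, n_repaints, and extra_noise are illustrative assumptions rather than the paper's actual API:

```python
# Minimal sketch of mask-based diffusion inpainting with "repainting".
# The HDiT denoiser is stubbed out; this only shows the control flow.
import torch

def denoise_step(x_t: torch.Tensor, t: int, n_steps: int) -> torch.Tensor:
    """Stand-in for one reverse-diffusion step of a trained model."""
    return x_t - (1.0 / n_steps) * x_t  # dummy update

def inpaint(known: torch.Tensor, mask: torch.Tensor,
            n_steps: int = 50, n_repaints: int = 4,
            extra_noise: float = 0.0) -> torch.Tensor:
    # known: clean piano-roll image; mask: 1 where new notes are wanted.
    x = torch.randn_like(known)
    for t in reversed(range(n_steps)):
        sigma = (t + 1) / n_steps
        for _ in range(n_repaints):  # "repainting": redo the step
            # Re-noise the known region to the current noise level so it
            # stays consistent with the partially denoised masked region.
            known_t = known + sigma * torch.randn_like(known)
            x = denoise_step(x, t, n_steps)
            # Keep the model's output only inside the mask; extra noise
            # there encourages denser note generation on later passes.
            x = (mask * (x + extra_noise * sigma * torch.randn_like(x))
                 + (1 - mask) * known_t)
    return x

roll = torch.zeros(1, 1, 128, 256)   # blank piano roll
m = torch.zeros_like(roll)
m[..., :, 128:] = 1.0                # inpaint a continuation on the right half
result = inpaint(roll, m, extra_noise=0.5)
```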
Related papers
- Network Bending of Diffusion Models for Audio-Visual Generation [0.09558392439655014]
We present first steps toward a tool that enables artists to create music visualizations.
We investigate the application of network bending, the process of applying transforms within the layers of a generative network, to image generation diffusion models.
We generate music-reactive videos using Stable Diffusion by passing audio features as parameters to network bending operators.
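A hedged sketch of the network-bending idea, using a toy generator: a transform is hooked into one layer and parameterized by an audio feature (here, RMS energy). The model, hook, and feature names are placeholders, not the paper's actual setup:

```python
# Illustrative network bending: an audio-driven transform applied inside
# a generator's layers via a forward hook.
import torch
import torch.nn as nn

def rms(audio_frame: torch.Tensor) -> float:
    """Root-mean-square energy of one audio frame (the bending parameter)."""
    return float(torch.sqrt(torch.mean(audio_frame ** 2)))

def make_bending_hook(scale: float):
    def hook(module, inputs, output):
        # "Bend" the activations: a simple audio-scaled gain; rotations,
        # thresholds, etc. are other common bending operators.
        return output * (1.0 + scale)
    return hook

# Toy image generator standing in for a diffusion model's layers.
net = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(8, 3, 3, padding=1))
frame = torch.randn(2048)                    # one frame of audio samples
handle = net[0].register_forward_hook(make_bending_hook(rms(frame)))
image = net(torch.randn(1, 3, 64, 64))       # audio-reactive output
handle.remove()
```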
arXiv Detail & Related papers (2024-06-28T00:39:17Z)
- Trusted Video Inpainting Localization via Deep Attentive Noise Learning [2.1210527985139227]
We present a Trusted Video Inpainting Localization network (TruVIL) with excellent robustness and generalization ability.
We design deep attentive noise learning in multiple stages to capture inpainting traces.
To prepare enough training samples, we also build a frame-level video object segmentation dataset of 2500 videos.
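The summary does not spell out the architecture, so the following generic sketch only illustrates the broad idea of learning from noise residuals with attention (a fixed high-pass residual extractor feeding a self-attention layer). It is an assumption-laden illustration, not TruVIL's actual design:

```python
# Generic noise-residual + attention localizer, for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoiseAttentionBlock(nn.Module):
    def __init__(self, ch: int = 16):
        super().__init__()
        # Fixed high-pass filter approximating a noise-residual extractor.
        hp = torch.tensor([[-1., 2., -1.], [2., -4., 2.], [-1., 2., -1.]])
        self.register_buffer("hp", hp.view(1, 1, 3, 3) / 4.0)
        self.embed = nn.Conv2d(1, ch, 3, padding=1)
        self.attn = nn.MultiheadAttention(ch, num_heads=4, batch_first=True)
        self.head = nn.Conv2d(ch, 1, 1)      # per-pixel inpainting score

    def forward(self, gray_frames):          # (B, 1, H, W)
        res = F.conv2d(gray_frames, self.hp, padding=1)  # noise residuals
        f = self.embed(res)
        b, c, h, w = f.shape
        seq = f.flatten(2).transpose(1, 2)               # (B, HW, C)
        seq, _ = self.attn(seq, seq, seq)                # attend over pixels
        f = seq.transpose(1, 2).reshape(b, c, h, w)
        return torch.sigmoid(self.head(f))               # inpainted-region mask

mask = NoiseAttentionBlock()(torch.randn(1, 1, 32, 32))
```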
arXiv Detail & Related papers (2024-06-19T14:08:58Z)
- Lazy Diffusion Transformer for Interactive Image Editing [79.75128130739598]
We introduce a novel diffusion transformer, LazyDiffusion, that generates partial image updates efficiently.
Our approach targets interactive image editing applications in which, starting from a blank canvas or an image, a user specifies a sequence of localized image modifications.
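A minimal sketch of the "lazy" idea, assuming the generator runs only on a crop around the user's mask and the result is composited back, so cost scales with the edit rather than the canvas. The generator is stubbed and all names are illustrative, not LazyDiffusion's API:

```python
# Run generation only on the masked crop, then composite it back.
import torch

def generate_patch(patch, mask_patch):
    """Stand-in for a diffusion model that fills the masked pixels."""
    return torch.where(mask_patch.bool(), torch.rand_like(patch), patch)

def lazy_edit(canvas, mask):
    ys, xs = torch.nonzero(mask, as_tuple=True)[-2:]
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    crop = canvas[..., y0:y1, x0:x1].clone()
    crop = generate_patch(crop, mask[..., y0:y1, x0:x1])
    out = canvas.clone()
    out[..., y0:y1, x0:x1] = crop           # composite the partial update
    return out

img = torch.zeros(3, 256, 256)
m = torch.zeros(256, 256)
m[100:140, 60:90] = 1                        # small localized edit
edited = lazy_edit(img, m)
```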
arXiv Detail & Related papers (2024-04-18T17:59:27Z)
- Composer: Creative and Controllable Image Synthesis with Composable Conditions [57.78533372393828]
Recent large-scale generative models learned on big data are capable of synthesizing incredible images yet suffer from limited controllability.
This work offers a new generation paradigm that allows flexible control of the output image, such as spatial layout and palette, while maintaining the synthesis quality and model creativity.
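A hedged sketch of composable conditioning: each optional condition (e.g., palette, layout) is embedded and summed into a single guidance vector, so conditions can be mixed and matched at inference. This is a generic illustration, not Composer's actual architecture:

```python
# Sum embeddings of whichever conditions the user supplies.
import torch
import torch.nn as nn

class ComposableConditioner(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.embed = nn.ModuleDict({
            "palette": nn.Linear(3 * 8, dim),    # 8 dominant RGB colors
            "layout":  nn.Linear(16 * 16, dim),  # coarse spatial mask
        })

    def forward(self, conditions: dict) -> torch.Tensor:
        # Missing conditions simply contribute nothing.
        parts = [self.embed[name](x.flatten(1))
                 for name, x in conditions.items()]
        return torch.stack(parts).sum(0)

cond = ComposableConditioner()
palette_only = cond({"palette": torch.rand(1, 8, 3)})
composed = cond({"palette": torch.rand(1, 8, 3),
                 "layout": torch.rand(1, 16, 16)})
```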
arXiv Detail & Related papers (2023-02-20T05:48:41Z)
- Differentiable Soft-Masked Attention [115.5770357189209]
"Differentiable Soft-Masked Attention" is used for the task of WeaklySupervised Video Object.
We develop a transformer-based network for this task, which can also benefit from cycle-consistency training on videos with just one annotated frame.
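Soft-masked attention can be sketched generically: a continuous mask in [0, 1] is folded into the attention logits in log space, so gradients flow through the mask itself, unlike a hard -inf mask. This is an illustrative formulation, not necessarily the paper's exact one:

```python
# Attention whose mask is continuous and differentiable.
import torch
import torch.nn.functional as F

def soft_masked_attention(q, k, v, soft_mask, eps=1e-6):
    # q, k, v: (B, T, D); soft_mask: (B, T) in [0, 1], over the keys.
    logits = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    logits = logits + torch.log(soft_mask + eps).unsqueeze(1)
    return F.softmax(logits, dim=-1) @ v

q = torch.randn(2, 5, 8, requires_grad=True)
mask = torch.rand(2, 5, requires_grad=True)   # learnable soft mask
out = soft_masked_attention(q, q, q, mask)
out.sum().backward()                          # gradients reach the mask
print(mask.grad is not None)                  # True
```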
arXiv Detail & Related papers (2022-06-01T02:05:13Z)
- MAT: Mask-Aware Transformer for Large Hole Image Inpainting [79.67039090195527]
We present a novel model for large hole inpainting, which unifies the merits of transformers and convolutions.
Experiments demonstrate the state-of-the-art performance of the new model on multiple benchmark datasets.
arXiv Detail & Related papers (2022-03-29T06:36:17Z)
- The Piano Inpainting Application [0.0]
Generative algorithms are still not widely used by artists due to the limited control they offer, prohibitive inference times, or the lack of integration within musicians' workflows.
In this work, we present the Piano Inpainting Application (PIA), a generative model focused on inpainting piano performances.
arXiv Detail & Related papers (2021-07-13T09:33:11Z)
- Spectrogram Inpainting for Interactive Generation of Instrument Sounds [1.7205106391379026]
We cast the generation of individual instrumental notes as an inpainting-based task, introducing novel and unique ways to iteratively shape sounds.
Most crucially, we open-source an interactive web interface for transforming sounds by inpainting, for artists and practitioners alike, opening up new creative uses.
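A minimal sketch of iterative sound shaping by inpainting, assuming a stubbed generative model that refills user-selected time-frequency regions of a spectrogram; the function names are illustrative, not the paper's interface:

```python
# Iteratively mask and regenerate regions of a mel-spectrogram.
import numpy as np

def regenerate(spec, mask):
    """Stand-in for an inpainting model over spectrogram cells."""
    filled = spec.copy()
    filled[mask] = np.random.rand(mask.sum())
    return filled

spec = np.random.rand(128, 64)                # (mel bins, time frames)
for (f0, f1, t0, t1) in [(0, 40, 0, 16), (80, 128, 32, 64)]:
    m = np.zeros_like(spec, dtype=bool)
    m[f0:f1, t0:t1] = True                    # region chosen in the UI
    spec = regenerate(spec, m)                # iteratively reshape the sound
```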
arXiv Detail & Related papers (2021-04-15T15:17:31Z)
- Free-Form Image Inpainting via Contrastive Attention Network [64.05544199212831]
In image inpainting tasks, masks of arbitrary shape can appear anywhere in an image, forming complex patterns.
It is difficult for encoders to learn powerful representations under such complex conditions.
We propose a self-supervised Siamese inference network to improve the robustness and generalization.
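A hedged sketch of the self-supervised Siamese idea: two differently masked views of the same image pass through a shared encoder, and a contrastive-style loss pulls their features together. This is a generic illustration, not the paper's exact network or loss:

```python
# Shared encoder over two masked views; align their embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

encoder = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())

def random_mask(x, p: float = 0.3):
    return x * (torch.rand_like(x[:, :1]) > p)   # random free-form holes

img = torch.rand(4, 3, 64, 64)
z1 = F.normalize(encoder(random_mask(img)), dim=1)
z2 = F.normalize(encoder(random_mask(img)), dim=1)
loss = (2 - 2 * (z1 * z2).sum(dim=1)).mean()     # pull the views together
loss.backward()
```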
arXiv Detail & Related papers (2020-10-29T14:46:05Z)
- Look here! A parametric learning based approach to redirect visual attention [49.609412873346386]
We introduce an automatic method to make an image region more attention-capturing via subtle image edits.
Our model predicts a distinct set of global parametric transformations to be applied to the foreground and background image regions.
Our edits enable inference at interactive rates on any image size, and easily generalize to videos.
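A hedged sketch of the parametric approach: a small network predicts a handful of global parameters (here, simple gain and saturation, chosen for illustration) that are applied separately to the foreground and background regions. The parameterization is an assumption, not the paper's:

```python
# Predict global edit parameters and apply them per region.
import torch
import torch.nn as nn

param_net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 4))

def apply_edit(img, fg_mask, params):
    fg_gain, fg_sat, bg_gain, bg_sat = params.sigmoid().unbind(-1)
    gray = img.mean(1, keepdim=True)
    def edit(x, gain, sat):          # brightness gain + saturation blend
        return (gray + sat * (x - gray)) * (0.5 + gain)
    fg = edit(img, fg_gain, fg_sat)
    bg = edit(img, bg_gain, bg_sat)
    return fg_mask * fg + (1 - fg_mask) * bg

img = torch.rand(1, 3, 32, 32)
mask = torch.zeros(1, 1, 32, 32)
mask[..., 8:24, 8:24] = 1                     # foreground region
out = apply_edit(img, mask, param_net(img))
```

Because only a few global parameters are predicted, the same edit can be applied at any resolution, which is consistent with the interactive-rate claim above.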
arXiv Detail & Related papers (2020-08-12T16:08:36Z)
- Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders [9.923470453197657]
We focus on leveraging adversarial regularization as a flexible and natural means to imbue variational autoencoders with context information.
We introduce the first Music Adversarial Autoencoder (MusAE).
Our model has a higher reconstruction accuracy than state-of-the-art models based on standard variational autoencoders.
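A minimal sketch of adversarial regularization in an autoencoder, the core AAE idea the summary describes: a discriminator pushes encoded latents toward a Gaussian prior while the encoder tries to fool it. This is generic and toy-sized, not MusAE's exact model:

```python
# Adversarial autoencoder losses over a toy latent space.
import torch
import torch.nn as nn

enc = nn.Linear(128, 16)      # toy encoder over a bar of MIDI features
dec = nn.Linear(16, 128)
disc = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
bce = nn.BCEWithLogitsLoss()

x = torch.rand(8, 128)
z = enc(x)
recon_loss = ((dec(z) - x) ** 2).mean()

# Discriminator: real = samples from the prior, fake = encoder outputs.
d_loss = (bce(disc(torch.randn(8, 16)), torch.ones(8, 1))
          + bce(disc(z.detach()), torch.zeros(8, 1)))
# Encoder tries to make its latents indistinguishable from the prior.
g_loss = recon_loss + bce(disc(z), torch.ones(8, 1))
```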
arXiv Detail & Related papers (2020-01-15T18:07:20Z)