DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation
- URL: http://arxiv.org/abs/2408.10807v1
- Date: Tue, 20 Aug 2024 12:56:49 GMT
- Title: DisMix: Disentangling Mixtures of Musical Instruments for Source-level Pitch and Timbre Manipulation
- Authors: Yin-Jyun Luo, Kin Wai Cheuk, Woosung Choi, Toshimitsu Uesaka, Keisuke Toyama, Koichi Saito, Chieh-Hsin Lai, Yuhta Takida, Wei-Hsiang Liao, Simon Dixon, Yuki Mitsufuji
- Abstract summary: DisMix is a generative framework in which the pitch and timbre representations act as building blocks for constructing the melody and instrument of a source.
By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments.
We jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations.
- Score: 21.06957311285177
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing work on pitch and timbre disentanglement has mostly focused on single-instrument music audio, excluding cases where multiple instruments are present. To fill this gap, we propose DisMix, a generative framework in which the pitch and timbre representations act as modular building blocks for constructing the melody and instrument of a source; the collection of these representations forms a set of per-instrument latent representations underlying the observed mixture. By manipulating the representations, our model samples mixtures with novel combinations of pitch and timbre of the constituent instruments. We jointly learn the disentangled pitch-timbre representations and a latent diffusion transformer that reconstructs the mixture conditioned on the set of source-level representations. We evaluate the model using both a simple dataset of isolated chords and realistic four-part chorales in the style of J.S. Bach, identify the key components for successful disentanglement, and demonstrate mixture transformation based on source-level attribute manipulation.
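The conditioning pipeline can be pictured with a minimal sketch (illustrative only, not the authors' code): each source yields a time-varying pitch latent and a time-invariant timbre latent, the set of per-source latents conditions a decoder that reconstructs the mixture, and swapping timbre latents between sources transforms the mixture. Module shapes and the simple linear decoder standing in for the latent diffusion transformer are assumptions.

```python
# Minimal sketch of the DisMix conditioning idea (not the authors' code).
import torch
import torch.nn as nn

class SourceEncoder(nn.Module):
    """Encodes one source's spectrogram into pitch and timbre latents."""
    def __init__(self, in_dim=128, pitch_dim=64, timbre_dim=64):
        super().__init__()
        self.pitch_head = nn.Linear(in_dim, pitch_dim)    # melody content
        self.timbre_head = nn.Linear(in_dim, timbre_dim)  # instrument identity

    def forward(self, x):                         # x: (batch, frames, in_dim)
        pitch = self.pitch_head(x)                # time-varying pitch latent
        timbre = self.timbre_head(x.mean(dim=1))  # time-invariant timbre latent
        return pitch, timbre

class MixtureDecoder(nn.Module):
    """Stand-in for the latent diffusion transformer: reconstructs the
    mixture conditioned on the set of source-level latents."""
    def __init__(self, pitch_dim=64, timbre_dim=64, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(pitch_dim + timbre_dim, out_dim)

    def forward(self, pitch_list, timbre_list):
        per_source = [
            self.proj(torch.cat([p, t.unsqueeze(1).expand(-1, p.size(1), -1)], dim=-1))
            for p, t in zip(pitch_list, timbre_list)
        ]
        return torch.stack(per_source).sum(dim=0)  # combine sources into a mixture

# Attribute manipulation: swap timbre latents between two sources so each
# instrument plays the other's melody.
enc, dec = SourceEncoder(), MixtureDecoder()
src_a, src_b = torch.randn(1, 10, 128), torch.randn(1, 10, 128)
(p_a, t_a), (p_b, t_b) = enc(src_a), enc(src_b)
mix_swapped = dec([p_a, p_b], [t_b, t_a])  # pitch of A with timbre of B, and vice versa
print(mix_swapped.shape)  # torch.Size([1, 10, 128])
```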
Related papers
- Separate This, and All of these Things Around It: Music Source Separation via Hyperellipsoidal Queries [53.30852012059025]
Music source separation is an audio-to-audio retrieval task.
Recent work in music source separation has begun to challenge the fixed-stem paradigm.
We propose the use of hyperellipsoidal regions as queries to allow for an intuitive yet easily parametrizable approach to specifying both the target (location) and its spread.
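As a rough illustration of the query idea (assuming an axis-aligned parametrization; the paper's exact formulation may differ), a hyperellipsoid can be specified by a center (the target's location in the embedding space) and per-axis radii (its spread), with membership tested via a normalized distance:

```python
# Hedged sketch of a hyperellipsoidal query over an embedding space.
# An embedding e is selected when sum(((e - mu) / r)^2) <= 1.
import numpy as np

def in_hyperellipsoid(embeddings, mu, radii):
    """Boolean mask of embeddings inside the axis-aligned hyperellipsoid."""
    z = (embeddings - mu) / radii          # scale each axis by its radius
    return (z ** 2).sum(axis=-1) <= 1.0

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 8))   # e.g. per-frame source embeddings
mu = np.zeros(8)                          # query target: where the source lives
radii = np.full(8, 1.5)                   # query spread: how wide to select
mask = in_hyperellipsoid(embeddings, mu, radii)
print(mask.mean())                        # fraction of embeddings retrieved
```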
arXiv Detail & Related papers (2025-01-27T16:13:50Z)
- Subtractive Training for Music Stem Insertion using Latent Diffusion Models [35.91945598575059]
We present Subtractive Training, a method for synthesizing individual musical instrument stems given other instruments as context.
Our results demonstrate Subtractive Training's efficacy in creating authentic drum stems that seamlessly blend with the existing tracks.
We extend this technique to MIDI formats, successfully generating compatible bass, drum, and guitar parts for incomplete arrangements.
arXiv Detail & Related papers (2024-06-27T16:59:14Z)
- PowMix: A Versatile Regularizer for Multimodal Sentiment Analysis [71.8946280170493]
This paper introduces PowMix, a versatile embedding space regularizer that builds upon the strengths of unimodal mixing-based regularization approaches.
PowMix is integrated before the fusion stage of multimodal architectures and facilitates intra-modal mixing, such as mixing text with text, to act as a regularizer.
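In the spirit of the described intra-modal mixing, a minimal sketch (the Beta-sampled coefficient and single-pair mixing are standard mixup-style assumptions, not PowMix's exact scheme):

```python
# Illustrative intra-modal mixing applied before the fusion stage.
import torch

def intramodal_mix(embeddings, alpha=0.4):
    """Mix each embedding with a shuffled peer from the same modality;
    returns the mixed embeddings plus the permutation and lambda needed
    to mix the training targets the same way."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    perm = torch.randperm(embeddings.size(0))
    mixed = lam * embeddings + (1 - lam) * embeddings[perm]
    return mixed, perm, lam

text_emb = torch.randn(32, 256)      # unimodal text embeddings
mixed_text, perm, lam = intramodal_mix(text_emb)
# mixed_text then enters multimodal fusion in place of text_emb;
# the loss mixes targets as lam * y + (1 - lam) * y[perm].
print(mixed_text.shape, float(lam))
```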
arXiv Detail & Related papers (2023-12-19T17:01:58Z)
- Performance Conditioning for Diffusion-Based Multi-Instrument Music Synthesis [15.670399197114012]
We propose enhancing control of multi-instrument synthesis by conditioning a generative model on a specific performance and recording environment.
Performance conditioning is a tool that instructs the generative model to synthesize music with the style and timbre of specific instruments taken from specific performances.
Our prototype is evaluated on uncurated performances with diverse instrumentation, achieving state-of-the-art FAD realism scores.
arXiv Detail & Related papers (2023-09-21T17:44:57Z)
- Multi-Source Diffusion Models for Simultaneous Music Generation and Separation [17.124189082882395]
We train our model on Slakh2100, a standard dataset for musical source separation.
Our method is the first example of a single model that can handle both generation and separation tasks.
arXiv Detail & Related papers (2023-02-04T23:18:36Z)
- A-Muze-Net: Music Generation by Composing the Harmony based on the Generated Melody [91.22679787578438]
We present a method for generating MIDI files of piano music.
The method models the right and left hands using two networks, where the left hand is conditioned on the right hand.
The MIDI is represented in a way that is invariant to the musical scale, and the melody is represented separately for the purpose of conditioning the harmony.
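A minimal sketch of the two-network conditioning (module choices are illustrative assumptions, not the A-Muze-Net architecture): the left-hand model receives the right-hand model's states as additional input:

```python
# Sketch of conditioning the harmony (left hand) on the melody (right hand).
import torch
import torch.nn as nn

right_hand = nn.GRU(input_size=64, hidden_size=128, batch_first=True)
left_hand = nn.GRU(input_size=64 + 128, hidden_size=128, batch_first=True)

melody_tokens = torch.randn(1, 32, 64)        # embedded right-hand sequence
melody_states, _ = right_hand(melody_tokens)  # melody representation
harmony_in = torch.cat([torch.randn(1, 32, 64), melody_states], dim=-1)
harmony_states, _ = left_hand(harmony_in)     # harmony conditioned on melody
print(harmony_states.shape)  # torch.Size([1, 32, 128])
```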
arXiv Detail & Related papers (2021-11-25T09:45:53Z)
- A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis [13.263771543118994]
We propose a unified model for three inter-related tasks: 1) to separate individual sound sources from a mixed music audio, 2) to transcribe each sound source to MIDI notes, and 3) to synthesize new pieces based on the timbre of separated sources.
The model is inspired by the fact that when humans listen to music, our minds can not only separate the sounds of different instruments, but also at the same time perceive high-level representations such as score and timbre.
arXiv Detail & Related papers (2021-08-07T14:28:21Z)
- Modeling the Compatibility of Stem Tracks to Generate Music Mashups [6.922825755771942]
A music mashup combines audio elements from two or more songs to create a new work.
Research has developed algorithms that predict the compatibility of audio elements.
arXiv Detail & Related papers (2021-03-26T01:51:11Z)
- Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes an acoustic model trained for automatic speech recognition, together with melody-extracted features, to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
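A minimal sketch of the quantization step behind such a discrete timbre representation, following standard VQ-VAE practice rather than this paper's exact configuration:

```python
# Snap continuous encoder latents to a discrete timbre codebook.
import torch

def quantize(z, codebook):
    """Replace each latent with its nearest codebook vector."""
    d = torch.cdist(z, codebook)   # (batch, num_codes) pairwise distances
    idx = d.argmin(dim=-1)         # discrete timbre code per latent
    z_q = codebook[idx]
    # Straight-through estimator: gradients flow as if quantization were identity.
    return z + (z_q - z).detach(), idx

codebook = torch.randn(64, 16)              # 64 learnable timbre atoms
z = torch.randn(8, 16, requires_grad=True)  # encoder output (loudness factored out)
z_q, idx = quantize(z, codebook)
print(idx.tolist())
```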
arXiv Detail & Related papers (2020-07-13T12:35:45Z)