Score-informed Music Source Separation: Improving Synthetic-to-real Generalization in Classical Music
- URL: http://arxiv.org/abs/2503.07352v1
- Date: Mon, 10 Mar 2025 14:08:31 GMT
- Title: Score-informed Music Source Separation: Improving Synthetic-to-real Generalization in Classical Music
- Authors: Eetu Tunturi, David Diaz-Guerra, Archontis Politis, Tuomas Virtanen
- Abstract summary: Music source separation is the task of separating a mixture of instruments into constituent tracks. We propose two ways of using musical scores to aid music source separation: a score-informed model and a score-only model. The score-informed model improves separation results compared to a baseline approach, but struggles to generalize from synthetic to real data.
- Score: 8.468436398420764
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Music source separation is the task of separating a mixture of instruments into constituent tracks. Music source separation models are typically trained using only audio data, although additional information can be used to improve the model's separation capability. In this paper, we propose two ways of using musical scores to aid music source separation: a score-informed model where the score is concatenated with the magnitude spectrogram of the audio mixture as the input of the model, and a model where we use only the score to calculate the separation mask. We train our models on synthetic data in the SynthSOD dataset and evaluate our methods on the URMP and Aalto anechoic orchestra datasets, comprised of real recordings. The score-informed model improves separation results compared to a baseline approach, but struggles to generalize from synthetic to real data, whereas the score-only model shows a clear improvement in synthetic-to-real generalization.
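The abstract is concrete enough to sketch both variants. The PyTorch snippet below is a minimal illustration only, not the paper's architecture: the network `ScoreInformedSeparator`, its channel counts, and the normalization in `score_only_masks` are all assumptions. The score-informed variant concatenates a frame-aligned score representation (e.g., a piano roll mapped onto the spectrogram's frequency axis) with the mixture magnitude spectrogram along the channel dimension; the score-only variant computes the separation mask from the score alone.

```python
import torch
import torch.nn as nn


class ScoreInformedSeparator(nn.Module):
    """Sketch of the score-informed variant: a frame-aligned score
    (one channel per instrument) is concatenated with the mixture
    magnitude spectrogram, and the network predicts one magnitude
    mask per instrument. The conv stack is a placeholder, not the
    paper's architecture."""

    def __init__(self, n_instruments: int, hidden: int = 32):
        super().__init__()
        # 1 mixture channel + n_instruments score channels in,
        # n_instruments mask channels out.
        self.net = nn.Sequential(
            nn.Conv2d(1 + n_instruments, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, n_instruments, 3, padding=1),
            nn.Sigmoid(),  # masks constrained to [0, 1]
        )

    def forward(self, mix_mag, score_roll):
        # mix_mag:    (batch, 1, freq, time) mixture magnitude
        # score_roll: (batch, n_instruments, freq, time) aligned score
        masks = self.net(torch.cat([mix_mag, score_roll], dim=1))
        return masks * mix_mag  # per-instrument magnitude estimates


def score_only_masks(score_roll, eps=1e-8):
    """Score-only variant (assumed form): masks derived from the score
    alone, here by normalizing per-instrument score activity so the
    masks sum to one over instruments at each time-frequency bin."""
    return score_roll / (score_roll.sum(dim=1, keepdim=True) + eps)


# Usage with dummy shapes: 2 mixtures, 4 instruments, 513 bins, 100 frames.
mix = torch.rand(2, 1, 513, 100)
score = torch.rand(2, 4, 513, 100)
est = ScoreInformedSeparator(n_instruments=4)(mix, score)  # (2, 4, 513, 100)
```

In both variants the masks multiply the mixture magnitude, and time-domain estimates would be recovered with the mixture phase, as is standard for magnitude-mask separation.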
Related papers
- Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio.
We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks.
In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z) - Separate This, and All of these Things Around It: Music Source Separation via Hyperellipsoidal Queries [53.30852012059025]
Music source separation is an audio-to-audio retrieval task.
Recent work in music source separation has begun to challenge the fixed-stem paradigm.
We propose the use of hyperellipsoidal regions as queries to allow for an intuitive yet easily parametrizable approach to specifying both the target (location) and its spread.
arXiv Detail & Related papers (2025-01-27T16:13:50Z) - Universal Sound Separation with Self-Supervised Audio Masked Autoencoder [35.560261097213846]
We propose integrating a self-supervised pre-trained model, namely the audio masked autoencoder (A-MAE), into a universal sound separation system.
The proposed methods successfully enhance the separation performance of a state-of-the-art ResUNet-based USS model.
arXiv Detail & Related papers (2024-07-16T14:11:44Z) - Naturalistic Music Decoding from EEG Data via Latent Diffusion Models [14.882764251306094]
This study represents an initial foray into achieving high-quality general music reconstruction from non-invasive EEG data.
We train our models on the public NMED-T dataset and perform quantitative evaluation, proposing neural-embedding-based metrics.
arXiv Detail & Related papers (2024-05-15T03:26:01Z) - Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z) - Benchmarks and leaderboards for sound demixing tasks [44.99833362998488]
We introduce two new benchmarks for the sound source separation tasks.
We compare popular models for sound demixing, as well as their ensembles, on these benchmarks.
We also develop a novel approach for audio separation, based on ensembling the models best suited to each particular stem.
arXiv Detail & Related papers (2023-05-12T14:00:26Z) - Multi-Source Diffusion Models for Simultaneous Music Generation and Separation [17.124189082882395]
We train our model on Slakh2100, a standard dataset for musical source separation.
Our method is the first example of a single model that can handle both generation and separation tasks.
arXiv Detail & Related papers (2023-02-04T23:18:36Z) - Music Separation Enhancement with Generative Modeling [11.545349346125743]
We propose a post-processing model, Make it Sound Good (MSG), to enhance the output of music source separation systems.
Crowdsourced subjective evaluations demonstrate that human listeners prefer source estimates of bass and drums that have been post-processed by MSG.
arXiv Detail & Related papers (2022-08-26T00:44:37Z) - BinauralGrad: A Two-Stage Conditional Diffusion Probabilistic Model for Binaural Audio Synthesis [129.86743102915986]
We formulate the synthesis process from a different perspective by decomposing the binaural audio into a common part shared by both channels and a channel-specific part.
We propose BinauralGrad, a novel two-stage framework equipped with diffusion models to synthesize these two parts respectively.
Experiment results show that BinauralGrad outperforms the existing baselines by a large margin in terms of both objective and subjective evaluation metrics.
arXiv Detail & Related papers (2022-05-30T02:09:26Z) - A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis [13.263771543118994]
We propose a unified model for three inter-related tasks: 1) to separate individual sound sources from mixed music audio, 2) to transcribe each sound source to MIDI notes, and 3) to synthesize new pieces based on the timbre of separated sources.
The model is inspired by the fact that when humans listen to music, our minds not only separate the sounds of different instruments but also simultaneously perceive high-level representations such as score and timbre.
arXiv Detail & Related papers (2021-08-07T14:28:21Z) - Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods incorporating score information into music performance assessment (MPA) models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z) - VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)