Music Separation Enhancement with Generative Modeling
- URL: http://arxiv.org/abs/2208.12387v1
- Date: Fri, 26 Aug 2022 00:44:37 GMT
- Title: Music Separation Enhancement with Generative Modeling
- Authors: Noah Schaffer, Boaz Cogan, Ethan Manilow, Max Morrison, Prem
Seetharaman, and Bryan Pardo
- Abstract summary: We propose a post-processing model, Make it Sound Good (MSG), to enhance the output of music source separation systems.
Crowdsourced subjective evaluations demonstrate that human listeners prefer source estimates of bass and drums that have been post-processed by MSG.
- Score: 11.545349346125743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite phenomenal progress in recent years, state-of-the-art music
separation systems produce source estimates with significant perceptual
shortcomings, such as adding extraneous noise or removing harmonics. We propose
a post-processing model (the Make it Sound Good (MSG) post-processor) to
enhance the output of music source separation systems. We apply our
post-processing model to state-of-the-art waveform-based and spectrogram-based
music source separators, including a separator unseen by MSG during training.
Our analysis of the errors produced by source separators shows that waveform
models tend to introduce more high-frequency noise, while spectrogram models
tend to lose transients and high-frequency content. We introduce objective
measures to quantify both kinds of errors and show that MSG improves source
reconstruction with respect to both. Crowdsourced subjective evaluations
demonstrate that human listeners prefer source estimates of bass and drums that
have been post-processed by MSG.
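The two error classes described above can be probed with simple spectral statistics. Below is a minimal NumPy sketch of a high-frequency-noise proxy; the function name, cutoff, and formulation are illustrative and are not the paper's actual objective measures:

```python
import numpy as np

def added_hf_energy(reference, estimate, sr=44100, cutoff_hz=8000):
    """Rough proxy for high-frequency noise added by a separator:
    energy of the positive spectral-magnitude difference above a cutoff.
    Illustrative only -- not the objective measure from the paper."""
    n = min(len(reference), len(estimate))
    ref_mag = np.abs(np.fft.rfft(reference[:n]))
    est_mag = np.abs(np.fft.rfft(estimate[:n]))
    freqs = np.fft.rfftfreq(n, d=1.0 / sr)
    band = freqs >= cutoff_hz
    # Only count energy the estimate has *in excess* of the reference.
    excess = np.clip(est_mag[band] - ref_mag[band], 0.0, None)
    return float(np.sum(excess ** 2))
```

A perfect estimate scores zero; an estimate contaminated with energy above the cutoff scores higher. A symmetric measure (reference minus estimate) would analogously quantify lost high-frequency content.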
Related papers
- An Ensemble Approach to Music Source Separation: A Comparative Analysis of Conventional and Hierarchical Stem Separation [0.4893345190925179]
Music source separation (MSS) is a task that involves isolating individual sound sources, or stems, from mixed audio signals.
This paper presents an ensemble approach to MSS, combining several state-of-the-art architectures to achieve superior separation performance.
arXiv Detail & Related papers (2024-10-28T06:18:12Z)
- Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation [0.0]
This study tackles the distinct separation of vocal components from musical spectrograms.
We employ the Short-Time Fourier Transform (STFT) to convert audio waveforms into detailed time-frequency spectrograms.
We implement a U-Net neural network to segment the spectrogram image, aiming to delineate and extract singing-voice components accurately.
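As a concrete illustration of the STFT front-end described above, here is a minimal NumPy spectrogram sketch; the window, frame sizes, and function name are illustrative, not taken from the paper:

```python
import numpy as np

def stft_magnitude(signal, n_fft=1024, hop=256):
    """Minimal STFT magnitude spectrogram sketched with NumPy only."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    # Shape: (frequency bins, time frames), as a U-Net would consume it.
    return np.abs(np.fft.rfft(frames, axis=1)).T
```

A segmentation network would then predict a mask over this (frequency, time) image to isolate the vocal component.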
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
- SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
However, these models are prone to generating audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-01-25T18:21:51Z)
- Separate And Diffuse: Using a Pretrained Diffusion Model for Improving Source Separation [99.19786288094596]
We show how the upper bound can be generalized to the case of random generative models.
We show state-of-the-art results on 2, 3, 5, 10, and 20 speakers on multiple benchmarks.
arXiv Detail & Related papers (2023-01-25T18:21:51Z)
- Music Source Separation with Band-split RNN [25.578400006180527]
We propose a frequency-domain model that splits the spectrogram of the mixture into subbands and performs interleaved band-level and sequence-level modeling.
The bandwidths of the subbands can be chosen using prior or expert knowledge of the characteristics of the target source.
Experimental results show that BSRNN, trained only on the MUSDB18-HQ dataset, significantly outperforms several top-ranking models from the Music Demixing (MDX) Challenge 2021.
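The band-splitting step can be sketched as a simple slice of the spectrogram along the frequency axis; the band edges below are illustrative and are not BSRNN's actual bandwidth scheme:

```python
import numpy as np

def split_into_subbands(spec, band_edges):
    """Split a magnitude spectrogram (freq x time) into subbands along
    the frequency axis, as in band-split models. The edges are chosen
    by the caller, e.g. from expert knowledge of the target source."""
    bands = []
    start = 0
    for edge in band_edges:
        bands.append(spec[start:edge])
        start = edge
    bands.append(spec[start:])  # remaining top band
    return bands
```

In a band-split model, each subband would then be projected to a shared feature dimension and processed by interleaved band-level and time-level RNNs.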
arXiv Detail & Related papers (2022-09-30T01:49:52Z)
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
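The spectral-shaping idea can be illustrated with a toy frequency-domain filter that imposes a target envelope on noise; this is a sketch of the general concept only, not SpecGrad's actual time-varying time-frequency filter:

```python
import numpy as np

def shape_noise(noise, target_envelope):
    """Impose a target spectral envelope on a noise signal by
    normalizing each frequency bin to unit magnitude and rescaling
    it by the desired envelope (toy illustration)."""
    spectrum = np.fft.rfft(noise)
    shaped = spectrum / (np.abs(spectrum) + 1e-8) * target_envelope
    return np.fft.irfft(shaped, n=len(noise))
```

In SpecGrad itself the envelope comes from the conditioning log-mel spectrogram and varies over time, so the filtering is done frame-by-frame in the time-frequency domain.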
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
- HpRNet: Incorporating Residual Noise Modeling for Violin in a Variational Parametric Synthesizer [11.4219428942199]
We introduce a dataset of Carnatic Violin Recordings where bow noise is an integral part of the playing style of higher pitched notes.
We obtain insights about each of the harmonic and residual components of the signal, as well as their interdependence.
arXiv Detail & Related papers (2020-08-19T12:48:32Z)
- Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU.
The proposed model is based on an encoder-decoder architecture with skip-connections.
It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)
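The causality constraint that makes real-time streaming possible can be sketched as a causal 1-D convolution, where each output sample depends only on current and past inputs (a toy illustration, not the paper's encoder-decoder architecture):

```python
import numpy as np

def causal_conv1d(signal, kernel):
    """Causal 1-D convolution: y[t] = sum_i kernel[i] * x[t - i].
    Left-padding with zeros means no future samples are read, the
    prerequisite for streaming, low-latency enhancement."""
    k = len(kernel)
    padded = np.concatenate([np.zeros(k - 1), signal])
    return np.array([np.dot(padded[t:t + k], kernel[::-1])
                     for t in range(len(signal))])
```

A streaming enhancer stacks such causal layers (with skip connections, in the paper's case) so the whole network preserves this one-sided dependency.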
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.