Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation
- URL: http://arxiv.org/abs/2405.20059v1
- Date: Thu, 30 May 2024 13:47:53 GMT
- Title: Spectral Mapping of Singing Voices: U-Net-Assisted Vocal Segmentation
- Authors: Adam Sorrenti
- Abstract summary: This study tackles the distinct separation of vocal components from musical spectrograms.
We employ the Short-Time Fourier Transform (STFT) to convert audio waveforms into detailed frequency-time spectrograms.
We implement a U-Net neural network to segment the spectrogram image, aiming to delineate and extract singing-voice components accurately.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Separating vocal elements from musical tracks is a longstanding challenge in audio signal processing. This study tackles the distinct separation of vocal components from musical spectrograms. We employ the Short-Time Fourier Transform (STFT) to convert audio waveforms into detailed frequency-time spectrograms, utilizing the benchmark MUSDB18 dataset for music separation. Subsequently, we implement a U-Net neural network to segment the spectrogram image, aiming to delineate and extract singing-voice components accurately. We achieved noteworthy results in audio source separation using our U-Net-based models. The combination of frequency-axis normalization with Min/Max scaling and the Mean Absolute Error (MAE) loss function achieved the highest Source-to-Distortion Ratio (SDR) of 7.1 dB, indicating a high level of accuracy in preserving the quality of the original signal during separation. This setup also recorded impressive Source-to-Interference Ratio (SIR) and Source-to-Artifact Ratio (SAR) scores of 25.2 dB and 7.2 dB, respectively. These values significantly outperformed other configurations, particularly those using Quantile-based normalization or a Mean Squared Error (MSE) loss function. Our source code, model weights, and demo material can be found at the project's GitHub repository: https://github.com/mbrotos/SoundSeg
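For concreteness, below is a minimal sketch of the pipeline the abstract describes, using librosa, NumPy, and mir_eval. The `unet` callable, the exact form of the frequency-axis Min/Max normalization, and the mask-based reconstruction are our assumptions for illustration; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch of the described pipeline; `unet` is a hypothetical model
# returning a soft vocal mask in [0, 1] with the same shape as its input.
import numpy as np
import librosa
import mir_eval

def freq_axis_minmax(mag, eps=1e-8):
    """Per-frequency-bin Min/Max scaling -- one plausible reading of
    'frequency-axis normalization with Min/Max scaling'."""
    lo = mag.min(axis=1, keepdims=True)
    hi = mag.max(axis=1, keepdims=True)
    return (mag - lo) / (hi - lo + eps)

def separate_vocals(mixture, unet, n_fft=2048, hop=512):
    # STFT: waveform -> complex frequency-time spectrogram
    spec = librosa.stft(mixture, n_fft=n_fft, hop_length=hop)
    mag, phase = np.abs(spec), np.angle(spec)

    # U-Net predicts a soft vocal mask over the normalized magnitude
    mask = unet(freq_axis_minmax(mag))  # hypothetical model call

    # Apply the mask to the unnormalized magnitude, reuse the mixture phase
    vocal_spec = (mask * mag) * np.exp(1j * phase)
    return librosa.istft(vocal_spec, hop_length=hop, length=len(mixture))

def evaluate(ref_vocals, ref_accomp, est_vocals, est_accomp):
    # BSS-eval metrics (SDR/SIR/SAR) as reported in the abstract
    refs = np.stack([ref_vocals, ref_accomp])
    ests = np.stack([est_vocals, est_accomp])
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(refs, ests)
    return sdr, sir, sar
```

During training, the MAE loss mentioned above would presumably be computed between the masked magnitude estimate and the ground-truth vocal magnitude.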
Related papers
- A Generalized Bandsplit Neural Network for Cinematic Audio Source Separation [39.45425155123186]
We develop a model generalizing the Bandsplit RNN to any complete or overcomplete partition of the frequency axis.
A loss function motivated by the signal-to-noise ratio and the sparsity-promoting property of the 1-norm is proposed (a hedged sketch of such a loss follows this entry).
Our best model sets the state of the art on the Divide and Remaster dataset with performance above the ideal ratio mask for the dialogue stem.
arXiv Detail & Related papers (2023-09-05T19:19:22Z)
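A hedged sketch of what an SNR-motivated loss with a 1-norm sparsity term could look like in PyTorch; the weighting and exact formulation are our assumptions, not the paper's definition.

```python
# Sketch: negative SNR (in dB) plus an L1 sparsity penalty on the estimate.
import torch

def snr_l1_loss(estimate, target, l1_weight=0.1, eps=1e-8):
    # Maximizing SNR is equivalent to minimizing its negative
    noise = target - estimate
    snr = 10 * torch.log10(
        (target.pow(2).sum(dim=-1) + eps) / (noise.pow(2).sum(dim=-1) + eps)
    )
    # The 1-norm promotes sparsity (e.g., silence in inactive stems)
    sparsity = estimate.abs().mean()
    return -snr.mean() + l1_weight * sparsity
```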
- AERO: Audio Super Resolution in the Spectral Domain [15.965382891955771]
We present AERO, an audio super-resolution model that processes speech and music signals in the spectral domain.
We optimize the model using both time- and frequency-domain loss functions (a combined loss is sketched after this entry).
We demonstrate high performance across a wide range of sample rates on both speech and music.
arXiv Detail & Related papers (2022-11-22T12:37:01Z)
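A hedged sketch of combining time- and frequency-domain terms in one objective: L1 on waveforms plus L1 on STFT magnitudes. The STFT parameters and weighting are illustrative assumptions, not AERO's actual loss.

```python
# Sketch: joint time-domain and spectral-magnitude reconstruction loss.
import torch

def time_freq_loss(estimate, target, n_fft=1024, hop=256, alpha=1.0):
    # Time-domain term: L1 between raw waveforms, shape (batch, samples)
    time_term = (estimate - target).abs().mean()
    # Frequency-domain term: L1 between STFT magnitudes
    window = torch.hann_window(n_fft, device=estimate.device)
    spec = lambda x: torch.stft(x, n_fft, hop_length=hop, window=window,
                                return_complex=True).abs()
    freq_term = (spec(estimate) - spec(target)).abs().mean()
    return time_term + alpha * freq_term
```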
- Music Separation Enhancement with Generative Modeling [11.545349346125743]
We propose Make it Sound Good (MSG), a post-processing model that enhances the output of music source separation systems.
Crowdsourced subjective evaluations demonstrate that human listeners prefer source estimates of bass and drums that have been post-processed by MSG.
arXiv Detail & Related papers (2022-08-26T00:44:37Z)
- Masked Autoencoders that Listen [79.99280830830854]
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms.
Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers.
The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram (patch masking is sketched after this entry).
arXiv Detail & Related papers (2022-07-13T17:59:55Z)
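A hedged sketch of MAE-style random masking of spectrogram patch tokens, following the general recipe of image MAE; the masking ratio and tensor shapes are illustrative, not Audio-MAE's exact code.

```python
# Sketch: keep a random subset of patch tokens for the encoder; return the
# indices needed to restore the original patch order at decode time.
import torch

def random_mask(patch_tokens, mask_ratio=0.8):
    """patch_tokens: (batch, num_patches, dim)."""
    b, n, d = patch_tokens.shape
    n_keep = int(n * (1 - mask_ratio))
    scores = torch.rand(b, n, device=patch_tokens.device)
    shuffle = scores.argsort(dim=1)   # random permutation per example
    restore = shuffle.argsort(dim=1)  # inverse permutation for the decoder
    keep = shuffle[:, :n_keep]
    kept = torch.gather(
        patch_tokens, 1, keep.unsqueeze(-1).expand(-1, -1, d)
    )
    return kept, keep, restore
```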
- SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
- Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Deep Convolutional and Recurrent Networks for Polyphonic Instrument Classification from Monophonic Raw Audio Waveforms [30.3491261167433]
Sound Event Detection and Audio Classification tasks are traditionally addressed through time-frequency representations of audio signals such as spectrograms.
The use of deep neural networks as efficient feature extractors has enabled the direct use of audio signals for classification purposes.
We attempt to recognize musical instruments in polyphonic audio by feeding only their raw waveforms into deep learning models (a minimal raw-waveform classifier is sketched after this entry).
arXiv Detail & Related papers (2021-02-13T13:44:46Z)
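A minimal sketch of a raw-waveform classifier in PyTorch: 1-D convolutions applied directly to audio samples, with a multi-label head since several instruments may be active at once in polyphonic audio. Layer sizes are illustrative assumptions, not the paper's architecture.

```python
# Sketch: 1-D CNN over raw audio samples, no spectrogram front end.
import torch
import torch.nn as nn

class RawAudioCNN(nn.Module):
    def __init__(self, n_instruments):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(32, 64, kernel_size=16, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # pool over time to a fixed size
        )
        # Multi-label head: one logit per instrument
        self.head = nn.Linear(64, n_instruments)

    def forward(self, waveform):  # waveform: (batch, 1, samples)
        return self.head(self.features(waveform).squeeze(-1))

# Usage: logits = RawAudioCNN(11)(torch.randn(2, 1, 44100))
# Train with nn.BCEWithLogitsLoss() for multi-label instrument tagging.
```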
- Conditioning Trick for Training Stable GANs [70.15099665710336]
We propose a conditioning trick, called difference departure from normality, applied to the generator network in response to instability issues during GAN training.
We force the generator to get closer to the departure-from-normality function of real samples, computed in the spectral domain via the Schur decomposition (sketched after this entry).
arXiv Detail & Related papers (2020-10-12T16:50:22Z)
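Henrici's departure from normality is one standard way to quantify how far a matrix is from normal, computed via the Schur decomposition A = Q T Q*: the Frobenius norm of the strictly upper-triangular part of T. Whether this matches the paper's exact "difference departure from normality" is an assumption on our part.

```python
# Sketch: Henrici's departure from normality via complex Schur form.
import numpy as np
from scipy.linalg import schur

def departure_from_normality(A):
    T, _ = schur(A, output='complex')  # T is upper triangular
    off = T - np.diag(np.diag(T))      # strictly upper-triangular part
    return np.linalg.norm(off)         # zero iff A is a normal matrix

# Sanity check: a symmetric (hence normal) matrix has departure ~0
sym = np.array([[2.0, 1.0], [1.0, 3.0]])
assert departure_from_normality(sym) < 1e-8
```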
- Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, conditioned on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.