TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic
Music
- URL: http://arxiv.org/abs/2202.00951v1
- Date: Wed, 2 Feb 2022 10:55:48 GMT
- Title: TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic
Music
- Authors: Ke Chen, Shuai Yu, Cheng-i Wang, Wei Li, Taylor Berg-Kirkpatrick,
Shlomo Dubnov
- Abstract summary: TONet is a plug-and-play model that improves both tone and octave perceptions.
We present an improved input representation, the Tone-CFP, that explicitly groups harmonics.
We also propose a tone-octave fusion mechanism to improve the final salience feature map.
- Score: 43.17623332544677
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Singing melody extraction is an important problem in the field of music
information retrieval. Existing methods typically rely on frequency-domain
representations to estimate the sung frequencies. However, this design does not
lead to human-level performance in the perception of melody information for
both tone (pitch-class) and octave. In this paper, we propose TONet, a
plug-and-play model that improves both tone and octave perceptions by
leveraging a novel input representation and a novel network architecture.
First, we present an improved input representation, the Tone-CFP, that
explicitly groups harmonics via a rearrangement of frequency-bins. Second, we
introduce an encoder-decoder architecture that is designed to obtain a salience
feature map, a tone feature map, and an octave feature map. Third, we propose a
tone-octave fusion mechanism to improve the final salience feature map.
Experiments are done to verify the capability of TONet with various baseline
backbone models. Our results show that tone-octave fusion with Tone-CFP can
significantly improve the singing voice extraction performance across various
datasets -- with substantial gains in octave and tone accuracy.
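
The abstract does not spell out how the Tone-CFP rearrangement is implemented, but the idea of "grouping harmonics via a rearrangement of frequency-bins" can be illustrated with a minimal sketch: if the CFP bins are spaced one semitone apart, bins that share a pitch class (tone) across octaves can be made adjacent, so a local filter sees a tone together with its octave-related harmonics. The bin layout, shapes, and function name below are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def rearrange_cfp_by_tone(cfp, bins_per_octave=12, n_octaves=5):
    """Illustrative Tone-CFP-style rearrangement (assumed layout, not the
    paper's exact scheme): reorder semitone-spaced CFP bins so that all bins
    sharing a pitch class sit next to each other across octaves.

    cfp: (bins_per_octave * n_octaves, n_frames), ordered low to high frequency.
    Returns the same shape, ordered tone-major: C1..C5, C#1..C#5, ..., B1..B5.
    """
    n_bins, n_frames = cfp.shape
    assert n_bins == bins_per_octave * n_octaves
    # View as (octave, tone, time), then swap axes so tone becomes the outer axis.
    return (cfp.reshape(n_octaves, bins_per_octave, n_frames)
               .transpose(1, 0, 2)
               .reshape(n_bins, n_frames))

# Toy usage: 60 semitone bins (5 octaves) by 100 frames.
tone_cfp = rearrange_cfp_by_tone(np.random.rand(60, 100))
print(tone_cfp.shape)  # (60, 100)
```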
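Likewise, the tone-octave fusion mechanism is a learned module in TONet; the outer-product re-weighting below is only an assumed stand-in showing how per-frame tone and octave estimates could be folded back into a refined salience map.

```python
import numpy as np

def fuse_tone_octave(salience, tone_prob, octave_prob):
    """Illustrative fusion (an assumption, not TONet's learned mechanism):
    build a per-frame prior over full-pitch bins from tone and octave
    probabilities and use it to re-weight the raw salience map.

    salience:    (n_octaves * bins_per_octave, n_frames), frequency-ordered
    tone_prob:   (bins_per_octave, n_frames), per-frame pitch-class probabilities
    octave_prob: (n_octaves, n_frames), per-frame octave probabilities
    """
    n_octaves, n_frames = octave_prob.shape
    bins_per_octave = tone_prob.shape[0]
    # prior[o, t, f] = octave_prob[o, f] * tone_prob[t, f]
    prior = np.einsum('of,tf->otf', octave_prob, tone_prob)
    return salience * prior.reshape(n_octaves * bins_per_octave, n_frames)

# Toy usage: 5 octaves x 12 tones = 60 bins, 100 frames.
sal = np.random.rand(60, 100)
tone = np.random.rand(12, 100); tone /= tone.sum(axis=0, keepdims=True)
octv = np.random.rand(5, 100); octv /= octv.sum(axis=0, keepdims=True)
print(fuse_tone_octave(sal, tone, octv).shape)  # (60, 100)
```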
Related papers
- Sine, Transient, Noise Neural Modeling of Piano Notes [0.0]
Three sub-modules learn components from piano recordings and generate harmonic, transient, and noise signals.
From single notes, we emulate the coupling between different keys in trichords with a convolution-based network.
Results show the model matches the partial distribution of the target, while predicting the energy in the higher part of the spectrum remains more challenging.
arXiv Detail & Related papers (2024-09-10T13:48:18Z)
- Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction [36.45127093978295]
We propose an input feature modification and a training objective modification based on two assumptions.
To enhance the model's sensitivity to the trailing harmonics, we modify the Combined Frequency and Periodicity representation using the discrete z-transform.
We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, modified from a piano transcription network.
arXiv Detail & Related papers (2023-08-04T21:59:40Z)
- Multitrack Music Transcription with a Time-Frequency Perceiver [6.617487928813374]
Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously.
We propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription.
arXiv Detail & Related papers (2023-06-19T08:58:26Z)
- Melody transcription via generative pre-training [86.08508957229348]
A key challenge in melody transcription is building methods that can handle broad audio containing any number of instrument ensembles and musical styles.
To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio.
We derive a new dataset containing $50$ hours of melody transcriptions from crowdsourced annotations of broad music.
arXiv Detail & Related papers (2022-12-04T18:09:23Z)
- Multi-instrument Music Synthesis with Spectrogram Diffusion [19.81982315173444]
We focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in realtime.
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter.
We find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
arXiv Detail & Related papers (2022-06-11T03:26:15Z)
- Pitch-Informed Instrument Assignment Using a Deep Convolutional Network with Multiple Kernel Shapes [22.14133334414372]
This paper proposes a deep convolutional neural network for performing note-level instrument assignment.
Experiments on the MusicNet dataset using 7 instrument classes show that our approach is able to achieve an average F-score of 0.904.
arXiv Detail & Related papers (2021-07-28T19:48:09Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy for the separation performance of state-of-the-art deep learning approaches (a sketch of this oracle appears after this list).
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Hierarchical Timbre-Painting and Articulation Generation [92.59388372914265]
We present a fast and high-fidelity method for music generation, based on specified f0 and loudness.
The synthesized audio mimics the timbre and articulation of a target instrument.
arXiv Detail & Related papers (2020-08-30T05:27:39Z)
- Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods incorporating score information into music performance assessment (MPA) models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z)
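
For background on the oracle principle mentioned in the "Fast accuracy estimation" entry above: a common oracle is the ideal ratio mask, sketched below. The exact mask variant and metric used in that paper may differ; the function and toy usage here are assumptions.

```python
import numpy as np

def ideal_ratio_mask(target_mag, interference_mag, eps=1e-8):
    """One standard ideal-ratio-mask formulation: for each time-frequency bin,
    the fraction of energy belonging to the target source. The cited paper may
    use a different variant."""
    return target_mag**2 / (target_mag**2 + interference_mag**2 + eps)

# Toy usage with random magnitude spectrograms (freq bins x frames).
rng = np.random.default_rng(0)
target = rng.random((513, 100))
interference = rng.random((513, 100))
mask = ideal_ratio_mask(target, interference)
# Applying the oracle mask to the mixture gives an estimate whose quality acts
# as a rough upper bound (proxy) for what a trained separator could achieve.
estimate = mask * (target + interference)
print(float(np.mean((estimate - target) ** 2)))  # oracle reconstruction error
```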
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.