From Note-Level to Chord-Level Neural Network Models for Voice
Separation in Symbolic Music
- URL: http://arxiv.org/abs/2011.03028v1
- Date: Thu, 5 Nov 2020 18:39:42 GMT
- Title: From Note-Level to Chord-Level Neural Network Models for Voice
Separation in Symbolic Music
- Authors: Patrick Gray and Razvan Bunescu
- Abstract summary: We train neural networks that assign notes to voices either separately for each note in a chord (note-level), or jointly to all notes in a chord (chord-level).
Both models surpass a strong baseline based on an iterative application of an envelope extraction function.
The two models are also shown to outperform previous approaches on separating the voices in Bach music.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Music is often experienced as a progression of concurrent streams of notes,
or voices. The degree to which this happens depends on the position along a
voice-leading continuum, ranging from monophonic, to homophonic, to polyphonic,
which complicates the design of automatic voice separation models. We address
this continuum by defining voice separation as the task of decomposing music
into streams that exhibit both a high degree of external perceptual separation
from the other streams and a high degree of internal perceptual consistency.
The proposed voice separation task allows for a voice to diverge to multiple
voices and also for multiple voices to converge to the same voice. Equipped
with this flexible task definition, we manually annotated a corpus of popular
music and used it to train neural networks that assign notes to voices either
separately for each note in a chord (note-level), or jointly to all notes in a
chord (chord-level). The trained neural models greedily assign notes to voices
in a left-to-right traversal of the input chord sequence, using a diverse set
of perceptually informed input features. When evaluated on the extraction of
consecutive within-voice note pairs, both models surpass a strong baseline
based on an iterative application of an envelope extraction function, with the
chord-level model consistently edging out the note-level model. The two models
are also shown to outperform previous approaches on separating the voices in
Bach music.
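To make the greedy decoding concrete, here is a minimal Python sketch of the note-level variant. The `score_fn` interface, the `Note`/`Voice` structures, and the new-voice threshold are illustrative assumptions, not the authors' implementation; the chord-level model would instead score joint assignments of all notes in a chord.

```python
# Minimal sketch of the greedy note-level assignment loop described in the
# abstract. The scoring interface (score_fn) and data structures are
# hypothetical stand-ins, not the authors' implementation.
from dataclasses import dataclass, field

@dataclass
class Note:
    pitch: int     # MIDI pitch
    onset: float   # onset time in beats
    offset: float  # offset time in beats

@dataclass
class Voice:
    notes: list = field(default_factory=list)

def assign_voices(chords, score_fn, threshold=0.5):
    """Greedy left-to-right pass over the chord sequence.

    chords:   time-ordered list of chords, each a list of Note objects
    score_fn: trained model mapping a (voice, note) pair to a
              compatibility score in [0, 1] (assumed interface)
    """
    voices = []
    for chord in chords:
        for note in chord:
            # Score the note against every voice built so far.
            best_score, best_voice = max(
                ((score_fn(v, note), v) for v in voices),
                key=lambda t: t[0],
                default=(0.0, None),
            )
            if best_voice is not None and best_score >= threshold:
                best_voice.notes.append(note)   # continue an existing voice
            else:
                voices.append(Voice([note]))    # start a new voice
    return voices
```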
Related papers
- Cluster and Separate: a GNN Approach to Voice and Staff Prediction for Score Engraving [5.572472212662453]
This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves.
We propose an end-to-end system based on graph neural networks that clusters notes that belong to the same chord and connects them with edges if they are part of the same voice (a toy sketch of this link-scoring idea follows this entry).
arXiv Detail & Related papers (2024-07-15T14:36:13Z)
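As a rough illustration of that link-scoring idea (not the paper's architecture), the toy module below scores whether two note embeddings should be joined by a same-voice edge; all names and layer sizes are assumptions.

```python
# Hedged sketch of pairwise same-voice edge scoring over note embeddings.
# The MLP scorer and all dimensions are illustrative, not the paper's model.
import torch
import torch.nn as nn

class VoiceEdgeScorer(nn.Module):
    def __init__(self, note_dim=8, hidden=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * note_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, src, dst):
        # src, dst: (num_pairs, note_dim) embeddings of candidate note pairs
        return torch.sigmoid(self.mlp(torch.cat([src, dst], dim=-1)))

scorer = VoiceEdgeScorer()
src, dst = torch.randn(4, 8), torch.randn(4, 8)  # toy note embeddings
print(scorer(src, dst).squeeze(-1))  # P(same-voice edge) per candidate pair
```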
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker singing voice synthesis (SVS) system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Musical Voice Separation as Link Prediction: Modeling a Musical Perception Task as a Multi-Trajectory Tracking Problem [6.617487928813374]
This paper targets the perceptual task of separating the different interacting voices, i.e., monophonic melodic streams, in a polyphonic musical piece.
We model this task as a Multi-Trajectory Tracking (MTT) problem from discrete observations, i.e., notes in a pitch-time space.
Our approach builds a graph from a musical piece, creating one node for every note, and separates the melodic trajectories by predicting a link between two notes if they are consecutive in the same voice/stream (a toy sketch follows this entry).
arXiv Detail & Related papers (2023-04-28T13:48:00Z)
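A minimal sketch of that link-following idea: given a pairwise predictor, voices are recovered by greedily chaining each note to its most probable successor. The `link_prob` interface and the greedy successor choice are assumptions for illustration, not the paper's MTT formulation.

```python
# Recover monophonic voice streams from pairwise link predictions.
# link_prob(a, b) is a hypothetical trained predictor giving the
# probability that note b follows note a in the same voice.
def extract_trajectories(notes, link_prob, threshold=0.5):
    """notes: time-ordered list of note ids (here, plain ints)."""
    successors = {}
    used = set()
    for a in notes:
        # pick the best unclaimed later note as a's successor
        candidates = [(link_prob(a, b), b) for b in notes
                      if b > a and b not in used]
        if candidates:
            p, b = max(candidates)
            if p >= threshold:
                successors[a] = b
                used.add(b)
    # follow successor links to assemble the voice streams
    trajectories = []
    for n in (n for n in notes if n not in used):
        traj = [n]
        while traj[-1] in successors:
            traj.append(successors[traj[-1]])
        trajectories.append(traj)
    return trajectories

# toy predictor: link notes whose ids differ by exactly 2
demo = extract_trajectories([0, 1, 2, 3],
                            lambda a, b: 1.0 if b - a == 2 else 0.0)
print(demo)  # [[0, 2], [1, 3]]
```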
- Karaoker: Alignment-free singing voice synthesis with speech training data [3.9795908407245055]
Karaoker is a multispeaker Tacotron-based model conditioned on voice characteristic features.
The model is jointly conditioned with a single deep convolutional encoder on continuous data.
We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks.
arXiv Detail & Related papers (2022-04-08T15:33:59Z)
- VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer [4.167459103689587]
This paper presents an audio-visual approach for voice separation.
It outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice.
arXiv Detail & Related papers (2022-03-08T14:08:47Z)
- A Unified Model for Zero-shot Music Source Separation, Transcription and Synthesis [13.263771543118994]
We propose a unified model for three inter-related tasks: 1) to separate individual sound sources from mixed music audio, 2) to transcribe each sound source to MIDI notes, and 3) to synthesize new pieces based on the timbre of separated sources.
The model is inspired by the fact that when humans listen to music, our minds can not only separate the sounds of different instruments, but also at the same time perceive high-level representations such as score and timbre.
arXiv Detail & Related papers (2021-08-07T14:28:21Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method combines an acoustic model, trained for the task of automatic speech recognition, with features extracted from the melody to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution (a toy quantization step is sketched after this entry).
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
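As a toy illustration of that discrete-latent idea, the snippet below snaps encoder outputs to their nearest codebook vectors. The codebook size and dimensions are made up, and the real model learns the codebook jointly with the encoder.

```python
# Hedged sketch of the vector-quantization step: each encoder output is
# replaced by its nearest codebook entry, yielding a discrete timbre code.
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(64, 16))    # 64 illustrative code vectors, dim 16

def quantize(z):
    """Map encoder outputs z of shape (n, 16) to nearest codebook entries."""
    # squared distances between every latent and every code vector
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)              # discrete code assignments
    return codebook[idx], idx

z = rng.normal(size=(3, 16))            # toy encoder outputs
zq, codes = quantize(z)
print(codes)                            # discrete codes for each input
```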
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)