Musical Voice Separation as Link Prediction: Modeling a Musical
Perception Task as a Multi-Trajectory Tracking Problem
- URL: http://arxiv.org/abs/2304.14848v1
- Date: Fri, 28 Apr 2023 13:48:00 GMT
- Title: Musical Voice Separation as Link Prediction: Modeling a Musical
Perception Task as a Multi-Trajectory Tracking Problem
- Authors: Emmanouil Karystinaios, Francesco Foscarin, Gerhard Widmer
- Abstract summary: This paper targets the perceptual task of separating the different interacting voices, i.e., monophonic melodic streams, in a polyphonic musical piece.
We model this task as a Multi-Trajectory Tracking (MTT) problem from discrete observations, i.e., notes in a pitch-time space.
Our approach builds a graph from a musical piece, by creating one node for every note, and separates the melodic trajectories by predicting a link between two notes if they are consecutive in the same voice/stream.
- Score: 6.617487928813374
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper targets the perceptual task of separating the different
interacting voices, i.e., monophonic melodic streams, in a polyphonic musical
piece. We target symbolic music, where notes are explicitly encoded, and model
this task as a Multi-Trajectory Tracking (MTT) problem from discrete
observations, i.e., notes in a pitch-time space. Our approach builds a graph
from a musical piece, by creating one node for every note, and separates the
melodic trajectories by predicting a link between two notes if they are
consecutive in the same voice/stream. This kind of local, greedy prediction is
made possible by node embeddings created by a heterogeneous graph neural
network that can capture inter- and intra-trajectory information. Furthermore,
we propose a new regularization loss that encourages the output to respect the
MTT premise of at most one incoming and one outgoing link for every node,
favouring monophonic (voice) trajectories; this loss function might also be
useful in other general MTT scenarios. Our approach does not use
domain-specific heuristics, is scalable to longer sequences and a higher number
of voices, and can handle complex cases such as voice inversions and overlaps.
We reach new state-of-the-art results for the voice separation task in
classical music of different styles.
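
To make the link-prediction framing and the monophonicity regularizer concrete, here is a minimal PyTorch sketch. The names (`LinkPredictor`, `monophonic_regularizer`) and the exact form of the penalty are our illustration under stated assumptions, not the paper's released code; in the paper, the note embeddings come from a heterogeneous GNN over the score graph, and the regularization loss may be formulated differently.

```python
# Minimal sketch: voice separation as link prediction between notes.
# Assumes note embeddings are already produced by some GNN encoder;
# all names and the penalty formulation are illustrative.
import torch
import torch.nn as nn

class LinkPredictor(nn.Module):
    """Scores ordered note pairs (i, j): a high logit at (i, j) means
    'note j is the successor of note i in the same voice'."""
    def __init__(self, dim: int):
        super().__init__()
        self.src = nn.Linear(dim, dim)  # projects candidate source notes
        self.dst = nn.Linear(dim, dim)  # projects candidate target notes

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: (num_notes, dim) node embeddings -> (num_notes, num_notes) logits
        return self.src(emb) @ self.dst(emb).T


def monophonic_regularizer(logits: torch.Tensor) -> torch.Tensor:
    """One plausible version of the MTT monophonicity penalty: push the
    total predicted link probability leaving (and entering) each note
    toward at most 1, i.e., at most one outgoing and one incoming link."""
    probs = torch.sigmoid(logits)
    out_excess = torch.clamp(probs.sum(dim=1) - 1.0, min=0.0)  # per-note outgoing
    in_excess = torch.clamp(probs.sum(dim=0) - 1.0, min=0.0)   # per-note incoming
    return (out_excess + in_excess).mean()


# Usage sketch: combine a standard binary link loss with the regularizer.
emb = torch.randn(12, 64)        # 12 notes, 64-dim embeddings (dummy data)
target = torch.zeros(12, 12)     # ground-truth successor-link matrix
predictor = LinkPredictor(64)
logits = predictor(emb)
loss = nn.functional.binary_cross_entropy_with_logits(logits, target) \
     + 0.1 * monophonic_regularizer(logits)
```

At inference time, thresholding (or greedily selecting) the pairwise logits yields the per-voice successor links, from which the monophonic trajectories can be read off by following chains of predicted links.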
Related papers
- Cluster and Separate: a GNN Approach to Voice and Staff Prediction for Score Engraving [5.572472212662453]
This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves.
We propose an end-to-end system based on graph neural networks that clusters notes belonging to the same chord and connects them with edges if they are part of the same voice.
arXiv Detail & Related papers (2024-07-15T14:36:13Z) - Cadence Detection in Symbolic Classical Music using Graph Neural
Networks [7.817685358710508]
We present a graph representation of symbolic scores as an intermediate means to solve the cadence detection task.
We approach cadence detection as an imbalanced node classification problem using a Graph Convolutional Network.
Our experiments suggest that graph convolution can learn non-local features that assist in cadence detection, freeing us from the need of having to devise specialized features that encode non-local context.
arXiv Detail & Related papers (2022-08-31T12:39:57Z) - SinTra: Learning an inspiration model from a single multi-track music
segment [6.841442146760313]
SinTra is an auto-regressive sequential generative model that can learn from a single multi-track music segment.
We show that SinTra can learn from a single music segment more effectively than Music Transformer.
arXiv Detail & Related papers (2022-04-21T07:13:30Z) - TONet: Tone-Octave Network for Singing Melody Extraction from Polyphonic
Music [43.17623332544677]
TONet is a plug-and-play model that improves both tone and octave perceptions.
We present an improved input representation, the Tone-CFP, that explicitly groups harmonics.
We also propose a tone-octave fusion mechanism to improve the final salience feature map.
arXiv Detail & Related papers (2022-02-02T10:55:48Z) - Visual Scene Graphs for Audio Source Separation [65.47212419514761]
State-of-the-art approaches for visually-guided audio source separation typically assume sources that have characteristic sounds, such as musical instruments.
We propose Audio Visual Scene Graph Segmenter (AVSGS), a novel deep learning model that embeds the visual structure of the scene as a graph and segments this graph into subgraphs.
Our pipeline is trained end-to-end via a self-supervised task consisting of separating audio sources using the visual graph from artificially mixed sounds.
arXiv Detail & Related papers (2021-09-24T13:40:51Z) - DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z) - Looking into Your Speech: Learning Cross-modal Affinity for Audio-visual
Speech Separation [73.1652905564163]
We address the problem of separating individual speech signals from videos using audio-visual neural processing.
Most conventional approaches utilize frame-wise matching criteria to extract shared information between co-occurring audio and video.
We propose a cross-modal affinity network (CaffNet) that learns global correspondence as well as locally-varying affinities between audio and visual streams.
arXiv Detail & Related papers (2021-03-25T15:39:12Z) - From Note-Level to Chord-Level Neural Network Models for Voice
Separation in Symbolic Music [0.0]
We train neural networks that assign notes to voices either separately for each note in a chord (note-level), or jointly to all notes in a chord (chord-level).
Both models surpass a strong baseline based on an iterative application of an envelope extraction function.
The two models are also shown to outperform previous approaches on separating the voices in Bach music.
arXiv Detail & Related papers (2020-11-05T18:39:42Z) - Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence
Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z) - Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z) - Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)