Unsupervised Learning of Deep Features for Music Segmentation
- URL: http://arxiv.org/abs/2108.12955v1
- Date: Mon, 30 Aug 2021 01:55:44 GMT
- Title: Unsupervised Learning of Deep Features for Music Segmentation
- Authors: Matthew C. McCallum
- Abstract summary: Music segmentation is a problem of identifying boundaries between, and labeling, distinct music segments.
The performance of a range of music segmentation algorithms has been dependent on the audio features chosen to represent the audio.
In this work, unsupervised training of deep feature embeddings using convolutional neural networks (CNNs) is explored for music segmentation.
- Score: 8.528384027684192
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Music segmentation refers to the dual problem of identifying boundaries
between, and labeling, distinct music segments, e.g., the chorus, verse, bridge
etc. in popular music. The performance of a range of music segmentation
algorithms has been shown to be dependent on the audio features chosen to
represent the audio. Some approaches have proposed learning feature
transformations from music segment annotation data, although, such data is time
consuming or expensive to create and as such these approaches are likely
limited by the size of their datasets. While annotated music segmentation data
is a scarce resource, the amount of available music audio is much greater. In
the neighboring field of semantic audio unsupervised deep learning has shown
promise in improving the performance of solutions to the query-by-example and
sound classification tasks. In this work, unsupervised training of deep feature
embeddings using convolutional neural networks (CNNs) is explored for music
segmentation. The proposed techniques exploit only the time proximity of audio
features that is implicit in any audio timeline. Employing these embeddings in
a classic music segmentation algorithm is shown not only to significantly
improve the performance of this algorithm, but obtain state of the art
performance in unsupervised music segmentation.
Related papers
- LARP: Language Audio Relational Pre-training for Cold-Start Playlist Continuation [49.89372182441713]
We introduce LARP, a multi-modal cold-start playlist continuation model.
Our framework uses increasing stages of task-specific abstraction: within-track (language-audio) contrastive loss, track-track contrastive loss, and track-playlist contrastive loss.
arXiv Detail & Related papers (2024-06-20T14:02:15Z) - WikiMuTe: A web-sourced dataset of semantic descriptions for music audio [7.4327407361824935]
We present WikiMuTe, a new and open dataset containing rich semantic descriptions of music.
The data is sourced from Wikipedia's rich catalogue of articles covering musical works.
We train a model that jointly learns text and audio representations and performs cross-modal retrieval.
arXiv Detail & Related papers (2023-12-14T18:38:02Z) - MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 public-available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z) - GETMusic: Generating Any Music Tracks with a Unified Representation and
Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music.
We introduce a framework known as GETMusic, with GET'' standing for GEnerate music Tracks''
GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time.
Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with any arbitrary source-target track combinations.
arXiv Detail & Related papers (2023-05-18T09:53:23Z) - Symbolic Music Structure Analysis with Graph Representations and
Changepoint Detection Methods [1.1677169430445211]
We propose three methods to segment symbolic music by its form or structure: Norm, G-PELT and G-Window.
We have found that encoding symbolic music with graph representations and computing the novelty of Adjacency Matrices represent the structure of symbolic music pieces well.
arXiv Detail & Related papers (2023-03-24T09:45:11Z) - MATT: A Multiple-instance Attention Mechanism for Long-tail Music Genre
Classification [1.8275108630751844]
Imbalanced music genre classification is a crucial task in the Music Information Retrieval (MIR) field.
Most of the existing models are designed for class-balanced music datasets.
We propose a novel mechanism named Multi-instance Attention (MATT) to boost the performance for identifying tail classes.
arXiv Detail & Related papers (2022-09-09T03:52:44Z) - MusCaps: Generating Captions for Music Audio [14.335950077921435]
We present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention.
Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs.
Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding.
arXiv Detail & Related papers (2021-04-24T16:34:47Z) - Artificially Synthesising Data for Audio Classification and Segmentation
to Improve Speech and Music Detection in Radio Broadcast [0.0]
We present a novel procedure that artificially synthesises data that resembles radio signals.
We trained a Convolutional Recurrent Neural Network (CRNN) on this synthesised data and outperformed state-of-the-art algorithms for music-speech detection.
arXiv Detail & Related papers (2021-02-19T14:47:05Z) - dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z) - DenoiSeg: Joint Denoising and Segmentation [75.91760529986958]
We propose DenoiSeg, a new method that can be trained end-to-end on only a few annotated ground truth segmentations.
We achieve this by extending Noise2Void, a self-supervised denoising scheme that can be trained on noisy images alone, to also predict dense 3-class segmentations.
arXiv Detail & Related papers (2020-05-06T17:42:54Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.