On the Effect of Data-Augmentation on Local Embedding Properties in the
Contrastive Learning of Music Audio Representations
- URL: http://arxiv.org/abs/2401.08889v1
- Date: Wed, 17 Jan 2024 00:12:13 GMT
- Title: On the Effect of Data-Augmentation on Local Embedding Properties in the
Contrastive Learning of Music Audio Representations
- Authors: Matthew C. McCallum, Matthew E. P. Davies, Florian Henkel, Jaehun Kim,
Samuel E. Sandberg
- Abstract summary: We show that musical properties that are homogeneous within a track are reflected in the locality of neighborhoods in the resulting embedding space.
We show that the optimal selection of data augmentation strategies for contrastive learning of music audio embeddings is dependent on the downstream task.
- Score: 6.255143207183722
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Audio embeddings are crucial tools in understanding large catalogs of music.
Typically, embeddings are evaluated on the basis of the performance they
provide in a wide range of downstream tasks; however, few studies have
investigated the local properties of the embedding spaces themselves, which are
important in nearest neighbor algorithms commonly used in music search and
recommendation.
In this work we show that when learning audio representations on music datasets
via contrastive learning, musical properties that are typically homogeneous
within a track (e.g., key and tempo) are reflected in the locality of
neighborhoods in the resulting embedding space. By applying appropriate data
augmentation strategies, the localisation of such properties can not only be
reduced, but the localisation of other attributes can be increased. For
example, the locality of features such as pitch and tempo that are less
relevant to non-expert listeners may be mitigated while improving the locality
of more salient features such as genre and mood, achieving state-of-the-art
performance
in nearest neighbor retrieval accuracy. Similarly, we show that the optimal
selection of data augmentation strategies for contrastive learning of music
audio embeddings is dependent on the downstream task, highlighting this as an
important embedding design decision.
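The abstract couples two ideas: a contrastive objective trained on augmented views of the same track, and a nearest-neighbor notion of how strongly a property such as key or tempo is localised in the embedding space. The sketch below is a minimal, NumPy-only illustration of both, assuming generic names and shapes, not the paper's actual code: a standard NT-Xent contrastive loss of the kind used in contrastive audio representation learning, and a simple k-NN label-homogeneity probe.

```python
import numpy as np

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent contrastive loss over two augmented views of the same tracks.

    z1, z2: (n, d) embeddings of view 1 and view 2 of n tracks.
    Each sample's positive is its other view; the remaining 2n - 2
    samples in the batch act as negatives.
    """
    z = np.concatenate([z1, z2], axis=0)                # (2n, d)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # cosine geometry
    sim = (z @ z.T) / temperature                       # (2n, 2n) similarities
    np.fill_diagonal(sim, -np.inf)                      # exclude self-similarity
    n = z1.shape[0]
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])  # positive indices
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()

def knn_homogeneity(embeddings, labels, k=5):
    """Fraction of each point's k nearest neighbors sharing its label.

    A high value for key or tempo labels means that property is strongly
    localised in embedding neighborhoods; per the abstract, augmentations
    such as pitch shifting or time stretching should reduce it.
    """
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sim = x @ x.T
    np.fill_diagonal(sim, -np.inf)                      # a point is not its own neighbor
    nn = np.argsort(-sim, axis=1)[:, :k]                # k nearest by cosine similarity
    return (labels[nn] == labels[:, None]).mean()
```

Comparing knn_homogeneity over key labels for models trained with and without pitch-based augmentation would reproduce the kind of locality analysis the abstract describes.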
Related papers
- Music Auto-Tagging with Robust Music Representation Learned via Domain Adversarial Training [18.71152526968065]
Existing models in Music Information Retrieval (MIR) struggle with real-world noise such as environmental and speech sounds in multimedia content.
This study proposes a method inspired by speech-related tasks to enhance music auto-tagging performance in noisy settings.
arXiv Detail & Related papers (2024-01-27T06:56:51Z)
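The entry above names domain adversarial training without detailing it. Below is a minimal PyTorch sketch of the gradient reversal layer, the canonical building block of such methods; the encoder, heads, and criteria referenced in the comments are assumed components, not the paper's.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the
    backward pass, so the shared encoder learns features that confuse the
    domain classifier while the tagging head trains normally."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None  # no gradient for lam itself

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Hypothetical training step with assumed components:
# feats = encoder(audio)                                  # shared features
# tag_loss = tag_criterion(tagger(feats), tags)           # auto-tagging task
# dom_loss = dom_criterion(domain_head(grad_reverse(feats)), domain)  # clean vs. noisy
# (tag_loss + dom_loss).backward()                        # adversarial via reversed grads
```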
- Perceptual Musical Features for Interpretable Audio Tagging [2.1730712607705485]
This study explores the relevance of interpretability in the context of automatic music tagging.
We constructed a workflow that incorporates three different information extraction techniques.
We conducted experiments on two datasets, namely the MTG-Jamendo dataset and the GTZAN dataset.
arXiv Detail & Related papers (2023-12-18T14:31:58Z)
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z)
- Adaptive Local-Component-aware Graph Convolutional Network for One-shot Skeleton-based Action Recognition [54.23513799338309]
We present an Adaptive Local-Component-aware Graph Convolutional Network for skeleton-based action recognition.
Our method provides a stronger representation than the global embedding and helps our model reach state-of-the-art performance.
arXiv Detail & Related papers (2022-09-21T02:33:07Z)
- Representation Learning for the Automatic Indexing of Sound Effects Libraries [79.68916470119743]
We show that a task-specific but dataset-independent representation can successfully address data issues such as class imbalance, inconsistent class labels, and insufficient dataset size.
Detailed experimental results show the impact of metric learning approaches and different cross-dataset training methods on representational effectiveness.
arXiv Detail & Related papers (2022-08-18T23:46:13Z)
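The sound effects entry above credits metric learning approaches without specifying one. A generic triplet margin loss, the most common such objective, sketched in NumPy under assumed embedding shapes:

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    """Triplet margin loss, a workhorse of metric learning: pull an anchor
    clip towards a positive (same sound-effect class) and push it away
    from a negative (different class). A generic sketch; the entry does
    not say which metric learning objective the authors actually use.

    anchor, positive, negative: (n, d) batches of embeddings.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return np.maximum(d_pos - d_neg + margin, 0.0).mean()
```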
- Multi-task Learning with Metadata for Music Mood Classification [0.0]
Mood recognition is an important problem in music informatics and has key applications in music discovery and recommendation.
We propose a multi-task learning approach in which a shared model is simultaneously trained for mood and metadata prediction tasks.
Applying our technique to existing state-of-the-art convolutional neural networks for mood classification consistently improves their performance.
arXiv Detail & Related papers (2021-10-10T11:36:34Z)
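As a plausible reading of the mood entry above ("a shared model is simultaneously trained for mood and metadata prediction"), here is a minimal PyTorch sketch of a shared backbone with one head per task; the backbone, dimensions, and loss weighting are assumptions, not the authors' architecture.

```python
import torch.nn as nn

class MoodAndMetadataModel(nn.Module):
    """Shared trunk with one linear head per task, so gradients from the
    auxiliary metadata task shape the representation used for mood."""

    def __init__(self, backbone, feat_dim, n_moods, n_meta):
        super().__init__()
        self.backbone = backbone                 # e.g. an existing mood CNN
        self.mood_head = nn.Linear(feat_dim, n_moods)
        self.meta_head = nn.Linear(feat_dim, n_meta)

    def forward(self, x):
        h = self.backbone(x)                     # shared representation
        return self.mood_head(h), self.meta_head(h)

# Joint objective, with an assumed weight alpha on the auxiliary task:
# mood_logits, meta_logits = model(spectrogram)
# loss = mood_loss(mood_logits, y_mood) + alpha * meta_loss(meta_logits, y_meta)
```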
- Unsupervised Learning of Deep Features for Music Segmentation [8.528384027684192]
Music segmentation is a problem of identifying boundaries between, and labeling, distinct music segments.
The performance of a range of music segmentation algorithms has been dependent on the audio features chosen to represent the audio.
In this work, unsupervised training of deep feature embeddings using convolutional neural networks (CNNs) is explored for music segmentation.
arXiv Detail & Related papers (2021-08-30T01:55:44Z)
- dMelodies: A Music Dataset for Disentanglement Learning [70.90415511736089]
We present a new symbolic music dataset that will help researchers demonstrate the efficacy of their algorithms on diverse domains.
This will also provide a means for evaluating algorithms specifically designed for music.
The dataset is large enough (approx. 1.3 million data points) to train and test deep networks for disentanglement learning.
arXiv Detail & Related papers (2020-07-29T19:20:07Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
- Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show advantages in terms of compact feature dimensionality and improved computational speed at test time.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
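The audio impairment entry above describes a representation built from the correlation of feature pairs. A NumPy sketch of that idea under an assumed input layout; the actual feature set and pairing scheme in the paper may differ.

```python
import numpy as np

def correlation_representation(frames):
    """Represent a clip by the pairwise correlations of its hand-crafted
    feature trajectories, yielding a compact, fixed-size vector.

    frames: (n_frames, n_features) per-frame hand-crafted features.
    Returns the strict upper triangle of the feature-by-feature
    correlation matrix (each unordered feature pair appears once).
    """
    corr = np.corrcoef(frames, rowvar=False)   # (n_features, n_features)
    iu = np.triu_indices_from(corr, k=1)       # unique off-diagonal pairs
    return corr[iu]
```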
This list is automatically generated from the titles and abstracts of the papers on this site.