Barwise Compression Schemes for Audio-Based Music Structure Analysis
- URL: http://arxiv.org/abs/2202.04981v1
- Date: Thu, 10 Feb 2022 12:23:57 GMT
- Title: Barwise Compression Schemes for Audio-Based Music Structure Analysis
- Authors: Axel Marmoret, Jérémy E. Cohen, Frédéric Bimbot
- Abstract summary: Music Structure Analysis (MSA) consists in segmenting a music piece into several distinct sections.
We approach MSA within a compression framework, under the hypothesis that the structure is more easily revealed by a simplified representation of the original content of the song.
In our experiments, several unsupervised compression schemes achieve a level of performance comparable to that of state-of-the-art supervised methods.
- Score: 4.39160562548524
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Music Structure Analysis (MSA) consists in segmenting a music piece into
several distinct sections. We approach MSA within a compression framework,
under the hypothesis that the structure is more easily revealed by a simplified
representation of the original content of the song.
More specifically, under the hypothesis that MSA is correlated with
similarities occurring at the bar scale, linear and non-linear compression
schemes can be applied to barwise audio signals. Compressed representations
capture the most salient components of the different bars in the song and are
then used to infer the song structure using a dynamic programming algorithm.
This work explores both low-rank approximation models, such as Principal
Component Analysis or Nonnegative Matrix Factorization, and "piece-specific"
Auto-Encoding Neural Networks, with the objective of learning latent
representations specific to a given song. Such approaches rely on neither
supervision nor annotations, which are well known to be tedious to collect and
potentially ambiguous in MSA.
In our experiments, several unsupervised compression schemes achieve a level
of performance comparable to that of state-of-the-art supervised methods (for
3s tolerance) on the RWC-Pop dataset, showcasing the importance of the barwise
compression processing for MSA.
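The pipeline described in the abstract (compress each bar, then infer boundaries by dynamic programming) can be illustrated with a minimal sketch. This is not the authors' implementation: the synthetic barwise matrix, the function names, the segment score (average within-segment cosine similarity) and the `penalty` and `max_len` parameters are all assumptions chosen for illustration, and PCA stands in for any of the compression schemes mentioned.

```python
import numpy as np

def pca_compress(bars, rank):
    """Project each bar (one row per bar) onto the top `rank` principal components."""
    centered = bars - bars.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions of the centered barwise matrix.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:rank].T

def autosimilarity(latent):
    """Cosine similarity between every pair of compressed bars."""
    norms = np.linalg.norm(latent, axis=1, keepdims=True)
    unit = latent / np.maximum(norms, 1e-12)
    return unit @ unit.T

def segment(sim, max_len=16, penalty=0.5):
    """Dynamic programming over bar indices.

    A segment [start, end) scores the sum of its similarity block divided by
    its length, minus a fixed per-segment penalty (hypothetical cost, chosen
    here for simplicity). Returns the boundaries that maximise the total score.
    """
    n = sim.shape[0]
    best = np.full(n + 1, -np.inf)   # best[e]: best total score for bars [0, e)
    best[0] = 0.0
    prev = np.zeros(n + 1, dtype=int)  # prev[e]: start of the last segment ending at e
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            block = sim[start:end, start:end]
            score = best[start] + block.sum() / (end - start) - penalty
            if score > best[end]:
                best[end] = score
                prev[end] = start
    # Backtrack from the end of the song to recover the boundaries.
    bounds = [n]
    while bounds[-1] > 0:
        bounds.append(int(prev[bounds[-1]]))
    return bounds[::-1]

# Synthetic "song": three sections of 8 bars each, every bar a noisy copy
# of its section's pattern (40 made-up features per bar).
rng = np.random.default_rng(0)
patterns = rng.normal(size=(3, 40))
bars = np.repeat(patterns, 8, axis=0) + 0.01 * rng.normal(size=(24, 40))

latent = pca_compress(bars, rank=2)
boundaries = segment(autosimilarity(latent))
print(boundaries)
```

On real audio, each row would instead hold the flattened features of one bar (for example a spectrogram slice aligned on downbeats), and the segment score would need to be tuned to the data.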
Related papers
- A Refreshed Similarity-based Upsampler for Direct High-Ratio Feature Upsampling [54.05517338122698]
We propose an explicitly controllable query-key feature alignment from both semantic-aware and detail-aware perspectives.
We also develop a fine-grained neighbor selection strategy on HR features, which is simple yet effective for alleviating mosaic artifacts.
Our proposed ReSFU framework consistently achieves satisfactory performance on different segmentation applications.
arXiv Detail & Related papers (2024-07-02T14:12:21Z)
- Self-Similarity-Based and Novelty-based loss for music structure analysis [5.3900692419866285]
We propose a supervised approach for the task of music boundary detection.
In our approach we simultaneously learn features and convolution kernels.
We demonstrate that relative feature learning, through self-attention, is beneficial for the task of MSA.
arXiv Detail & Related papers (2023-09-05T13:49:29Z)
- Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding [57.08832099075793]
Visually-guided sound source separation consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing.
This paper presents audio-visual predictive coding (AVPC) to tackle this task in a parameter-harmonizing and more effective manner.
In addition, we develop a valid self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source.
arXiv Detail & Related papers (2023-06-19T03:10:57Z)
- Symbolic Music Structure Analysis with Graph Representations and Changepoint Detection Methods [1.1677169430445211]
We propose three methods to segment symbolic music by its form or structure: Norm, G-PELT and G-Window.
We have found that encoding symbolic music with graph representations and computing the novelty of Adjacency Matrices represent the structure of symbolic music pieces well.
arXiv Detail & Related papers (2023-03-24T09:45:11Z)
- SegViT: Semantic Segmentation with Plain Vision Transformers [91.50075506561598]
We explore the capability of plain Vision Transformers (ViTs) for semantic segmentation.
We propose the Attention-to-Mask (ATM) module, in which similarity maps between a set of learnable class tokens and the spatial feature maps are transferred to the segmentation masks.
Experiments show that our proposed SegViT using the ATM module outperforms its counterparts using the plain ViT backbone.
arXiv Detail & Related papers (2022-10-12T00:30:26Z)
- Self-Supervised Representation Learning With MUlti-Segmental Informational Coding (MUSIC) [6.693379403133435]
Self-supervised representation learning maps high-dimensional data into a meaningful embedding space.
We propose MUlti-Segmental Informational Coding (MUSIC) for self-supervised representation learning.
arXiv Detail & Related papers (2022-06-13T20:37:48Z)
- Exploring single-song autoencoding schemes for audio-based music structure analysis [6.037383467521294]
This work explores a "piece-specific" autoencoding scheme, in which a low-dimensional autoencoder is trained to learn a latent/compressed representation specific to a given song.
We report that the proposed unsupervised auto-encoding scheme achieves the level of performance of supervised state-of-the-art methods with 3 seconds tolerance.
arXiv Detail & Related papers (2021-10-27T13:48:25Z)
- Video Coding for Machine: Compact Visual Representation Compression for Intelligent Collaborative Analytics [101.35754364753409]
Video Coding for Machines (VCM) is committed to bridging, to an extent, the separate research tracks of video/image compression and feature compression.
This paper summarizes VCM methodology and philosophy based on existing academic and industrial efforts.
arXiv Detail & Related papers (2021-10-18T12:42:13Z)
- Revisit Visual Representation in Analytics Taxonomy: A Compression Perspective [69.99087941471882]
We study the problem of supporting multiple machine vision analytics tasks with the compressed visual representation.
By utilizing the intrinsic transferability among different tasks, our framework successfully constructs compact and expressive representations at low bit-rates.
In order to impose compactness in the representations, we propose a codebook-based hyperprior.
arXiv Detail & Related papers (2021-06-16T01:44:32Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Uncovering audio patterns in music with Nonnegative Tucker Decomposition for structural segmentation [0.0]
The present work investigates the ability of Nonnegative Tucker Decomposition (NTD) to uncover musical patterns and structure in pop songs in their audio form.
Exploiting the fact that NTD tends to express the content of bars as linear combinations of a few patterns, we illustrate the ability of the decomposition to capture and single out repeated motifs in the corresponding compressed space.
arXiv Detail & Related papers (2021-04-17T15:48:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.