Music Auto-Tagging with Robust Music Representation Learned via Domain
Adversarial Training
- URL: http://arxiv.org/abs/2401.15323v1
- Date: Sat, 27 Jan 2024 06:56:51 GMT
- Title: Music Auto-Tagging with Robust Music Representation Learned via Domain
Adversarial Training
- Authors: Haesun Joung, Kyogu Lee
- Abstract summary: Existing models in Music Information Retrieval (MIR) struggle with real-world noise such as environmental and speech sounds in multimedia content.
This study proposes a method inspired by speech-related tasks to enhance music auto-tagging performance in noisy settings.
- Score: 18.71152526968065
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Music auto-tagging is crucial for enhancing music discovery and
recommendation. Existing models in Music Information Retrieval (MIR) struggle
with real-world noise such as environmental and speech sounds in multimedia
content. This study proposes a method inspired by speech-related tasks to
enhance music auto-tagging performance in noisy settings. The approach
integrates Domain Adversarial Training (DAT) into the music domain, enabling
robust music representations that withstand noise. Unlike previous research,
this approach involves an additional pretraining phase for the domain
classifier, to avoid performance degradation in the subsequent phase. Adding
various synthesized noisy music data improves the model's generalization across
different noise levels. The proposed architecture demonstrates enhanced
performance in music auto-tagging by effectively utilizing unlabeled noisy
music data. Additional experiments with supplementary unlabeled data further
improve the model's performance, underscoring its robust generalization
capabilities and broad applicability.
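As a rough illustration of the idea described in the abstract (not the authors' exact architecture), the PyTorch sketch below shows the two ingredients of the approach: a gradient reversal layer that makes the encoder adversarial to a clean-vs-noisy domain classifier, and an SNR-based mixing helper for synthesizing noisy music. The module layout, layer sizes, and the `mix_at_snr` helper are illustrative assumptions.

```python
import torch
import torch.nn as nn


class GradientReversal(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None


def grad_reverse(x, lambd=1.0):
    return GradientReversal.apply(x, lambd)


class DATAutoTagger(nn.Module):
    """Hypothetical encoder + tag head + domain head; layer sizes are illustrative only."""

    def __init__(self, n_tags=50, n_domains=2, feat_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv1d(128, feat_dim, kernel_size=3, padding=1),  # assumes 128-bin mel input
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
            nn.Flatten(),
        )
        self.tag_head = nn.Linear(feat_dim, n_tags)        # multi-label tag logits
        self.domain_head = nn.Linear(feat_dim, n_domains)  # clean vs. noisy domain logits

    def forward(self, mel, lambd=1.0):
        z = self.encoder(mel)
        tag_logits = self.tag_head(z)
        # Gradient reversal pushes the encoder toward domain-invariant (noise-robust) features.
        dom_logits = self.domain_head(grad_reverse(z, lambd))
        return tag_logits, dom_logits


def mix_at_snr(music, noise, snr_db):
    """Mix a noise clip into music at a target SNR (in dB) to synthesize noisy training data."""
    music_power = music.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp_min(1e-8)
    scale = torch.sqrt(music_power / (noise_power * 10 ** (snr_db / 10)))
    return music + scale * noise
```

In a setup like this, the tagging loss on labeled clean music and the domain loss on both clean and noisy (possibly unlabeled) clips would be summed; the additional pretraining phase mentioned in the abstract would correspond to training the domain classifier first, before reversed gradients are allowed to reach the encoder.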
Related papers
- Music Foundation Model as Generic Booster for Music Downstream Tasks [26.09067595520842]
We introduce SoniDo, a music foundation model (MFM) designed to extract hierarchical features from target music samples.
By leveraging hierarchical intermediate features, SoniDo constrains the information granularity, leading to improved performance across various downstream tasks.
arXiv Detail & Related papers (2024-11-02T04:44:27Z)
- MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models [57.47799823804519]
We are inspired by how musicians compose music not just from a movie script, but also through visualizations.
We propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music.
arXiv Detail & Related papers (2024-06-07T06:38:59Z)
- MuPT: A Generative Symbolic Music Pretrained Transformer [56.09299510129221]
We explore the application of Large Language Models (LLMs) to the pre-training of music.
To address the challenges associated with misaligned measures from different tracks during generation, we propose a Synchronized Multi-Track ABC Notation (SMT-ABC Notation).
Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set.
arXiv Detail & Related papers (2024-04-09T15:35:52Z)
- DITTO: Diffusion Inference-Time T-Optimization for Music Generation [49.90109850026932]
Diffusion Inference-Time T-Optimization (DITTO) is a framework for controlling pre-trained text-to-music diffusion models at inference time.
We demonstrate a surprisingly wide range of applications for music generation, including inpainting, outpainting, and looping, as well as intensity, melody, and musical structure control.
arXiv Detail & Related papers (2024-01-22T18:10:10Z)
- On the Effect of Data-Augmentation on Local Embedding Properties in the Contrastive Learning of Music Audio Representations [6.255143207183722]
We show that musical properties that are homogeneous within a track are reflected in the locality of neighborhoods in the resulting embedding space.
We show that the optimal selection of data augmentation strategies for contrastive learning of music audio embeddings is dependent on the downstream task.
arXiv Detail & Related papers (2024-01-17T00:12:13Z)
- Exploiting Time-Frequency Conformers for Music Audio Enhancement [21.243039524049614]
We propose a music enhancement system based on the Conformer architecture.
Our approach explores the attention mechanisms of the Conformer and examines their performance to discover the best approach for the music enhancement task.
arXiv Detail & Related papers (2023-08-24T06:56:54Z)
- MARBLE: Music Audio Representation Benchmark for Universal Evaluation [79.25065218663458]
We introduce the Music Audio Representation Benchmark for universaL Evaluation, termed MARBLE.
It aims to provide a benchmark for various Music Information Retrieval (MIR) tasks by defining a comprehensive taxonomy with four hierarchy levels, including acoustic, performance, score, and high-level description.
We then establish a unified protocol based on 14 tasks on 8 publicly available datasets, providing a fair and standard assessment of representations of all open-sourced pre-trained models developed on music recordings as baselines.
arXiv Detail & Related papers (2023-06-18T12:56:46Z)
- MusCaps: Generating Captions for Music Audio [14.335950077921435]
We present the first music audio captioning model, MusCaps, consisting of an encoder-decoder with temporal attention.
Our method combines convolutional and recurrent neural network architectures to jointly process audio-text inputs.
Our model represents a shift away from classification-based music description and combines tasks requiring both auditory and linguistic understanding.
arXiv Detail & Related papers (2021-04-24T16:34:47Z)
- Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
- Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
- Learning Style-Aware Symbolic Music Representations by Adversarial Autoencoders [9.923470453197657]
We focus on leveraging adversarial regularization as a flexible and natural means to imbue variational autoencoders with context information.
We introduce the first Music Adversarial Autoencoder (MusAE).
Our model has a higher reconstruction accuracy than state-of-the-art models based on standard variational autoencoders.
arXiv Detail & Related papers (2020-01-15T18:07:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.