Contrastive timbre representations for musical instrument and synthesizer retrieval
- URL: http://arxiv.org/abs/2509.13285v1
- Date: Tue, 16 Sep 2025 17:38:35 GMT
- Title: Contrastive timbre representations for musical instrument and synthesizer retrieval
- Authors: Gwendal Le Vaillant, Yannick Molle
- Abstract summary: This paper introduces a contrastive learning framework for musical instrument retrieval. It enables direct querying of instrument databases using a single model for both single- and multi-instrument sounds.
- Score: 1.3750624267664158
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Efficiently retrieving specific instrument timbres from audio mixtures remains a challenge in digital music production. This paper introduces a contrastive learning framework for musical instrument retrieval, enabling direct querying of instrument databases using a single model for both single- and multi-instrument sounds. We propose techniques to generate realistic positive/negative pairs of sounds for virtual musical instruments, such as samplers and synthesizers, addressing limitations in common audio data augmentation methods. The first experiment focuses on instrument retrieval from a dataset of 3,884 instruments, using single-instrument audio as input. Contrastive approaches are competitive with previous works based on classification pre-training. The second experiment considers multi-instrument retrieval with a mixture of instruments as audio input. In this case, the proposed contrastive framework outperforms related works, achieving 81.7% top-1 and 95.7% top-5 accuracies for three-instrument mixtures.
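The abstract does not spell out the training objective. As a rough illustration only, the NT-Xent (InfoNCE) loss commonly used in contrastive frameworks of this kind can be sketched in NumPy; the function name, batch layout, and temperature below are illustrative assumptions, not the paper's exact setup:

```python
import numpy as np

def nt_xent_loss(z_a, z_b, temperature=0.1):
    """Illustrative NT-Xent loss over a batch of N positive pairs.

    z_a, z_b: (N, D) embeddings of two "views" of the same N instruments
    (e.g. two renderings of the same synthesizer preset). Every other
    in-batch pairing serves as a negative.
    """
    # L2-normalize so dot products are cosine similarities
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    z = np.concatenate([z_a, z_b], axis=0)          # (2N, D)
    sim = z @ z.T / temperature                     # pairwise similarities
    np.fill_diagonal(sim, -np.inf)                  # exclude self-similarity
    n = len(z_a)
    # the positive for row i is row i+n, and vice versa
    targets = np.concatenate([np.arange(n) + n, np.arange(n)])
    # cross-entropy via a numerically stable log-sum-exp per row
    row_max = sim.max(axis=1, keepdims=True)
    logsumexp = row_max.ravel() + np.log(np.exp(sim - row_max).sum(axis=1))
    return -(sim[np.arange(2 * n), targets] - logsumexp).mean()
```

In the spirit of the paper's positive/negative pair generation, a positive pair could be two different renderings of the same sampler or synthesizer preset, with the other presets in the batch acting as negatives.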
Related papers
- Music Boomerang: Reusing Diffusion Models for Data Augmentation and Audio Manipulation [49.062766449989525]
Generative models of music audio are typically used to generate output based solely on a text prompt or melody. Boomerang sampling, recently proposed for the image domain, allows generating output close to an existing example, using any pretrained diffusion model.
arXiv Detail & Related papers (2025-07-07T10:46:07Z) - Unleashing the Power of Natural Audio Featuring Multiple Sound Sources [54.38251699625379]
Universal sound separation aims to extract clean audio tracks corresponding to distinct events from mixed audio. We propose ClearSep, a framework that employs a data engine to decompose complex naturally mixed audio into multiple independent tracks. In experiments, ClearSep achieves state-of-the-art performance across multiple sound separation tasks.
arXiv Detail & Related papers (2025-04-24T17:58:21Z) - Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models [2.3749120526936465]
We propose and investigate the use of neural audio language models for the automatic generation of sample-based musical instruments.
Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding.
arXiv Detail & Related papers (2024-07-22T13:59:58Z) - Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z) - Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task [86.72661027591394]
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
arXiv Detail & Related papers (2022-11-21T07:19:17Z) - Show Me the Instruments: Musical Instrument Retrieval from Mixture Audio [11.941510958668557]
We call this task Musical Instrument Retrieval.
We propose a method for retrieving desired musical instruments using reference music mixture as a query.
The proposed model consists of the Single-Instrument Encoder and the Multi-Instrument Encoder, both based on convolutional neural networks.
arXiv Detail & Related papers (2022-11-15T07:32:39Z) - Multi-instrument Music Synthesis with Spectrogram Diffusion [19.81982315173444]
We focus on a middle ground of neural synthesizers that can generate audio from MIDI sequences with arbitrary combinations of instruments in real time.
We use a simple two-stage process: MIDI to spectrograms with an encoder-decoder Transformer, then spectrograms to audio with a generative adversarial network (GAN) spectrogram inverter.
We find this to be a promising first step towards interactive and expressive neural synthesis for arbitrary combinations of instruments and notes.
arXiv Detail & Related papers (2022-06-11T03:26:15Z) - Towards Automatic Instrumentation by Learning to Separate Parts in Symbolic Multitrack Music [33.679951600368405]
We study the feasibility of automatic instrumentation, i.e. dynamically assigning instruments to notes in solo music during performance.
In addition to the online, real-time-capable setting for performative use cases, automatic instrumentation can also find applications in assistive composing tools in an offline setting.
We frame the task of part separation as a sequential multi-class classification problem and adopt machine learning to map sequences of notes into sequences of part labels.
arXiv Detail & Related papers (2021-07-13T08:34:44Z) - Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z) - Time-Frequency Scattering Accurately Models Auditory Similarities Between Instrumental Playing Techniques [5.923588533979649]
We show that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone.
We propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques.
arXiv Detail & Related papers (2020-07-21T16:37:15Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)
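The fast accuracy estimation entry above builds on an oracle ideal ratio mask. As a minimal sketch of the classical magnitude-domain IRM (the array layout, function name, and epsilon are my own assumptions, not that paper's exact implementation):

```python
import numpy as np

def ideal_ratio_mask(source_mags, eps=1e-8):
    """Oracle ideal ratio masks from true source magnitude spectrograms.

    source_mags: (n_sources, freq, time) non-negative magnitudes.
    Each mask is the fraction of mixture magnitude attributed to one
    source; applying it to the mixture spectrogram gives an upper-bound
    ("oracle") separation estimate, since it uses ground-truth sources.
    """
    total = source_mags.sum(axis=0, keepdims=True) + eps  # eps avoids 0/0
    return source_mags / total
```

By construction the masks lie in [0, 1] and sum to (nearly) one across sources at every time-frequency bin, which is what makes them a convenient proxy for the best separation a mask-based model could achieve.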
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences.