Time-Frequency Scattering Accurately Models Auditory Similarities
Between Instrumental Playing Techniques
- URL: http://arxiv.org/abs/2007.10926v2
- Date: Tue, 10 Nov 2020 17:36:37 GMT
- Title: Time-Frequency Scattering Accurately Models Auditory Similarities
Between Instrumental Playing Techniques
- Authors: Vincent Lostanlen, Christian El-Hajj, Mathias Rossignol, Grégoire
Lafay, Joakim Andén and Mathieu Lagrange
- Abstract summary: We show that timbre perception operates within a more flexible taxonomy than those provided by instruments or playing techniques alone.
We propose a machine listening model to recover the cluster graph of auditory similarities across instruments, mutes, and techniques.
- Score: 5.923588533979649
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instrumental playing techniques such as vibratos, glissandos, and trills
often denote musical expressivity, both in classical and folk contexts.
However, most existing approaches to music similarity retrieval fail to
describe timbre beyond the so-called "ordinary" technique, use instrument
identity as a proxy for timbre quality, and do not allow for customization to
the perceptual idiosyncrasies of a new subject. In this article, we ask 31
human subjects to organize 78 isolated notes into a set of timbre clusters.
Analyzing their responses suggests that timbre perception operates within a
more flexible taxonomy than those provided by instruments or playing techniques
alone. In addition, we propose a machine listening model to recover the cluster
graph of auditory similarities across instruments, mutes, and techniques. Our
model relies on joint time-frequency scattering to extract spectrotemporal
modulations as acoustic features. Furthermore, it minimizes a triplet loss in
the cluster graph by means of the large-margin nearest neighbor (LMNN) metric
learning algorithm. Over a dataset of 9346 isolated notes, we report a
state-of-the-art average precision at rank five (AP@5) of 99.0% ± 1. An
ablation study demonstrates that removing either the joint time-frequency
scattering transform or the metric learning algorithm noticeably degrades
performance.
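To make the two-stage pipeline concrete, the sketch below shows how scattering-style features could be combined with LMNN metric learning and a rank-5 retrieval score. It is a minimal illustration rather than the authors' implementation: `jtfs_features` is a hypothetical placeholder for a joint time-frequency scattering transform (e.g., Kymatio's implementation), LMNN is taken from the metric-learn package, and the evaluation is a simplified precision-at-rank-5 proxy for the paper's AP@5.

```python
# Minimal sketch (not the authors' code): scattering-style features + LMNN
# metric learning + a precision-at-rank-5 retrieval score.
# Assumptions: the metric-learn and scikit-learn packages are installed, and
# jtfs_features() stands in for a real joint time-frequency scattering
# transform followed by log-compression and time averaging.

import numpy as np
from metric_learn import LMNN                  # large-margin nearest neighbor learner
from sklearn.neighbors import NearestNeighbors


def jtfs_features(waveform):
    """Hypothetical placeholder: return a 1-D vector of spectrotemporal
    modulation coefficients for one isolated note."""
    raise NotImplementedError("plug in a joint time-frequency scattering transform")


def precision_at_5(X, labels):
    """Average, over all queries, of the fraction of the 5 nearest neighbors
    (excluding the query itself) that share the query's cluster label."""
    _, idx = NearestNeighbors(n_neighbors=6).fit(X).kneighbors(X)
    hits = labels[idx[:, 1:]] == labels[:, None]   # idx[:, 0] is the query itself
    return float(hits.mean())


# Stand-in data: in the paper, X would hold scattering features of 9346 notes
# and y the cluster labels derived from the human-annotated cluster graph.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 64))
y = rng.integers(0, 10, size=200)

lmnn = LMNN()             # learns a linear map pulling same-cluster neighbors together
lmnn.fit(X, y)            # (a triplet-style large-margin objective)
X_metric = lmnn.transform(X)

print("precision@5 before metric learning:", precision_at_5(X, y))
print("precision@5 after metric learning:", precision_at_5(X_metric, y))
```

In the paper itself, the triplets driving the LMNN objective come from the perceptual cluster graph and AP@5 is computed over the full 9346-note dataset; the random arrays above only illustrate the data flow.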
Related papers
- Why Perturbing Symbolic Music is Necessary: Fitting the Distribution of Never-used Notes through a Joint Probabilistic Diffusion Model [6.085444830169205]
Existing music generation models are mostly language-based, neglecting the frequency continuity property of notes.
We introduce the Music-Diff architecture, which fits a joint distribution of notes and semantic information to generate symbolic music conditionally.
arXiv Detail & Related papers (2024-08-04T07:38:38Z)
- Generating Sample-Based Musical Instruments Using Neural Audio Codec Language Models [2.3749120526936465]
We propose and investigate the use of neural audio language models for the automatic generation of sample-based musical instruments.
Our approach extends a generative audio framework to condition on pitch across an 88-key spectrum, velocity, and a combined text/audio embedding.
arXiv Detail & Related papers (2024-07-22T13:59:58Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Synthia's Melody: A Benchmark Framework for Unsupervised Domain Adaptation in Audio [4.537310370334197]
We present Synthia's melody, a novel audio data generation framework capable of simulating an infinite variety of 4-second melodies.
Unlike existing datasets collected under observational settings, Synthia's melody is free of unobserved biases.
Our evaluations reveal that Synthia's melody provides a robust testbed for examining the susceptibility of these models to varying levels of distribution shift.
arXiv Detail & Related papers (2023-09-26T15:46:06Z)
- Self-supervised Auxiliary Loss for Metric Learning in Music Similarity-based Retrieval and Auto-tagging [0.0]
We propose a model that builds on the self-supervised learning approach to address the similarity-based retrieval challenge.
We also found that refraining from employing augmentation during the fine-tuning phase yields better results.
arXiv Detail & Related papers (2023-04-15T02:00:28Z)
- Anomalous Sound Detection using Audio Representation with Machine ID based Contrastive Learning Pretraining [52.191658157204856]
This paper uses contrastive learning to refine audio representations for each machine ID, rather than for each audio sample.
The proposed two-stage method uses contrastive learning to pretrain the audio representation model.
Experiments show that our method outperforms the state-of-the-art methods using contrastive learning or self-supervised classification.
arXiv Detail & Related papers (2023-04-07T11:08:31Z)
- Anomaly Transformer: Time Series Anomaly Detection with Association Discrepancy [68.86835407617778]
Anomaly Transformer achieves state-of-the-art performance on six unsupervised time series anomaly detection benchmarks.
arXiv Detail & Related papers (2021-10-06T10:33:55Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
- Visual Attention for Musical Instrument Recognition [72.05116221011949]
We explore the use of an attention mechanism in a timbral-temporal sense, à la visual attention, to improve the performance of musical instrument recognition.
The first approach applies the attention mechanism to the sliding-window paradigm, where a prediction based on each 'timbral-temporal instance' is given an attention weight before aggregation into the final prediction.
The second approach is based on a recurrent model of visual attention, where the network attends only to parts of the spectrogram and decides where to attend next, given a limited number of 'glimpses'.
arXiv Detail & Related papers (2020-06-17T03:56:44Z)
- Audio Impairment Recognition Using a Correlation-Based Feature Representation [85.08880949780894]
We propose a new representation of hand-crafted features that is based on the correlation of feature pairs.
We show superior performance in terms of compact feature dimensionality and improved computational speed in the test stage.
arXiv Detail & Related papers (2020-03-22T13:34:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.