Synthia's Melody: A Benchmark Framework for Unsupervised Domain
Adaptation in Audio
- URL: http://arxiv.org/abs/2309.15024v1
- Date: Tue, 26 Sep 2023 15:46:06 GMT
- Title: Synthia's Melody: A Benchmark Framework for Unsupervised Domain
Adaptation in Audio
- Authors: Chia-Hsin Lin, Charles Jones, Björn W. Schuller, Harry Coppock
- Abstract summary: We present Synthia's melody, a novel audio data generation framework capable of simulating an infinite variety of 4-second melodies.
Unlike existing datasets collected under observational settings, Synthia's melody is free of unobserved biases.
Our evaluations reveal that Synthia's melody provides a robust testbed for examining the susceptibility of these models to varying levels of distribution shift.
- Score: 4.537310370334197
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite significant advancements in deep learning for vision and natural
language, unsupervised domain adaptation in audio remains relatively
unexplored. We, in part, attribute this to the lack of an appropriate benchmark
dataset. To address this gap, we present Synthia's melody, a novel audio data
generation framework capable of simulating an infinite variety of 4-second
melodies with user-specified confounding structures characterised by musical
keys, timbre, and loudness. Unlike existing datasets collected under
observational settings, Synthia's melody is free of unobserved biases, ensuring
the reproducibility and comparability of experiments. To showcase its utility,
we generate two types of distribution shifts (domain shift and sample
selection bias) and evaluate the performance of acoustic deep learning
models under these
shifts. Our evaluations reveal that Synthia's melody provides a robust testbed
for examining the susceptibility of these models to varying levels of
distribution shift.
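To make the confounding structure concrete, the sketch below, written under our own assumptions rather than against the framework's actual API, samples a label (major vs. minor key) that is spuriously correlated with a nuisance factor (timbre), then flips that correlation between train and test to create a shift:

```python
import numpy as np

def sample_confounded_metadata(n, p_major_given_bright=0.9, seed=0):
    """Toy confounding structure: the key label is spuriously correlated
    with a 'bright timbre' nuisance factor. Illustrative only."""
    rng = np.random.default_rng(seed)
    bright = rng.random(n) < 0.5                      # nuisance factor
    p_major = np.where(bright, p_major_given_bright,
                       1.0 - p_major_given_bright)
    major = rng.random(n) < p_major                   # confounded label
    return bright, major

# The correlation that holds in training is reversed at test time, so a
# model relying on timbre as a shortcut degrades under the shift.
train_bright, train_major = sample_confounded_metadata(10_000, 0.9)
test_bright, test_major = sample_confounded_metadata(2_000, 0.1)
```

Sample selection bias can be simulated in the same spirit by filtering which (timbre, key) combinations are admitted into the training set.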
Related papers
- Towards Improving Harmonic Sensitivity and Prediction Stability for
Singing Melody Extraction [36.45127093978295]
We propose an input feature modification and a training objective modification based on two assumptions.
To enhance the model's sensitivity to the trailing harmonics, we modify the Combined Frequency and Periodicity representation using a discrete z-transform (sketched below).
We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, modified from a piano transcription network.
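As a rough mechanism sketch (not the paper's exact CFP modification), a discrete z-transform can be evaluated on a spiral contour |z| = r instead of the unit circle; r = 1 recovers the plain DFT, while other radii reweight the samples and change the spectrum's sensitivity profile:

```python
import numpy as np

def ztransform_spectrum(x, r=0.995):
    """Evaluate X(z_k) = sum_n x[n] * z_k**(-n) at z_k = r * exp(2j*pi*k/N).
    r == 1 is exactly the DFT; the radius used here is an illustrative
    assumption, not the paper's chosen contour."""
    N = len(x)
    n = np.arange(N)
    k = np.arange(N)
    # z_k^{-n} = r^{-n} * exp(-2j*pi*k*n/N)
    basis = (r ** -n)[None, :] * np.exp(-2j * np.pi * np.outer(k, n) / N)
    return basis @ x

frame = np.random.randn(1024)
spectrum = np.abs(ztransform_spectrum(frame))
```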
arXiv Detail & Related papers (2023-08-04T21:59:40Z)
- Self-Supervised Visual Acoustic Matching [63.492168778869726]
Acoustic matching aims to re-synthesize an audio clip to sound as if it were recorded in a target acoustic environment.
We propose a self-supervised approach to visual acoustic matching where training samples include only the target scene image and audio.
Our approach jointly learns to disentangle room acoustics and re-synthesize audio into the target environment, via a conditional GAN framework and a novel metric.
arXiv Detail & Related papers (2023-07-27T17:59:59Z)
- Improving the Robustness of Summarization Systems with Dual Augmentation [68.53139002203118]
A robust summarization system should be able to capture the gist of the document, regardless of the specific word choices or noise in the input.
We first explore the summarization models' robustness against perturbations including word-level synonym substitution and noise.
We propose SummAttacker, an efficient approach to generating adversarial samples based on language models.
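The actual SummAttacker selects substitutions with language models; as a toy stand-in, the following sketch perturbs a sentence by word-level synonym substitution from a hand-written table (the table and all names here are our own illustration):

```python
import random

# Hypothetical synonym table standing in for language-model proposals.
SYNONYMS = {
    "document": ["text", "article"],
    "capture": ["extract", "convey"],
    "noise": ["corruption", "distortion"],
}

def perturb(sentence, p=0.3, seed=0):
    """Replace each word that has a known synonym with probability p."""
    rng = random.Random(seed)
    return " ".join(
        rng.choice(SYNONYMS[w]) if w in SYNONYMS and rng.random() < p else w
        for w in sentence.split()
    )

print(perturb("a robust system should capture the gist of the document"))
```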
arXiv Detail & Related papers (2023-06-01T19:04:17Z)
- Training a Deep Neural Network via Policy Gradients for Blind Source
Separation in Polyphonic Music Recordings [1.933681537640272]
We propose a method for the blind separation of sounds of musical instruments in audio signals.
We describe the individual tones via a parametric model, training a dictionary to capture the relative amplitudes of the harmonics.
Our algorithm yields high-quality results with particularly low interference on a variety of different audio samples.
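A minimal sketch, under our own assumptions, of such a parametric tone model: a tone is a sum of harmonics of a fundamental, and a dictionary entry stores one instrument's relative harmonic amplitudes (the numbers below are made up, not learned):

```python
import numpy as np

def synth_tone(f0, rel_amps, sr=16000, dur=1.0):
    """Render a tone as a sum of harmonics k*f0 with relative amplitudes
    rel_amps -- the quantity the learned dictionary would capture."""
    t = np.arange(int(sr * dur)) / sr
    return sum(a * np.sin(2.0 * np.pi * (k + 1) * f0 * t)
               for k, a in enumerate(rel_amps))

# One hypothetical dictionary entry (odd harmonics dominating):
clarinet_like = [1.0, 0.05, 0.40, 0.04, 0.25]
tone = synth_tone(220.0, clarinet_like)
```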
arXiv Detail & Related papers (2021-07-09T06:17:04Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
Evaluations conducted on a Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
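For intuition, here is one generic DDPM-style reverse step on a mel-spectrogram tensor, with the music-score conditioning fed to the noise-prediction network; this is the textbook update under our own assumptions, not necessarily DiffSinger's exact parameterisation:

```python
import torch

def reverse_step(model, x_t, t, score_cond, alphas, alphas_bar):
    """Sample x_{t-1} from x_t: the network predicts the noise in the
    current mel-spectrogram given the score conditioning (standard DDPM
    update with the simple sigma_t^2 = beta_t variance choice)."""
    eps = model(x_t, t, score_cond)                 # predicted noise
    a_t, ab_t = alphas[t], alphas_bar[t]
    mean = (x_t - (1.0 - a_t) / torch.sqrt(1.0 - ab_t) * eps) / torch.sqrt(a_t)
    if t == 0:
        return mean
    return mean + torch.sqrt(1.0 - a_t) * torch.randn_like(x_t)
```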
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
- Structure-Aware Audio-to-Score Alignment using Progressively Dilated
Convolutional Neural Networks [8.669338893753885]
The identification of structural differences between a music performance and the score is a challenging yet integral step of audio-to-score alignment.
We present a novel method to detect such differences using progressively dilated convolutional neural networks.
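The general idea, sketched with an assumed layer layout (the paper's exact architecture may differ): a stack of 1-D convolutions whose dilation grows layer by layer, so the receptive field over the feature sequence expands progressively:

```python
import torch.nn as nn

class ProgressivelyDilatedCNN(nn.Module):
    """Stack of dilated 1-D convolutions with growing dilation; padding=d
    with kernel size 3 keeps the time dimension fixed."""
    def __init__(self, channels=64, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        layers = []
        for d in dilations:
            layers += [nn.Conv1d(channels, channels, kernel_size=3,
                                 dilation=d, padding=d),
                       nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, x):  # x: (batch, channels, time)
        return self.net(x)
```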
arXiv Detail & Related papers (2021-01-31T05:14:58Z)
- HpRNet: Incorporating Residual Noise Modeling for Violin in a
Variational Parametric Synthesizer [11.4219428942199]
We introduce a dataset of Carnatic Violin Recordings where bow noise is an integral part of the playing style of higher pitched notes.
We obtain insights about each of the harmonic and residual components of the signal, as well as their interdependence.
arXiv Detail & Related papers (2020-08-19T12:48:32Z)
- Timbre latent space: exploration and creative aspects [1.3764085113103222]
Recent studies show that unsupervised models can learn invertible audio representations using auto-encoders.
New possibilities for timbre manipulations are enabled with generative neural networks.
arXiv Detail & Related papers (2020-08-04T07:08:04Z)
- Score-informed Networks for Music Performance Assessment [64.12728872707446]
Deep neural network-based methods that incorporate score information into music performance assessment (MPA) models have not yet been investigated.
We introduce three different models capable of score-informed performance assessment.
arXiv Detail & Related papers (2020-08-01T07:46:24Z)
- Vector-Quantized Timbre Representation [53.828476137089325]
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
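The core mechanism of a discrete latent space is nearest-neighbour lookup into a codebook; a minimal sketch (codebook size and dimensions assumed, and the loudness disentanglement omitted):

```python
import torch

def vector_quantize(z, codebook):
    """Map each latent vector to its nearest codebook entry.
    z: (batch, dim), codebook: (K, dim)."""
    dists = torch.cdist(z, codebook)   # (batch, K) pairwise distances
    idx = dists.argmin(dim=1)
    return codebook[idx], idx

codebook = torch.randn(256, 64)        # K = 256 codes of dimension 64
z_q, codes = vector_quantize(torch.randn(8, 64), codebook)
```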
arXiv Detail & Related papers (2020-07-13T12:35:45Z)
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
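A minimal sketch of the generic CVAE mechanism behind such pitch-controllable synthesis (illustrative dimensions and layers, not the authors' architecture): the condition, e.g. a pitch value, is concatenated to both the encoder input and the decoder latent, so generation can be steered by the condition:

```python
import torch
import torch.nn as nn

class CVAE(nn.Module):
    def __init__(self, x_dim=128, c_dim=1, z_dim=16, h=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim + c_dim, h), nn.ReLU())
        self.mu = nn.Linear(h, z_dim)
        self.logvar = nn.Linear(h, z_dim)
        self.dec = nn.Sequential(nn.Linear(z_dim + c_dim, h), nn.ReLU(),
                                 nn.Linear(h, x_dim))

    def forward(self, x, c):
        hid = self.enc(torch.cat([x, c], dim=-1))
        mu, logvar = self.mu(hid), self.logvar(hid)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparam
        return self.dec(torch.cat([z, c], dim=-1)), mu, logvar
```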
arXiv Detail & Related papers (2020-03-30T16:05:47Z)