A Lightweight Instrument-Agnostic Model for Polyphonic Note
Transcription and Multipitch Estimation
- URL: http://arxiv.org/abs/2203.09893v1
- Date: Fri, 18 Mar 2022 12:07:36 GMT
- Title: A Lightweight Instrument-Agnostic Model for Polyphonic Note
Transcription and Multipitch Estimation
- Authors: Rachel M. Bittner, Juan José Bosch, David Rubinstein, Gabriel
Meseguer-Brocal, Sebastian Ewert
- Abstract summary: We propose a lightweight neural network for musical instrument transcription.
Our model is trained to jointly predict frame-wise onsets, multipitch and note activations.
Benchmark results show our system's note estimation to be substantially better than a comparable baseline.
- Score: 6.131772929312604
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Automatic Music Transcription (AMT) has been recognized as a key enabling
technology with a wide range of applications. Given the task's complexity, best
results have typically been reported for systems focusing on specific settings,
e.g. instrument-specific systems tend to yield improved results over
instrument-agnostic methods. Similarly, higher accuracy can be obtained when
only estimating frame-wise $f_0$ values and neglecting the harder note event
detection. Despite their high accuracy, such specialized systems often cannot
be deployed in the real world. Storage and network constraints prohibit the use
of multiple specialized models, while memory and run-time constraints limit
their complexity. In this paper, we propose a lightweight neural network for
musical instrument transcription, which supports polyphonic outputs and
generalizes to a wide variety of instruments (including vocals). Our model is
trained to jointly predict frame-wise onsets, multipitch and note activations,
and we experimentally show that this multi-output structure improves the
resulting frame-level note accuracy. Despite its simplicity, benchmark results
show our system's note estimation to be substantially better than a comparable
baseline, and its frame-level accuracy to be only marginally below those of
specialized state-of-the-art AMT systems. With this work we hope to encourage
the community to further investigate low-resource, instrument-agnostic AMT
systems.
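The abstract describes a single network that jointly predicts frame-wise onsets, multipitch and note activations from a shared representation. As a rough illustration only (the authors' actual architecture, input representation and layer sizes are not reproduced here), a multi-output setup of this kind can be sketched in PyTorch as follows:

```python
# Hypothetical sketch, not the paper's implementation: a shared CNN trunk with
# three sigmoid heads for frame-wise onsets, multipitch and note activations.
import torch
import torch.nn as nn


class MultiOutputAMT(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        # Shared trunk over a (batch, 1, time, frequency) time-frequency input
        # (e.g. a CQT-like representation; the exact input is an assumption here).
        self.trunk = nn.Sequential(
            nn.Conv2d(1, channels, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # One small head per target; all three heads share the trunk features.
        self.onset_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.pitch_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)
        self.note_head = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor):
        h = self.trunk(x)
        onsets = torch.sigmoid(self.onset_head(h))   # frame-wise onset posteriors
        pitch = torch.sigmoid(self.pitch_head(h))    # multipitch posteriorgram
        notes = torch.sigmoid(self.note_head(h))     # note-activation posteriorgram
        return onsets, pitch, notes


def joint_loss(preds, targets):
    """Sum of per-output binary cross-entropies (one possible training choice)."""
    bce = nn.BCELoss()
    return sum(bce(p, t) for p, t in zip(preds, targets))
```

For example, `MultiOutputAMT()(torch.rand(1, 1, 100, 264))` returns three tensors of shape `(1, 1, 100, 264)`, one per output. Because all heads are trained jointly on a shared trunk, gradients from the onset and note targets also shape the features used for multipitch estimation, which is the kind of coupling the multi-output structure in the abstract provides.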
Related papers
- LC-Protonets: Multi-label Few-shot learning for world music audio tagging [65.72891334156706]
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification.
LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items.
Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music.
arXiv Detail & Related papers (2024-09-17T15:13:07Z)
- Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription [19.228155694144995]
Timbre-Trap is a novel framework which unifies music transcription and audio reconstruction.
We train a single autoencoder to simultaneously estimate pitch salience and reconstruct complex spectral coefficients.
We demonstrate that the framework leads to performance comparable to state-of-the-art instrument-agnostic transcription methods.
arXiv Detail & Related papers (2023-09-27T15:19:05Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- Multitrack Music Transcription with a Time-Frequency Perceiver [6.617487928813374]
Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously.
We propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription.
arXiv Detail & Related papers (2023-06-19T08:58:26Z)
- Classical Ensembles of Single-Qubit Quantum Variational Circuits for Classification [0.0]
The quantum asymptotically universal multi-feature (QAUM) encoding architecture was recently introduced and showed improved expressivity and performance in classifying pulsar stars.
This work reports on the design, implementation, and evaluation of ensembles of single-qubit QAUM classifiers using classical bagging and boosting techniques.
arXiv Detail & Related papers (2023-02-06T17:51:47Z)
- Fully Automated End-to-End Fake Audio Detection [57.78459588263812]
This paper proposes a fully automated end-to-end fake audio detection method.
We first use a wav2vec pre-trained model to obtain a high-level representation of the speech.
For the network structure, we use a modified version of the differentiable architecture search (DARTS) named light-DARTS.
arXiv Detail & Related papers (2022-08-20T06:46:55Z)
- Investigation of Different Calibration Methods for Deep Speaker Embedding based Verification Systems [66.61691401921296]
This paper presents an investigation over several methods of score calibration for deep speaker embedding extractors.
An additional focus of this research is to estimate the impact of score normalization on the calibration performance of the system.
arXiv Detail & Related papers (2022-03-28T21:22:22Z)
- Deep-Learning Architectures for Multi-Pitch Estimation: Towards Reliable Evaluation [7.599399338954308]
Multi-pitch estimation aims to detect the simultaneous activity of pitches in polyphonic music recordings.
In this paper, we realize different architectures based on CNNs, the U-net structure, and self-attention components.
We compare variants of these architectures in different sizes for multi-pitch estimation using the MusicNet and Schubert Winterreise datasets.
arXiv Detail & Related papers (2022-02-18T13:52:21Z)
- Machine Learning-enhanced Receive Processing for MU-MIMO OFDM Systems [15.423422040627331]
Machine learning can be used to improve multi-user multiple-input multiple-output (MU-MIMO) receive processing.
We propose a new strategy which preserves the benefits of a conventional receiver, but enhances specific parts with ML components.
arXiv Detail & Related papers (2021-06-30T14:02:27Z)
- End-to-End Object Detection with Fully Convolutional Network [71.56728221604158]
We introduce a Prediction-aware One-To-One (POTO) label assignment for classification to enable end-to-end detection.
A simple 3D Max Filtering (3DMF) is proposed to utilize the multi-scale features and improve the discriminability of convolutions in the local region.
Our end-to-end framework achieves competitive performance against many state-of-the-art detectors with NMS on COCO and CrowdHuman datasets.
arXiv Detail & Related papers (2020-12-07T09:14:55Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
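Both the abstract above and the multi-pitch evaluation entry in this list distinguish note-level scores from frame-level multipitch accuracy. As a hedged reference only (these papers' exact evaluation setups are not reproduced here), such metrics are commonly computed with the mir_eval library on toy data like the following:

```python
# Illustrative use of mir_eval: note-level scores compare estimated note events
# (onset/offset intervals plus pitches); frame-level multipitch accuracy compares
# the set of active frequencies in each analysis frame.
import numpy as np
import mir_eval

# Toy reference/estimate: a single A4 note from 0.0 s to 1.0 s.
ref_intervals = np.array([[0.0, 1.0]])
ref_pitches = np.array([440.0])                  # Hz
est_intervals = np.array([[0.02, 0.98]])
est_pitches = np.array([440.0])

note_scores = mir_eval.transcription.evaluate(
    ref_intervals, ref_pitches, est_intervals, est_pitches)
print(note_scores['F-measure'])                  # note-level F-measure

# Frame-level multipitch: one array of active frequencies per time frame.
times = np.arange(0.0, 1.0, 0.01)
ref_freqs = [np.array([440.0]) for _ in times]
est_freqs = [np.array([440.0]) for _ in times]
frame_scores = mir_eval.multipitch.evaluate(times, ref_freqs, times, est_freqs)
print(frame_scores['Accuracy'])                  # frame-level accuracy
```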
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.