Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music
Transcription
- URL: http://arxiv.org/abs/2309.15717v2
- Date: Wed, 24 Jan 2024 13:43:03 GMT
- Title: Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music
Transcription
- Authors: Frank Cwitkowitz, Kin Wai Cheuk, Woosung Choi, Marco A. Martínez-Ramírez, Keisuke Toyama, Wei-Hsiang Liao, Yuki Mitsufuji
- Abstract summary: Timbre-Trap is a novel framework which unifies music transcription and audio reconstruction.
We train a single autoencoder to simultaneously estimate pitch salience and reconstruct complex spectral coefficients.
We demonstrate that the framework leads to performance comparable to state-of-the-art instrument-agnostic transcription methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, research on music transcription has focused mainly on
architecture design and instrument-specific data acquisition. Given the
scarcity of diverse datasets, progress is often limited to solo-instrument
tasks such as piano transcription. Several works have explored multi-instrument
transcription as a means to bolster the performance of models on low-resource
tasks, but these methods face the same data availability issues. We propose
Timbre-Trap, a novel framework which unifies music transcription and audio
reconstruction by exploiting the strong separability between pitch and timbre.
We train a single autoencoder to simultaneously estimate pitch salience and
reconstruct complex spectral coefficients, selecting between either output
during the decoding stage via a simple switch mechanism. In this way, the model
learns to produce coefficients corresponding to timbre-less audio, which can be
interpreted as pitch salience. We demonstrate that the framework leads to
performance comparable to state-of-the-art instrument-agnostic transcription
methods, while only requiring a small amount of annotated data.
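The switch mechanism is simple enough to sketch. Below is a minimal PyTorch rendering of the idea, assuming complex spectral coefficients packed as two channels (real, imaginary) and a switch injected as an extra constant channel into the decoder; the layer sizes, conditioning scheme, and all names are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class TimbreTrapSketch(nn.Module):
    """One encoder, one decoder, and a binary switch choosing the decoding goal."""

    def __init__(self, latent: int = 128):
        super().__init__()
        # Input: complex spectral coefficients packed as 2 channels (real, imag).
        self.encoder = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent, 3, padding=1),
        )
        # The decoder sees the latent plus one constant channel carrying the switch.
        self.decoder = nn.Sequential(
            nn.Conv2d(latent + 1, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 2, 3, padding=1),
        )

    def forward(self, spec: torch.Tensor, transcribe: bool) -> torch.Tensor:
        z = self.encoder(spec)                                  # (B, latent, F, T)
        switch = torch.full_like(z[:, :1], float(transcribe))  # constant 0/1 plane
        out = self.decoder(torch.cat([z, switch], dim=1))      # (B, 2, F, T)
        if transcribe:
            # "Timbre-less" coefficients: their magnitude is read as pitch salience.
            return out.pow(2).sum(dim=1).sqrt()                 # (B, F, T)
        return out                                              # reconstruction


x = torch.randn(1, 2, 144, 200)        # batch of 144-bin, 200-frame spectra
model = TimbreTrapSketch()
recon = model(x, transcribe=False)     # complex-coefficient reconstruction
salience = model(x, transcribe=True)   # pitch-salience estimate
```

With the switch on, training targets correspond to timbre-less audio, so the magnitude of the decoded coefficients can be read directly as pitch salience.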
Related papers
- YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation (arXiv, 2024-07-05)
Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument.
This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription.
Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors.
- Multitrack Music Transcription with a Time-Frequency Perceiver (arXiv, 2023-06-19)
Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously.
We propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription.
- Simple and Controllable Music Generation (arXiv, 2023-06-08)
MusicGen is a single Language Model (LM) that operates over several streams of compressed discrete music representation, i.e., tokens.
Unlike prior work, MusicGen comprises a single-stage transformer LM together with efficient token-interleaving patterns.
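The token-interleaving idea is easy to illustrate. Below is a sketch of the "delay" pattern described in the MusicGen paper, in which codebook k is shifted right by k steps so that all codebooks can be predicted in parallel at each step while later codebooks still condition on earlier ones for the same frame; the pad value, tensor shapes, and function name are illustrative assumptions.

```python
import torch


def delay_interleave(codes: torch.Tensor, pad: int = -1) -> torch.Tensor:
    """codes: (K, T) token ids, one row per codebook. Returns (K, T + K - 1)."""
    K, T = codes.shape
    out = torch.full((K, T + K - 1), pad, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]  # shift codebook k right by k steps
    return out


tokens = torch.arange(8).reshape(4, 2)  # 4 codebooks, 2 frames
print(delay_interleave(tokens))
# Row k holds frame t at column t + k, so each column is one parallel LM step.
```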
- Transfer of knowledge among instruments in automatic music transcription (arXiv, 2023-04-30)
This work shows how to employ easily generated audio data from software synthesizers to train a universal model.
The resulting model is a good basis for transfer learning, allowing a transcription model to be quickly adapted to other instruments.
- Melody transcription via generative pre-training (arXiv, 2022-12-04)
A key challenge in melody transcription is building methods that can handle broad audio containing any number of instruments and musical styles.
To confront this challenge, we leverage representations from Jukebox (Dhariwal et al. 2020), a generative model of broad music audio.
We derive a new dataset containing 50 hours of melody transcriptions from crowdsourced annotations of broad music.
- Exploring the Efficacy of Pre-trained Checkpoints in Text-to-Music Generation Task (arXiv, 2022-11-21)
We generate complete and semantically consistent symbolic music scores from text descriptions.
We explore the efficacy of using publicly available checkpoints for natural language processing in the task of text-to-music generation.
Our experimental results show that the improvement from using pre-trained checkpoints is statistically significant in terms of BLEU score and edit distance similarity.
- Jointist: Joint Learning for Multi-instrument Transcription and Its Applications (arXiv, 2022-06-22)
Jointist is an instrument-aware multi-instrument framework that is capable of transcribing, recognizing, and separating multiple musical instruments from an audio clip.
Jointist consists of an instrument recognition module that conditions the other modules: a transcription module that outputs instrument-specific piano rolls, and a source separation module that utilizes instrument information and transcription results.
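That conditioning structure reads naturally as a dataflow, sketched below with placeholder modules; the plain linear layers, feature sizes, and names are hypothetical stand-ins, not Jointist's actual networks.

```python
import torch
import torch.nn as nn


class JointistSketch(nn.Module):
    """Instrument recognition conditions transcription, which feeds separation."""

    def __init__(self, feat: int = 128, n_inst: int = 16, n_pitch: int = 88):
        super().__init__()
        self.recognizer = nn.Linear(feat, n_inst)                   # which instruments?
        self.transcriber = nn.Linear(feat + n_inst, n_pitch)        # conditioned rolls
        self.separator = nn.Linear(feat + n_inst + n_pitch, feat)   # uses both

    def forward(self, feats: torch.Tensor):
        inst = torch.sigmoid(self.recognizer(feats))                      # (B, T, n_inst)
        roll = torch.sigmoid(self.transcriber(torch.cat([feats, inst], dim=-1)))
        masks = self.separator(torch.cat([feats, inst, roll], dim=-1))    # (B, T, feat)
        return inst, roll, masks


feats = torch.randn(1, 100, 128)  # 100 frames of 128-dim audio features
inst, roll, masks = JointistSketch()(feats)
```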
- Symphony Generation with Permutation Invariant Language Model (arXiv, 2022-05-10)
We present a symbolic symphony music generation solution, SymphonyNet, based on a permutation invariant language model.
A novel transformer decoder architecture is introduced as backbone for modeling extra-long sequences of symphony tokens.
Our empirical results show that the proposed approach can generate coherent, novel, complex, and harmonious symphonies, comparable to human compositions.
- MT3: Multi-Task Multitrack Music Transcription (arXiv, 2021-11-04)
We show that a general-purpose Transformer model can perform multi-task Automatic Music Transcription (AMT).
We show this unified training framework achieves high-quality transcription results across a range of datasets.
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling (arXiv, 2020-09-06)
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
- Vector-Quantized Timbre Representation (arXiv, 2020-07-13)
This paper targets a more flexible synthesis of an individual timbre by learning an approximate decomposition of its spectral properties with a set of generative features.
We introduce an auto-encoder with a discrete latent space that is disentangled from loudness in order to learn a quantized representation of a given timbre distribution.
We detail results for translating audio between orchestral instruments and singing voice, as well as transfers from vocal imitations to instruments.
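The discrete, loudness-disentangled latent rests on a vector-quantization step that is easy to sketch. The snippet below shows the generic VQ-VAE nearest-neighbor lookup with a straight-through gradient; the codebook size, dimensionality, and names are assumptions rather than the paper's settings.

```python
import torch


def quantize(z: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """z: (N, D) encoder outputs, codebook: (C, D). Returns nearest code vectors."""
    dists = torch.cdist(z, codebook)        # (N, C) pairwise L2 distances
    codes = codebook[dists.argmin(dim=1)]   # snap each latent to its nearest entry
    # Straight-through estimator: forward pass uses the discrete codes,
    # backward pass copies gradients straight through to z.
    return z + (codes - z).detach()


z = torch.randn(16, 64)          # 16 latent frames, 64 dims each
codebook = torch.randn(512, 64)  # 512 candidate timbre codes
zq = quantize(z, codebook)
```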