MT3: Multi-Task Multitrack Music Transcription
- URL: http://arxiv.org/abs/2111.03017v1
- Date: Thu, 4 Nov 2021 17:19:39 GMT
- Title: MT3: Multi-Task Multitrack Music Transcription
- Authors: Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, Jesse Engel
- Abstract summary: We show that a general-purpose Transformer model can perform multi-task Automatic Music Transcription (AMT).
We show this unified training framework achieves high-quality transcription results across a range of datasets.
- Score: 7.5947187537718905
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic Music Transcription (AMT), inferring musical notes from raw audio,
is a challenging task at the core of music understanding. Unlike Automatic
Speech Recognition (ASR), which typically focuses on the words of a single
speaker, AMT often requires transcribing multiple instruments simultaneously,
all while preserving fine-scale pitch and timing information. Further, many AMT
datasets are "low-resource", as even expert musicians find music transcription
difficult and time-consuming. Thus, prior work has focused on task-specific
architectures, tailored to the individual instruments of each task. In this
work, motivated by the promising results of sequence-to-sequence transfer
learning for low-resource Natural Language Processing (NLP), we demonstrate
that a general-purpose Transformer model can perform multi-task AMT, jointly
transcribing arbitrary combinations of musical instruments across several
transcription datasets. We show this unified training framework achieves
high-quality transcription results across a range of datasets, dramatically
improving performance for low-resource instruments (such as guitar), while
preserving strong performance for abundant instruments (such as piano).
Finally, by expanding the scope of AMT, we expose the need for more consistent
evaluation metrics and better dataset alignment, and provide a strong baseline
for this new direction of multi-task AMT.
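The abstract frames transcription as sequence-to-sequence generation, which implies serializing notes into a discrete token vocabulary. Below is a minimal sketch of a MIDI-like event vocabulary and encoder; the token names, ranges, and the `build_vocab`/`encode` helpers are illustrative assumptions, not MT3's exact scheme:

```python
# Sketch of a MIDI-like event vocabulary for seq2seq transcription
# (assumed layout; not the paper's actual token set).

def build_vocab(num_pitches=128, num_programs=128, time_steps=100):
    """Map symbolic events to integer token ids."""
    vocab = {}
    idx = 0
    for p in range(num_programs):      # which instrument plays next
        vocab[f"program_{p}"] = idx; idx += 1
    for n in range(num_pitches):       # note onset
        vocab[f"note_on_{n}"] = idx; idx += 1
    for n in range(num_pitches):       # note offset
        vocab[f"note_off_{n}"] = idx; idx += 1
    for t in range(time_steps):        # quantized time shift within a segment
        vocab[f"time_{t}"] = idx; idx += 1
    vocab["eos"] = idx                 # end-of-sequence marker
    return vocab

def encode(events, vocab):
    """Turn (time, program, pitch, is_onset) events into token ids."""
    tokens = []
    last_time = None
    for time, program, pitch, on in sorted(events):
        if time != last_time:          # emit a time token only when time advances
            tokens.append(vocab[f"time_{time}"])
            last_time = time
        tokens.append(vocab[f"program_{program}"])
        tokens.append(vocab[f"note_on_{pitch}" if on else f"note_off_{pitch}"])
    tokens.append(vocab["eos"])
    return tokens
```

A shared vocabulary like this is what lets one decoder emit arbitrary instrument combinations across datasets: the instrument identity is just another token in the output stream.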
Related papers
- Alignment-Free Training for Transducer-based Multi-Talker ASR [55.1234384771616]
Multi-talker RNNT (MT-RNNT) aims to achieve recognition without relying on costly front-end source separation.
We propose a novel alignment-free training scheme for the MT-RNNT (MT-RNNT-AFT) that adopts the standard RNNT architecture.
arXiv Detail & Related papers (2024-09-30T13:58:11Z)
- Development of Large Annotated Music Datasets using HMM-based Forced Viterbi Alignment [0.0]
We propose a well streamlined and efficient method for generating datasets for any instrument.
The onsets of the transcriptions are manually verified and the labels are accurate up to 10ms, averaging at 5ms.
This method will aid as a preliminary step towards building concrete datasets for building AMT systems for different instruments.
arXiv Detail & Related papers (2024-08-27T09:06:29Z)
- YourMT3+: Multi-instrument Music Transcription with Enhanced Transformer Architectures and Cross-dataset Stem Augmentation [15.9795868183084]
Multi-instrument music transcription aims to convert polyphonic music recordings into musical scores assigned to each instrument.
This paper introduces YourMT3+, a suite of models for enhanced multi-instrument music transcription.
Our experiments demonstrate direct vocal transcription capabilities, eliminating the need for voice separation pre-processors.
arXiv Detail & Related papers (2024-07-05T19:18:33Z)
- Meerkat: Audio-Visual Large Language Model for Grounding in Space and Time [73.7845280328535]
We present Meerkat, an audio-visual LLM equipped with a fine-grained understanding of image and audio.
Meerkat can tackle challenging tasks such as audio referred image grounding, image guided audio temporal localization, and audio-visual fact-checking.
We achieve state-of-the-art performance on all these downstream tasks with a relative improvement of up to 37.12%.
arXiv Detail & Related papers (2024-07-01T23:32:25Z)
- TMT: Tri-Modal Translation between Speech, Image, and Text by Processing Different Modalities as Different Languages [96.8603701943286]
Tri-Modal Translation (TMT) model translates between arbitrary modalities spanning speech, image, and text.
We tokenize speech and image data into discrete tokens, which provide a unified interface across modalities.
TMT outperforms single model counterparts consistently.
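The TMT entry hinges on mapping every modality into one discrete token space. A toy sketch of such a shared interface, where each modality's local ids are shifted into disjoint ranges of one vocabulary (the offsets and helper names are illustrative assumptions; TMT's actual tokenizers are learned neural codecs):

```python
# Toy shared-vocabulary interface across modalities (assumed layout).

MODALITY_OFFSETS = {"text": 0, "speech": 10_000, "image": 20_000}

def to_shared_ids(modality, local_ids):
    """Shift modality-local token ids into the shared vocabulary."""
    offset = MODALITY_OFFSETS[modality]
    return [offset + i for i in local_ids]

def from_shared_ids(shared_ids):
    """Recover (modality, local id) pairs from shared-vocabulary ids."""
    out = []
    for i in shared_ids:
        # Match against offsets from largest to smallest.
        for modality, offset in sorted(MODALITY_OFFSETS.items(),
                                       key=lambda kv: kv[1], reverse=True):
            if i >= offset:
                out.append((modality, i - offset))
                break
    return out
```

With disjoint ranges, a single sequence model can consume and emit any modality pair without per-pair architectures, which is the "different modalities as different languages" idea in the title.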
arXiv Detail & Related papers (2024-02-25T07:46:57Z)
- Rethinking and Improving Multi-task Learning for End-to-end Speech Translation [51.713683037303035]
We investigate the consistency between different tasks, considering different times and modules.
We find that the textual encoder primarily facilitates cross-modal conversion, but the presence of noise in speech impedes the consistency between text and speech representations.
We propose an improved multi-task learning (IMTL) approach for the ST task, which bridges the modal gap by mitigating the difference in length and representation.
arXiv Detail & Related papers (2023-11-07T08:48:46Z)
- Timbre-Trap: A Low-Resource Framework for Instrument-Agnostic Music Transcription [19.228155694144995]
Timbre-Trap is a novel framework which unifies music transcription and audio reconstruction.
We train a single autoencoder to simultaneously estimate pitch salience and reconstruct complex spectral coefficients.
We demonstrate that the framework leads to performance comparable to state-of-the-art instrument-agnostic transcription methods.
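The Timbre-Trap summary describes one encoder feeding two objectives: pitch-salience estimation and complex-spectrum reconstruction. A minimal NumPy sketch of that two-headed layout (random linear layers for illustration only; the dimensions, weight names, and `forward` helper are assumptions, and the real model is a trained neural autoencoder):

```python
# Toy two-headed model: shared encoder -> salience head + reconstruction head.
import numpy as np

rng = np.random.default_rng(0)
n_bins, n_latent, n_pitch = 64, 16, 88

W_enc = rng.normal(scale=0.1, size=(n_latent, n_bins))      # shared encoder
W_sal = rng.normal(scale=0.1, size=(n_pitch, n_latent))     # pitch-salience head
W_rec = rng.normal(scale=0.1, size=(2 * n_bins, n_latent))  # real+imag spectrum head

def forward(spectrum_frame):
    """One frame: spectrum -> (per-pitch salience, complex reconstruction)."""
    z = np.tanh(W_enc @ spectrum_frame)        # shared latent code
    salience = 1 / (1 + np.exp(-(W_sal @ z)))  # per-pitch activations in (0, 1)
    recon = W_rec @ z                          # stacked real/imag coefficients
    return salience, recon

sal, rec = forward(rng.normal(size=n_bins))
```

Sharing the encoder is the point: the reconstruction objective supplies a training signal even when transcription labels are scarce, which is why the framework is pitched as low-resource.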
arXiv Detail & Related papers (2023-09-27T15:19:05Z)
- Multitrack Music Transcription with a Time-Frequency Perceiver [6.617487928813374]
Multitrack music transcription aims to transcribe a music audio input into the musical notes of multiple instruments simultaneously.
We propose a novel deep neural network architecture, Perceiver TF, to model the time-frequency representation of audio input for multitrack transcription.
arXiv Detail & Related papers (2023-06-19T08:58:26Z)
- FretNet: Continuous-Valued Pitch Contour Streaming for Polyphonic Guitar Tablature Transcription [0.34376560669160383]
In certain applications, such as Guitar Tablature Transcription (GTT), it is more meaningful to estimate continuous-valued pitch contours.
We present a GTT formulation that estimates continuous-valued pitch contours, grouping them according to their string and fret of origin.
We demonstrate that for this task, the proposed method significantly improves the resolution of MPE and simultaneously yields tablature estimation results competitive with baseline models.
arXiv Detail & Related papers (2022-12-06T14:51:27Z)
- FETA: A Benchmark for Few-Sample Task Transfer in Open-Domain Dialogue [70.65782786401257]
This work explores conversational task transfer by introducing FETA: a benchmark for few-sample task transfer in open-domain dialogue.
FETA contains two underlying sets of conversations upon which there are 10 and 7 tasks annotated, enabling the study of intra-dataset task transfer.
We utilize three popular language models and three learning algorithms to analyze the transferability between 132 source-target task pairs.
arXiv Detail & Related papers (2022-05-12T17:59:00Z)
- Multilingual Denoising Pre-training for Neural Machine Translation [132.66750663226287]
mBART is a sequence-to-sequence denoising auto-encoder pre-trained on large-scale monolingual corpora.
mBART is one of the first methods for pre-training a complete sequence-to-sequence model.
arXiv Detail & Related papers (2020-01-22T18:59:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.