Composer Style Classification of Piano Sheet Music Images Using Language
Model Pretraining
- URL: http://arxiv.org/abs/2007.14587v1
- Date: Wed, 29 Jul 2020 04:13:59 GMT
- Title: Composer Style Classification of Piano Sheet Music Images Using Language
Model Pretraining
- Authors: TJ Tsai and Kevin Ji
- Abstract summary: We recast the problem to be based on raw sheet music images rather than a symbolic music format.
Our approach first converts the sheet music image into a sequence of musical "words" based on the bootleg feature representation.
We train AWD-LSTM, GPT-2, and RoBERTa language models on all piano sheet music images in IMSLP.
- Score: 16.23438816698455
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies composer style classification of piano sheet music images.
Previous approaches to the composer classification task have been limited by a
scarcity of data. We address this issue in two ways: (1) we recast the problem
to be based on raw sheet music images rather than a symbolic music format, and
(2) we propose an approach that can be trained on unlabeled data. Our approach
first converts the sheet music image into a sequence of musical "words" based
on the bootleg feature representation, and then feeds the sequence into a text
classifier. We show that it is possible to significantly improve classifier
performance by first training a language model on a set of unlabeled data,
initializing the classifier with the pretrained language model weights, and
then finetuning the classifier on a small amount of labeled data. We train
AWD-LSTM, GPT-2, and RoBERTa language models on all piano sheet music images in
IMSLP. We find that transformer-based architectures outperform CNN and LSTM
models, and pretraining boosts classification accuracy for the GPT-2 model from
46% to 70% on a 9-way classification task. The trained model can also be used
as a feature extractor that projects piano sheet music into a feature space
that characterizes compositional style.
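To make the finetuning stage described in the abstract concrete, here is a minimal sketch using the Hugging Face transformers GPT-2 sequence-classification head. The checkpoint path, composer list, and the idea of passing a space-separated bootleg "word" string through the stock GPT-2 tokenizer are assumptions for illustration; the paper's bootleg feature extraction from sheet music images is not reproduced here.

```python
# Hedged sketch: classify a page of piano sheet music (already converted to a
# sequence of bootleg "words") with a GPT-2 model pretrained on IMSLP bootleg
# sequences and finetuned for 9-way composer classification. The checkpoint
# path and label set are hypothetical placeholders, not the authors' release.
import torch
from transformers import GPT2TokenizerFast, GPT2ForSequenceClassification

COMPOSERS = ["Bach", "Beethoven", "Chopin", "Haydn", "Liszt",
             "Mozart", "Schubert", "Schumann", "Scriabin"]  # assumed 9 classes

checkpoint = "path/to/bootleg-gpt2-pretrained"  # hypothetical pretrained LM
tokenizer = GPT2TokenizerFast.from_pretrained(checkpoint)
tokenizer.pad_token = tokenizer.eos_token

model = GPT2ForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(COMPOSERS))
model.config.pad_token_id = tokenizer.pad_token_id

def classify(bootleg_words: str) -> str:
    """Classify one page given its space-separated bootleg word sequence."""
    inputs = tokenizer(bootleg_words, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**inputs).logits
    return COMPOSERS[int(logits.argmax(dim=-1))]
```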
Related papers
- Audio-to-Score Conversion Model Based on Whisper methodology [0.0]
This thesis innovatively introduces the "Orpheus' Score", a custom notation system that converts music information into tokens.
Experiments show that compared to traditional algorithms, the model has significantly improved accuracy and performance.
arXiv Detail & Related papers (2024-10-22T17:31:37Z)
- PBSCR: The Piano Bootleg Score Composer Recognition Dataset [5.314803183185992]
PBSCR is a dataset for studying composer recognition of classical piano music.
It contains 40,000 62x64 bootleg score images for a 9-class recognition task, 100,000 62x64 bootleg score images for a 100-class recognition task, and 29,310 unlabeled variable-length bootleg score images for pretraining.
arXiv Detail & Related papers (2024-01-30T07:50:32Z)
- Image-free Classifier Injection for Zero-Shot Classification [72.66409483088995]
Zero-shot learning models achieve remarkable results on image classification for samples from classes that were not seen during training.
We aim to equip pre-trained models with zero-shot classification capabilities without the use of image data.
We achieve this with our proposed Image-free Classifier Injection with Semantics (ICIS).
arXiv Detail & Related papers (2023-08-21T09:56:48Z)
- Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? [41.56585313495218]
A vision-language model can be adapted to a new classification task through few-shot prompt tuning.
We study the key reasons contributing to the robustness of the prompt tuning paradigm.
We demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt.
arXiv Detail & Related papers (2023-07-22T04:20:30Z)
- GIST: Generating Image-Specific Text for Fine-grained Object Classification [8.118079247462425]
GIST is a method for generating image-specific fine-grained text descriptions from image-only datasets.
Our method achieves an average improvement of 4.1% in accuracy over CLIP linear probes.
arXiv Detail & Related papers (2023-07-21T02:47:18Z)
- Text Descriptions are Compressive and Invariant Representations for Visual Learning [63.3464863723631]
We show that an alternative approach, in line with humans' understanding of multiple visual features per class, can provide compelling performance in the robust few-shot learning setting.
In particular, we introduce a novel method, SLR-AVD (Sparse Logistic Regression using Augmented Visual Descriptors).
This method first automatically generates multiple visual descriptions of each class via a large language model (LLM), then uses a VLM to translate these descriptions into a set of visual feature embeddings for each image, and finally uses sparse logistic regression to select a relevant subset of these features to classify each image.
arXiv Detail & Related papers (2023-07-10T03:06:45Z)
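A rough sketch of the three-step recipe summarized in the SLR-AVD entry above: class descriptions (here hand-written stand-ins for LLM output) are embedded with CLIP, each image is scored against every description, and an L1-penalized logistic regression selects a sparse subset of those description features. The model name, classes, and descriptions are illustrative assumptions, not the authors' released prompts or code.

```python
# Illustrative SLR-AVD-style pipeline sketch (not the authors' implementation).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor
from sklearn.linear_model import LogisticRegression

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Step 1 (assumed LLM output): several visual descriptions per class.
descriptions = {
    "sparrow": ["a small brown bird", "a bird with streaked feathers"],
    "bulldog": ["a stocky dog with a wrinkled face", "a short-muzzled dog"],
}
all_texts = [d for ds in descriptions.values() for d in ds]

def description_features(images: list) -> torch.Tensor:
    """Step 2: image-to-description CLIP similarities as a feature vector."""
    inputs = processor(text=all_texts, images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    return out.logits_per_image  # shape: (num_images, num_descriptions)

# Step 3: sparse (L1) logistic regression over the description features;
# X/y would come from the few-shot support set, e.g.
#   clf.fit(description_features(train_images).numpy(), train_labels)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
```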
- CLIPSonic: Text-to-Audio Synthesis with Unlabeled Videos and Pretrained Language-Vision Models [50.42886595228255]
We propose to learn the desired text-audio correspondence by leveraging the visual modality as a bridge.
We train a conditional diffusion model to generate the audio track of a video, given a video frame encoded by a pretrained contrastive language-image pretraining model.
arXiv Detail & Related papers (2023-06-16T05:42:01Z)
- GETMusic: Generating Any Music Tracks with a Unified Representation and Diffusion Framework [58.64512825534638]
Symbolic music generation aims to create musical notes, which can help users compose music.
We introduce a framework known as GETMusic, with "GET" standing for "GEnerate music Tracks".
GETScore represents musical notes as tokens and organizes tokens in a 2D structure, with tracks stacked vertically and progressing horizontally over time.
Our proposed representation, coupled with the non-autoregressive generative model, empowers GETMusic to generate music with any arbitrary source-target track combinations.
arXiv Detail & Related papers (2023-05-18T09:53:23Z)
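To make the 2D layout described in the GETMusic entry concrete, here is a toy grid in that spirit: one row per track, one column per time step, and a token id in each cell. The token ids and padding value are invented for illustration and do not match GETMusic's actual GETScore vocabulary.

```python
# Toy illustration of a GETScore-like layout: a 2D grid of token ids with one
# row per track and one column per time step. Values are invented placeholders.
import numpy as np

PAD = 0                      # assumed "empty cell" token
n_tracks, n_steps = 3, 8     # e.g. melody, chords, bass over 8 time steps

score = np.full((n_tracks, n_steps), PAD, dtype=np.int64)
score[0, [0, 2, 4, 6]] = [61, 63, 65, 61]   # melody-track note tokens
score[1, [0, 4]] = [17, 19]                 # chord-track tokens
score[2, ::2] = 33                          # bass-track tokens

# An arbitrary source-target track combination is then just a row mask, e.g.
# generate the bass row (target) conditioned on the other rows (sources).
target_mask = np.zeros(n_tracks, dtype=bool)
target_mask[2] = True
print(score)
```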
- Language Quantized AutoEncoders: Towards Unsupervised Text-Image Alignment [81.73717488887938]
Language-Quantized AutoEncoder (LQAE) learns to align text-image data in an unsupervised manner by leveraging pretrained language models.
LQAE learns to represent similar images with similar clusters of text tokens, thereby aligning these two modalities without the use of aligned text-image pairs.
This enables few-shot image classification with large language models (e.g., GPT-3) as well as linear classification of images based on BERT text features.
arXiv Detail & Related papers (2023-02-02T06:38:44Z)
- Codified audio language modeling learns useful representations for music information retrieval [77.63657430536593]
We show that language models pre-trained on codified (discretely-encoded) music audio learn representations that are useful for downstream MIR tasks.
To determine if Jukebox's representations contain useful information for MIR, we use them as input features to train shallow models on several MIR tasks.
We observe that representations from Jukebox are considerably stronger than those from models pre-trained on tagging, suggesting that pre-training via codified audio language modeling may address blind spots in conventional approaches.
arXiv Detail & Related papers (2021-07-12T18:28:50Z)
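A small sketch of the probing setup described in the codified-audio entry above: pretrained-model representations are used as fixed input features for a shallow classifier on a downstream MIR task. The feature array below is a random stand-in; extracting real Jukebox activations is not reproduced here.

```python
# Sketch of shallow probing on top of pretrained audio-model representations.
# The features and labels are random stand-ins for illustration only.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_clips, feat_dim, n_genres = 200, 4800, 10   # assumed sizes
X = rng.normal(size=(n_clips, feat_dim))      # stand-in for per-clip features
y = rng.integers(0, n_genres, size=n_clips)   # stand-in genre labels

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)  # shallow probe
print("probe accuracy:", probe.score(X_te, y_te))
```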
- BERT-like Pre-training for Symbolic Piano Music Classification Tasks [15.02723006489356]
This article presents a benchmark study of symbolic piano music classification using the Bidirectional Encoder Representations from Transformers (BERT) approach.
We pre-train two 12-layer Transformer models using the BERT approach and fine-tune them for four downstream classification tasks.
Our evaluation shows that the BERT approach leads to higher classification accuracy than recurrent neural network (RNN)-based baselines.
arXiv Detail & Related papers (2021-07-12T07:03:57Z)