Visual Attention for Musical Instrument Recognition
- URL: http://arxiv.org/abs/2006.09640v2
- Date: Sun, 21 Jun 2020 15:53:37 GMT
- Title: Visual Attention for Musical Instrument Recognition
- Authors: Karn Watcharasupat, Siddharth Gururani and Alexander Lerch
- Abstract summary: We explore the use of an attention mechanism in a timbral-temporal sense, à la visual attention, to improve the performance of musical instrument recognition.
The first approach applies an attention mechanism to the sliding-window paradigm, where a prediction based on each timbral-temporal 'instance' is given an attention weight before aggregation to produce the final prediction.
The second approach is based on a recurrent model of visual attention, where the network attends only to parts of the spectrogram and decides where to attend next, given a limited number of 'glimpses'.
- Score: 72.05116221011949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the field of music information retrieval, the task of simultaneously
identifying the presence or absence of multiple musical instruments in a
polyphonic recording remains a hard problem. Previous works have seen some
success in improving instrument classification by applying temporal attention
in a multi-instance multi-label setting, while another line of work has
suggested the role of pitch and timbre in improving instrument recognition
performance. In this project, we further explore the use of an attention
mechanism in a timbral-temporal sense, à la visual attention, to improve the
performance of musical instrument recognition using weakly-labeled data. Two
approaches to this task have been explored. The first applies an attention
mechanism to the sliding-window paradigm, where a prediction based on each
timbral-temporal 'instance' is given an attention weight before aggregation
to produce the final prediction. The second approach is based on a recurrent
model of visual attention, where the network attends only to parts of the
spectrogram and decides where to attend next, given a limited number of
'glimpses'.
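To make the first approach concrete, below is a minimal sketch of attention-weighted aggregation over sliding-window 'instances', in the spirit of attention-based multiple-instance learning. The module name, tensor shapes, and dimensions are illustrative assumptions, not the authors' exact architecture; PyTorch is used only for convenience.

```python
import torch
import torch.nn as nn

class AttentionAggregation(nn.Module):
    """Illustrative sketch (not the paper's exact model): each
    timbral-temporal 'instance' (e.g., a spectrogram patch from a sliding
    window) is scored by an attention head, and the clip-level prediction
    is the attention-weighted sum of the per-instance predictions."""

    def __init__(self, embed_dim: int, num_labels: int):
        super().__init__()
        self.classifier = nn.Linear(embed_dim, num_labels)  # per-instance logits
        self.attention = nn.Linear(embed_dim, 1)            # per-instance score

    def forward(self, instances: torch.Tensor) -> torch.Tensor:
        # instances: (batch, num_instances, embed_dim)
        logits = self.classifier(instances)                 # (B, N, num_labels)
        weights = torch.softmax(self.attention(instances), dim=1)  # over instances
        return (weights * logits).sum(dim=1)                # (B, num_labels)

# Hypothetical usage: 8 clips, 12 windows each, 128-dim instance embeddings,
# 20 instrument labels; sigmoid since the task is multi-label.
model = AttentionAggregation(embed_dim=128, num_labels=20)
probs = torch.sigmoid(model(torch.randn(8, 12, 128)))  # (8, 20)
```

Aggregating logits rather than probabilities is one of several plausible choices here; either way, the learned weights make the clip-level decision attributable to individual windows.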
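For the second approach, the sketch below mimics a recurrent visual-attention ('glimpse') loop over a spectrogram, in the spirit of Mnih et al.'s recurrent attention model. The glimpse size, number of glimpses, and layer shapes are all assumptions made for illustration, and a differentiable crop via grid sampling stands in for the original model's stochastic glimpse sensor.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlimpseRecognizer(nn.Module):
    """Illustrative sketch: at each step the network crops a small patch
    ('glimpse') of the spectrogram at the current location, updates a
    recurrent state, and emits the next location to attend to."""

    def __init__(self, glimpse: int = 16, hidden: int = 128, num_labels: int = 20):
        super().__init__()
        self.glimpse = glimpse
        self.encode = nn.Linear(glimpse * glimpse + 2, hidden)  # patch + location
        self.rnn = nn.GRUCell(hidden, hidden)
        self.locator = nn.Linear(hidden, 2)             # next (x, y) in [-1, 1]
        self.classifier = nn.Linear(hidden, num_labels)

    def crop(self, spec: torch.Tensor, loc: torch.Tensor) -> torch.Tensor:
        # spec: (B, 1, freq, time); loc: (B, 2) in [-1, 1].
        # Affine grid that zooms into a glimpse-sized window centred at loc.
        B, _, H, W = spec.shape
        theta = torch.zeros(B, 2, 3, device=spec.device)
        theta[:, 0, 0] = self.glimpse / W
        theta[:, 1, 1] = self.glimpse / H
        theta[:, :, 2] = loc
        grid = F.affine_grid(theta, (B, 1, self.glimpse, self.glimpse),
                             align_corners=False)
        return F.grid_sample(spec, grid, align_corners=False)

    def forward(self, spec: torch.Tensor, num_glimpses: int = 6) -> torch.Tensor:
        B = spec.size(0)
        h = spec.new_zeros(B, self.rnn.hidden_size)
        loc = spec.new_zeros(B, 2)                   # start at the centre
        for _ in range(num_glimpses):
            patch = self.crop(spec, loc).flatten(1)  # (B, glimpse * glimpse)
            g = torch.relu(self.encode(torch.cat([patch, loc], dim=1)))
            h = self.rnn(g, h)
            loc = torch.tanh(self.locator(h))        # where to look next
        return self.classifier(h)                    # clip-level logits

# Hypothetical usage on a (batch, channel, freq bins, time frames) spectrogram.
logits = GlimpseRecognizer()(torch.randn(4, 1, 128, 256))  # (4, 20)
```

Note that the original recurrent attention model samples glimpse locations stochastically and is trained with REINFORCE; the grid-sampling crop above is a differentiable stand-in that keeps the sketch end-to-end trainable.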
Related papers
- Toward a More Complete OMR Solution [49.74172035862698]
Optical music recognition (OMR) aims to convert music notation into digital formats.
One approach to OMR is a multi-stage pipeline, where the system first detects visual music notation elements in the image.
First, we introduce a music object detector based on YOLOv8, which improves detection performance.
Second, we introduce a supervised training pipeline that completes the notation assembly stage based on detection output.
arXiv Detail & Related papers (2024-08-31T01:09:12Z) - Cadence Detection in Symbolic Classical Music using Graph Neural Networks [7.817685358710508]
We present a graph representation of symbolic scores as an intermediate means to solve the cadence detection task.
We approach cadence detection as an imbalanced node classification problem using a Graph Convolutional Network.
Our experiments suggest that graph convolution can learn non-local features that assist in cadence detection, freeing us from having to devise specialized features that encode non-local context.
arXiv Detail & Related papers (2022-08-31T12:39:57Z) - Self-Attention Neural Bag-of-Features [103.70855797025689]
We build on the recently introduced 2D-Attention and reformulate the attention learning methodology.
We propose a joint feature-temporal attention mechanism that learns a joint 2D attention mask highlighting relevant information.
arXiv Detail & Related papers (2022-01-26T17:54:14Z) - Revisiting spatio-temporal layouts for compositional action recognition [63.04778884595353]
We take an object-centric approach to action recognition.
The main focus of this paper is compositional/few-shot action recognition.
We demonstrate how to improve the performance of appearance-based models by fusion with layout-based models.
arXiv Detail & Related papers (2021-11-02T23:04:39Z) - Recurrent Attention Models with Object-centric Capsule Representation for Multi-object Recognition [4.143091738981101]
We show that an object-centric hidden representation in an encoder-decoder model with iterative glimpse attention yields effective integration of attention and recognition.
Our work takes a step toward a general architecture for how to integrate recurrent object-centric representation into the planning of attentional glimpses.
arXiv Detail & Related papers (2021-10-11T01:41:21Z) - Counterfactual Attention Learning for Fine-Grained Visual Categorization and Re-identification [101.49122450005869]
We present a counterfactual attention learning method to learn more effective attention based on causal inference.
Specifically, we analyze the effect of the learned visual attention on network prediction.
We evaluate our method on a wide range of fine-grained recognition tasks.
arXiv Detail & Related papers (2021-08-19T14:53:40Z) - Timbre Classification of Musical Instruments with a Deep Learning Multi-Head Attention-Based Model [1.7188280334580197]
The aim of this work is to define a model that is able to identify different instrument timbres with as few parameters as possible.
The model can classify instruments by timbre even when they play the same note at the same intensity.
arXiv Detail & Related papers (2021-07-13T16:34:19Z) - Audiovisual transfer learning for audio tagging and sound event detection [21.574781022415372]
We study the merit of transfer learning for two sound recognition problems, i.e., audio tagging and sound event detection.
We adapt a baseline system utilizing only spectral acoustic inputs to make use of pretrained auditory and visual features.
We perform experiments with these modified models on an audiovisual multi-label data set.
arXiv Detail & Related papers (2021-06-09T21:55:05Z) - Music Gesture for Visual Sound Separation [121.36275456396075]
"Music Gesture" is a keypoint-based structured representation to explicitly model the body and finger movements of musicians when they perform music.
We first adopt a context-aware graph network to integrate visual semantic context with body dynamics, and then apply an audio-visual fusion model to associate body movements with the corresponding audio signals.
arXiv Detail & Related papers (2020-04-20T17:53:46Z)