Frame-Level Multi-Label Playing Technique Detection Using Multi-Scale
Network and Self-Attention Mechanism
- URL: http://arxiv.org/abs/2303.13272v1
- Date: Thu, 23 Mar 2023 13:52:42 GMT
- Title: Frame-Level Multi-Label Playing Technique Detection Using Multi-Scale
Network and Self-Attention Mechanism
- Authors: Dichucheng Li, Mingjin Che, Wenwu Meng, Yulun Wu, Yi Yu, Fan Xia, Wei
Li
- Abstract summary: We formulate a frame-level multi-label classification problem and apply it to Guzheng, a Chinese plucked string instrument.
Because different IPTs vary widely in length, we propose a new method that combines a multi-scale network with self-attention.
Our approach outperforms existing works by a large margin, indicating its effectiveness in IPT detection.
- Score: 6.2680838592065715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Instrument playing technique (IPT) is a key element of musical presentation.
However, most existing work on IPT detection concerns only monophonic
music signals; little has been done to detect IPTs in polyphonic
instrumental solo pieces with overlapping IPTs or mixed IPTs. In this paper, we
formulate it as a frame-level multi-label classification problem and apply it
to Guzheng, a Chinese plucked string instrument. We create a new dataset,
Guzheng\_Tech99, containing Guzheng recordings and onset, offset, pitch, IPT
annotations for each note. Because different IPTs vary widely in length, we
propose a new method that combines a multi-scale network with self-attention.
The multi-scale network extracts features at different
scales, and the self-attention mechanism applied to the feature maps at the
coarsest scale further enhances the long-range feature extraction. Our approach
outperforms existing works by a large margin, indicating its effectiveness in
IPT detection.
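The multi-scale-plus-attention design the abstract describes can be sketched in a few lines (a minimal NumPy illustration; the frame count, pooling factors, and random linear head are hypothetical, and this is not the authors' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

def self_attention(x):
    """Scaled dot-product self-attention over time. x: (T, D)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)               # (T, T) pairwise similarities
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                          # context-enriched features

def avg_pool(x, k):
    """Downsample the time axis by factor k via mean pooling."""
    T = (x.shape[0] // k) * k
    return x[:T].reshape(-1, k, x.shape[1]).mean(axis=1)

T, D, n_ipts = 64, 16, 7                        # frames, feature dim, IPT classes
frames = rng.standard_normal((T, D))            # stand-in for spectrogram features

# Multi-scale feature extraction: the same features at three temporal scales.
fine, mid, coarse = frames, avg_pool(frames, 2), avg_pool(frames, 4)

# Self-attention only on the coarsest scale captures long-range context cheaply.
coarse = self_attention(coarse)

# Upsample the coarser scales back to frame rate and fuse by summation.
fused = fine + np.repeat(mid, 2, axis=0) + np.repeat(coarse, 4, axis=0)

# Frame-level multi-label head: one independent sigmoid per IPT class.
W = rng.standard_normal((D, n_ipts)) * 0.1
probs = 1.0 / (1.0 + np.exp(-(fused @ W)))      # (T, n_ipts) in [0, 1]
active = probs > 0.5                            # overlapping IPTs allowed per frame
print(probs.shape)                              # (64, 7)
```

Overlapping or mixed IPTs are handled naturally here: each class gets an independent sigmoid, so several classes can be active in the same frame.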
Related papers
- PPT: Pretraining with Pseudo-Labeled Trajectories for Motion Forecasting [90.47748423913369]
State-of-the-art motion forecasting models rely on large curated datasets with manually annotated or heavily post-processed trajectories.
PPT is a simple and scalable alternative that uses unprocessed and diverse trajectories automatically generated from off-the-shelf 3D detectors and tracking.
It achieves strong performance across standard benchmarks particularly in low-data regimes, and in cross-domain, end-to-end and multi-class settings.
arXiv Detail & Related papers (2024-12-09T13:48:15Z)
- LC-Protonets: Multi-label Few-shot learning for world music audio tagging [65.72891334156706]
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification.
LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items.
Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music.
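The prototype-per-label-combination idea could look roughly like this (a pure-NumPy sketch with made-up embeddings and tags; the superset-membership rule and nearest-prototype assignment are assumptions, not the authors' code):

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(1)

# Hypothetical few-shot support set: (embedding, set of tags) pairs.
D = 8
support = [
    (rng.standard_normal(D), {"guzheng"}),
    (rng.standard_normal(D), {"guzheng", "tremolo"}),
    (rng.standard_normal(D), {"vibrato"}),
    (rng.standard_normal(D), {"guzheng", "vibrato"}),
]

# Power set of observed labels; keep combinations covered by >= 1 support item.
labels = sorted(set().union(*(tags for _, tags in support)))
prototypes = {}
for r in range(1, len(labels) + 1):
    for combo in combinations(labels, r):
        members = [emb for emb, tags in support if set(combo) <= tags]
        if members:  # prototype = mean embedding of items containing the combo
            prototypes[combo] = np.mean(members, axis=0)

def predict(query):
    """Assign the label combination of the nearest prototype (Euclidean)."""
    best = min(prototypes, key=lambda c: np.linalg.norm(query - prototypes[c]))
    return set(best)

pred = predict(support[1][0])
print(len(prototypes), pred)
```

A query is tagged with an entire label combination at once, which is what lets a prototypical network emit multi-label predictions without thresholding per-class scores.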
arXiv Detail & Related papers (2024-09-17T15:13:07Z)
- OTSeg: Multi-prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation [57.84148140637513]
Multi-Prompts Sinkhorn Attention (MPSA) effectively replaces cross-attention mechanisms within the Transformer framework in multimodal settings.
OTSeg achieves state-of-the-art (SOTA) performance with significant gains on Zero-Shot Semantic Segmentation (ZS3) tasks.
arXiv Detail & Related papers (2024-03-21T07:15:37Z)
- MERTech: Instrument Playing Technique Detection Using Self-Supervised Pretrained Model With Multi-Task Finetuning [17.307289537499184]
We propose to apply a self-supervised learning model pre-trained on large-scale unlabeled music data and finetune it on IPT detection tasks.
Our method outperforms prior approaches in both frame-level and event-level metrics across multiple IPT benchmark datasets.
arXiv Detail & Related papers (2023-10-15T15:00:00Z)
- Multi-Resolution Audio-Visual Feature Fusion for Temporal Action Localization [8.633822294082943]
This paper introduces Multi-Resolution Audio-Visual Feature Fusion (MRAV-FF), an innovative method for merging audio-visual data across different temporal resolutions.
arXiv Detail & Related papers (2023-10-05T10:54:33Z)
- Playing Technique Detection by Fusing Note Onset Information in Guzheng Performance [10.755276589673434]
We propose an end-to-end Guzheng playing technique detection system using Fully Convolutional Networks.
Our approach achieves 87.97% in frame-level accuracy and 80.76% in note-level F1-score, outperforming existing works by a large margin.
arXiv Detail & Related papers (2022-09-19T06:02:37Z)
- A Lightweight Instrument-Agnostic Model for Polyphonic Note Transcription and Multipitch Estimation [6.131772929312604]
We propose a lightweight neural network for musical instrument transcription.
Our model is trained to jointly predict frame-wise onsets, multipitch and note activations.
Benchmark results show our system's note estimation to be substantially better than a comparable baseline.
arXiv Detail & Related papers (2022-03-18T12:07:36Z)
- MFNet: Multi-filter Directive Network for Weakly Supervised Salient Object Detection [104.0177412274975]
Weakly supervised salient object detection (WSOD) aims to train a CNN-based saliency network using only low-cost annotations.
Existing WSOD methods use various techniques to distill a single "high-quality" pseudo label from low-cost annotations and then train their saliency networks on it.
We introduce a new multiple-pseudo-label framework to integrate more comprehensive and accurate saliency cues from multiple labels.
arXiv Detail & Related papers (2021-12-03T06:12:42Z)
- Automatic Polyp Segmentation via Multi-scale Subtraction Network [100.94922587360871]
In clinical practice, precise polyp segmentation provides important information in the early detection of colorectal cancer.
Most existing methods are based on a U-shaped structure and use element-wise addition or concatenation to progressively fuse features from different levels in the decoder.
We propose a multi-scale subtraction network (MSNet) to segment polyps from colonoscopy images.
arXiv Detail & Related papers (2021-08-11T07:54:07Z)
- Pitch-Informed Instrument Assignment Using a Deep Convolutional Network with Multiple Kernel Shapes [22.14133334414372]
This paper proposes a deep convolutional neural network for performing note-level instrument assignment.
Experiments on the MusicNet dataset using 7 instrument classes show that our approach is able to achieve an average F-score of 0.904.
arXiv Detail & Related papers (2021-07-28T19:48:09Z)
- Fast accuracy estimation of deep learning based multi-class musical source separation [79.10962538141445]
We propose a method to evaluate the separability of instruments in any dataset without training and tuning a neural network.
Based on the oracle principle with an ideal ratio mask, our approach is an excellent proxy to estimate the separation performances of state-of-the-art deep learning approaches.
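The ideal-ratio-mask oracle behind this proxy can be sketched as follows (NumPy, with synthetic complex spectrograms standing in for real STFTs; the magnitude-ratio mask and the SDR metric are one common formulation, not necessarily the paper's exact setup):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy complex spectrograms for the target source and the rest of the mix.
F, T = 64, 100
phase = lambda: np.exp(1j * rng.uniform(0, 2 * np.pi, (F, T)))
S = rng.random((F, T)) * phase()     # target instrument
N = rng.random((F, T)) * phase()     # all other instruments
X = S + N                            # observed mixture

# Oracle ideal ratio mask, built from the (normally unobservable) true magnitudes.
irm = np.abs(S) / (np.abs(S) + np.abs(N) + 1e-8)

# Best-case masking estimate: scale the mixture, keep the mixture phase.
S_hat = irm * X

# SDR of the oracle estimate upper-bounds what a learned masking separator
# can reach, so it serves as a training-free separability proxy for a dataset.
err = S - S_hat
sdr = 10 * np.log10(np.sum(np.abs(S) ** 2) / np.sum(np.abs(err) ** 2))
print(round(float(sdr), 2), "dB")
```

Because the mask uses the true source magnitudes, no network needs to be trained or tuned; the resulting SDR depends only on how much the sources overlap in time-frequency.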
arXiv Detail & Related papers (2020-10-19T13:05:08Z)
- Multi-Scale Positive Sample Refinement for Few-Shot Object Detection [61.60255654558682]
Few-shot object detection (FSOD) helps detectors adapt to unseen classes with few training instances.
We propose a Multi-scale Positive Sample Refinement (MPSR) approach to enrich object scales in FSOD.
MPSR generates multi-scale positive samples as object pyramids and refines the prediction at various scales.
arXiv Detail & Related papers (2020-07-18T09:48:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.