Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models
- URL: http://arxiv.org/abs/2404.06818v1
- Date: Wed, 10 Apr 2024 08:06:15 GMT
- Title: Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models
- Authors: Taegyun Kwon, Dasaem Jeong, Juhan Nam,
- Abstract summary: We propose novel architectures for convolutional recurrent neural networks.
We improve note-state sequence modeling by using a pitchwise LSTM.
We show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset.
- Score: 7.928003786376716
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, advancements in neural network designs and the availability of large-scale labeled datasets have led to significant improvements in the accuracy of piano transcription models. However, most previous work focused on high-performance offline transcription, neglecting deliberate consideration of model size. The goal of this work is to implement real-time inference for piano transcription while ensuring both high performance and lightweight. To this end, we propose novel architectures for convolutional recurrent neural networks, redesigning an existing autoregressive piano transcription model. First, we extend the acoustic module by adding a frequency-conditioned FiLM layer to the CNN module to adapt the convolutional filters on the frequency axis. Second, we improve note-state sequence modeling by using a pitchwise LSTM that focuses on note-state transitions within a note. In addition, we augment the autoregressive connection with an enhanced recursive context. Using these components, we propose two types of models; one for high performance and the other for high compactness. Through extensive experiments, we show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset. We also investigate the effective model size and real-time inference latency by gradually streamlining the architecture. Finally, we conduct cross-data evaluation on unseen piano datasets and in-depth analysis to elucidate the effect of the proposed components in the view of note length and pitch range.
Related papers
- sTransformer: A Modular Approach for Extracting Inter-Sequential and Temporal Information for Time-Series Forecasting [6.434378359932152]
We review and categorize existing Transformer-based models into two main types: (1) modifications to the model structure and (2) modifications to the input data.
We propose $textbfsTransformer$, which introduces the Sequence and Temporal Convolutional Network (STCN) to fully capture both sequential and temporal information.
We compare our model with linear models and existing forecasting models on long-term time-series forecasting, achieving new state-of-the-art results.
arXiv Detail & Related papers (2024-08-19T06:23:41Z) - KFD-NeRF: Rethinking Dynamic NeRF with Kalman Filter [49.85369344101118]
We introduce KFD-NeRF, a novel dynamic neural radiance field integrated with an efficient and high-quality motion reconstruction framework based on Kalman filtering.
Our key idea is to model the dynamic radiance field as a dynamic system whose temporally varying states are estimated based on two sources of knowledge: observations and predictions.
Our KFD-NeRF demonstrates similar or even superior performance within comparable computational time and state-of-the-art view synthesis performance with thorough training.
arXiv Detail & Related papers (2024-07-18T05:48:24Z) - Orchid: Flexible and Data-Dependent Convolution for Sequence Modeling [4.190836962132713]
This paper introduces Orchid, a novel architecture designed to address the quadratic complexity of traditional attention mechanisms.
At the core of this architecture lies a new data-dependent global convolution layer, which contextually adapts its conditioned kernel on input sequence.
We evaluate the proposed model across multiple domains, including language modeling and image classification, to highlight its performance and generality.
arXiv Detail & Related papers (2024-02-28T17:36:45Z) - Data-efficient Large Vision Models through Sequential Autoregression [58.26179273091461]
We develop an efficient, autoregression-based vision model on a limited dataset.
We demonstrate how this model achieves proficiency in a spectrum of visual tasks spanning both high-level and low-level semantic understanding.
Our empirical evaluations underscore the model's agility in adapting to various tasks, heralding a significant reduction in the parameter footprint.
arXiv Detail & Related papers (2024-02-07T13:41:53Z) - Cross-modal Prompts: Adapting Large Pre-trained Models for Audio-Visual
Downstream Tasks [55.36987468073152]
This paper proposes a novel Dual-Guided Spatial-Channel-Temporal (DG-SCT) attention mechanism.
The DG-SCT module incorporates trainable cross-modal interaction layers into pre-trained audio-visual encoders.
Our proposed model achieves state-of-the-art results across multiple downstream tasks, including AVE, AVVP, AVS, and AVQA.
arXiv Detail & Related papers (2023-11-09T05:24:20Z) - Towards Improving Harmonic Sensitivity and Prediction Stability for
Singing Melody Extraction [36.45127093978295]
We propose an input feature modification and a training objective modification based on two assumptions.
To enhance the model's sensitivity on the trailing harmonics, we modify the Combined Frequency and Periodicity representation using discrete z-transform.
We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, modified from a piano transcription network.
arXiv Detail & Related papers (2023-08-04T21:59:40Z) - Unsupervised Continual Semantic Adaptation through Neural Rendering [32.099350613956716]
We study continual multi-scene adaptation for the task of semantic segmentation.
We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model.
We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method.
arXiv Detail & Related papers (2022-11-25T09:31:41Z) - Anomaly Detection of Time Series with Smoothness-Inducing Sequential
Variational Auto-Encoder [59.69303945834122]
We present a Smoothness-Inducing Sequential Variational Auto-Encoder (SISVAE) model for robust estimation and anomaly detection of time series.
Our model parameterizes mean and variance for each time-stamp with flexible neural networks.
We show the effectiveness of our model on both synthetic datasets and public real-world benchmarks.
arXiv Detail & Related papers (2021-02-02T06:15:15Z) - Polyphonic Piano Transcription Using Autoregressive Multi-State Note
Model [6.65616155956618]
We propose a unified neural network architecture where multiple note states are predicted as a softmax output with a single loss function.
We show that our proposed model achieves performance comparable to state-of-the-arts with fewer parameters.
arXiv Detail & Related papers (2020-10-02T17:03:19Z) - Dynamic Model Pruning with Feedback [64.019079257231]
We propose a novel model compression method that generates a sparse trained model without additional overhead.
We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models.
arXiv Detail & Related papers (2020-06-12T15:07:08Z) - Convolutional Tensor-Train LSTM for Spatio-temporal Learning [116.24172387469994]
We propose a higher-order LSTM model that can efficiently learn long-term correlations in the video sequence.
This is accomplished through a novel tensor train module that performs prediction by combining convolutional features across time.
Our results achieve state-of-the-art performance-art in a wide range of applications and datasets.
arXiv Detail & Related papers (2020-02-21T05:00:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.