PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective
- URL: http://arxiv.org/abs/2508.01488v2
- Date: Mon, 27 Oct 2025 11:55:07 GMT
- Title: PESTO: Real-Time Pitch Estimation with Self-supervised Transposition-equivariant Objective
- Authors: Alain Riou, Bernardo Torres, Ben Hayes, Stefan Lattner, Gaëtan Hadjeres, Gaël Richard, Geoffroy Peeters
- Abstract summary: PESTO is a self-supervised learning approach for single-pitch estimation. We develop a streamable VQT implementation using cached convolutions.
- Score: 28.829305407116962
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we introduce PESTO, a self-supervised learning approach for single-pitch estimation using a Siamese architecture. Our model processes individual frames of a Variable-$Q$ Transform (VQT) and predicts pitch distributions. The neural network is designed to be equivariant to translations, notably thanks to a Toeplitz fully-connected layer. In addition, we construct pitch-shifted pairs by translating and cropping the VQT frames and train our model with a novel class-based transposition-equivariant objective, eliminating the need for annotated data. Thanks to this architecture and training objective, our model achieves remarkable performances while being very lightweight ($130$k parameters). Evaluations on music and speech datasets (MIR-1K, MDB-stem-synth, and PTDB) demonstrate that PESTO not only outperforms self-supervised baselines but also competes with supervised methods, exhibiting superior cross-dataset generalization. Finally, we enhance PESTO's practical utility by developing a streamable VQT implementation using cached convolutions. Combined with our model's low latency (less than 10 ms) and minimal parameter count, this makes PESTO particularly suitable for real-time applications.
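The two ideas in the abstract, building pitch-shifted pairs by cropping a log-frequency frame and using a Toeplitz layer so that a translated input gives a translated output, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the helper names `crop_pair` and `toeplitz_matrix`, the synthetic Gaussian "frame", the bin counts, and the random kernel are all assumptions chosen for illustration.

```python
import numpy as np


def crop_pair(frame, k, max_shift):
    """Crop two equal-length views of one log-frequency frame, offset by k bins.

    The content of the second view sits k bins higher than in the first,
    which mimics a transposition by k bins (hypothetical helper).
    """
    if abs(k) > max_shift:
        raise ValueError("|k| must not exceed max_shift")
    n = len(frame) - 2 * max_shift
    base = frame[max_shift : max_shift + n]
    shifted = frame[max_shift - k : max_shift - k + n]
    return base, shifted


def toeplitz_matrix(kernel, n_in):
    """Build a Toeplitz weight matrix with W[i, j] = kernel[i - j].

    Multiplying by W is the same as a full 1D convolution with `kernel`.
    """
    m = len(kernel)
    W = np.zeros((n_in + m - 1, n_in))
    for j in range(n_in):
        W[j : j + m, j] = kernel
    return W


# Synthetic frame: a single narrow peak at bin 20 (stand-in for a VQT frame).
bins = np.arange(40)
frame = np.exp(-0.5 * (bins - 20.0) ** 2)

k = 3
base, shifted = crop_pair(frame, k, max_shift=5)

# Random kernel: equivariance comes from the Toeplitz structure, not the values.
rng = np.random.default_rng(0)
kernel = rng.normal(size=7)
W = toeplitz_matrix(kernel, n_in=len(base))
y_base, y_shifted = W @ base, W @ shifted

# The peak moves by k bins between the two crops ...
print(np.argmax(shifted) - np.argmax(base))
# ... and the layer output moves by the same k bins.
print(np.allclose(y_shifted[k:], y_base[:-k]))
```

Because the matrix is constant along its diagonals, it has only as many free weights as the kernel has entries, which is consistent with the small parameter count the abstract emphasizes. The equivariance check is exact here only because the synthetic frame is essentially zero near the crop edges.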
Related papers
- Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols [123.73663884421272]
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms.
We establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets.
By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research.
arXiv Detail & Related papers (2026-02-28T05:41:57Z)
- Bottleneck Transformer-Based Approach for Improved Automatic STOI Score Prediction [16.426476430697587]
We present a novel approach to predict the Short-Time Objective Intelligibility (STOI) metric using a bottleneck transformer architecture.
Our model has shown higher correlation and lower mean squared error for both seen and unseen scenarios.
arXiv Detail & Related papers (2026-02-17T10:46:54Z)
- Beyond Sequences: A Benchmark for Atomic Hand-Object Interaction Using a Static RNN Encoder [0.0]
We introduce a structured data engineering process that converts raw videos from the MANIAC dataset into 27,476 statistical-kinematic feature vectors.
Our model successfully overcame the most challenging transitional class, 'grabbing', by achieving a balanced F1-score of 0.90.
These findings provide a new benchmark for low-level hand-object interaction recognition using structured, interpretable features and lightweight architectures.
arXiv Detail & Related papers (2025-12-10T13:11:43Z)
- Test time training enhances in-context learning of nonlinear functions [51.56484100374058]
Test-time training (TTT) enhances model performance by explicitly updating designated parameters prior to each prediction.
We investigate the combination of TTT with in-context learning (ICL), where the model is given a few examples from the target distribution at inference time.
arXiv Detail & Related papers (2025-09-30T03:56:44Z)
- Estimating Time Series Foundation Model Transferability via In-Context Learning [74.65355820906355]
Time series foundation models (TSFMs) offer strong zero-shot forecasting via large-scale pre-training.
Fine-tuning remains critical for boosting performance in domains with limited public data.
We introduce TimeTic, a transferability estimation framework that recasts model selection as an in-context-learning problem.
arXiv Detail & Related papers (2025-09-28T07:07:13Z)
- ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z)
- Are Self-Attentions Effective for Time Series Forecasting? [4.990206466948269]
Time series forecasting is crucial for applications across multiple domains and various scenarios.
Recent findings have indicated that simpler linear models might outperform complex Transformer-based approaches.
We introduce a new architecture, Cross-Attention-only Time Series transformer (CATS).
Our model achieves superior performance with the lowest mean squared error and uses fewer parameters compared to existing models.
arXiv Detail & Related papers (2024-05-27T06:49:39Z)
- Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models [7.928003786376716]
We propose novel architectures for convolutional recurrent neural networks.
We improve note-state sequence modeling by using a pitchwise LSTM.
We show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset.
arXiv Detail & Related papers (2024-04-10T08:06:15Z)
- Structural Self-Supervised Objectives for Transformers [3.018656336329545]
This thesis focuses on improving the pre-training of natural language models using unsupervised raw data.
In the first part, we introduce three alternative pre-training objectives to BERT's Masked Language Modeling (MLM).
In the second part, we propose self-supervised pre-training tasks that align structurally with downstream applications.
arXiv Detail & Related papers (2023-09-15T09:30:45Z)
- Towards Unified Token Learning for Vision-Language Tracking [65.96561538356315]
We present a vision-language (VL) tracking pipeline, termed MMTrack, which casts VL tracking as a token generation task.
Our proposed framework serializes language description and bounding box into a sequence of discrete tokens.
In this new design paradigm, all token queries are required to perceive the desired target and directly predict spatial coordinates of the target.
arXiv Detail & Related papers (2023-08-27T13:17:34Z)
- Consensus-Adaptive RANSAC [104.87576373187426]
We propose a new RANSAC framework that learns to explore the parameter space by considering the residuals seen so far via a novel attention layer.
The attention mechanism operates on a batch of point-to-model residuals, and updates a per-point estimation state to take into account the consensus found through a lightweight one-step transformer.
arXiv Detail & Related papers (2023-07-26T08:25:46Z)
- How to Estimate Model Transferability of Pre-Trained Speech Models? [84.11085139766108]
We introduce a "score-based assessment" framework for estimating the transferability of pre-trained speech models (PSMs).
We leverage two representation theories, Bayesian likelihood estimation and optimal transport, to generate rank scores for the PSM candidates.
Our framework efficiently computes transferability scores without actual fine-tuning of candidate models or layers.
arXiv Detail & Related papers (2023-06-01T04:52:26Z)
- Self-Supervised Pre-Training for Transformer-Based Person Re-Identification [54.55281692768765]
Transformer-based supervised pre-training achieves great performance in person re-identification (ReID).
Due to the domain gap between ImageNet and ReID datasets, it usually needs a larger pre-training dataset to boost the performance.
This work aims to mitigate the gap between the pre-training and ReID datasets from the perspective of data and model structure.
arXiv Detail & Related papers (2021-11-23T18:59:08Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into the object detection result.
Validation on the recent transformer-based image recognition model ViT also shows a consistent efficiency gain.
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
- DP-NormFedAvg: Normalizing Client Updates for Privacy-Preserving Federated Learning [48.064786028195506]
We propose to have the clients send a private quantized version of only the unit vector of their update, discarding the magnitude information.
We also introduce QTDL, a new differentially private quantization mechanism for unit-norm vectors.
arXiv Detail & Related papers (2021-06-13T21:23:46Z)
This list is automatically generated from the titles and abstracts of the papers on this site.