Related papers: FCPE: A Fast Context-based Pitch Estimation Model

URL: http://arxiv.org/abs/2509.15140v1
Date: Thu, 18 Sep 2025 16:50:09 GMT
Title: FCPE: A Fast Context-based Pitch Estimation Model
Authors: Yuxin Luo, Ruoyi Zhang, Lu-Chuan Liu, Tianyu Li, Hangyu Liu,
Abstract summary: We propose a fast context-based pitch estimation model that captures mel spectrogram features while maintaining low computational cost and robust noise tolerance. Experiments show that our method achieves 96.79% Raw Pitch Accuracy (RPA) on the MIR-1K dataset, on par with the state-of-the-art methods.
Score: 10.788664167503676
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Pitch estimation (PE) in monophonic audio is crucial for MIDI transcription and singing voice conversion (SVC), but existing methods suffer significant performance degradation under noise. In this paper, we propose FCPE, a fast context-based pitch estimation model that employs a Lynx-Net architecture with depth-wise separable convolutions to effectively capture mel spectrogram features while maintaining low computational cost and robust noise tolerance. Experiments show that our method achieves 96.79% Raw Pitch Accuracy (RPA) on the MIR-1K dataset, on par with the state-of-the-art methods. The Real-Time Factor (RTF) is 0.0062 on a single RTX 4090 GPU, which significantly outperforms existing algorithms in efficiency. Code is available at https://github.com/CNChTu/FCPE.
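
The two metrics quoted in the abstract can be illustrated with a minimal sketch. It assumes the common definitions: RPA as the fraction of voiced reference frames whose estimate lies within 50 cents of the reference, and RTF as processing time divided by audio duration. The function names and sample values below are hypothetical and are not taken from the FCPE code base; the paper's exact evaluation protocol may differ.

```python
import math


def raw_pitch_accuracy(ref_hz, est_hz, tolerance_cents=50.0):
    """Fraction of voiced reference frames estimated within the tolerance.

    Assumption: a frequency of 0 marks an unvoiced frame.
    """
    voiced = 0
    correct = 0
    for ref, est in zip(ref_hz, est_hz):
        if ref <= 0:
            # Unvoiced reference frames are not counted by RPA.
            continue
        voiced += 1
        # Pitch error in cents: 1200 * log2(estimate / reference).
        if est > 0 and abs(1200.0 * math.log2(est / ref)) <= tolerance_cents:
            correct += 1
    return correct / voiced if voiced else 0.0


def real_time_factor(processing_seconds, audio_seconds):
    """Processing time per second of audio; values below 1 are faster than real time."""
    return processing_seconds / audio_seconds


# Illustrative frame-wise pitch tracks in Hz (made-up values).
reference = [220.0, 220.0, 0.0, 440.0]
estimate = [221.0, 330.0, 100.0, 0.0]

print(raw_pitch_accuracy(reference, estimate))  # 1 of 3 voiced frames is within 50 cents
print(real_time_factor(0.62, 100.0))            # 0.62 s of compute for 100 s of audio
```

Under this definition, an RTF of 0.0062 corresponds to roughly 0.62 seconds of processing for 100 seconds of audio.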

Related papers

DoPE: Denoising Rotary Position Embedding [60.779039511252584]
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Positional Encoding (DoPE). DoPE is a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map.
arXiv Detail & Related papers (2025-11-12T09:32:35Z)
SwiftF0: Fast and Accurate Monophonic Pitch Detection [2.8766374696553823]
We present SwiftF0, a novel, lightweight neural model that sets a new state-of-the-art for monophonic pitch estimation. SwiftF0 achieves robust generalization across acoustic domains while maintaining computational efficiency.
arXiv Detail & Related papers (2025-08-25T19:39:20Z)
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows [13.130255838403002]
MeanAudio is a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). We demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation.
arXiv Detail & Related papers (2025-08-08T07:49:59Z)
Boundary-aware Decoupled Flow Networks for Realistic Extreme Rescaling [49.215957313126324]
Recently developed generative methods, including invertible rescaling network (IRN) based and generative adversarial network (GAN) based methods, have demonstrated exceptional performance in image rescaling. However, IRN-based methods tend to produce over-smoothed results, while GAN-based methods easily generate fake details. We propose Boundary-aware Decoupled Flow Networks (BDFlow) to generate realistic and visually pleasing results.
arXiv Detail & Related papers (2024-05-05T14:05:33Z)
CFDP: Common Frequency Domain Pruning [0.3021678014343889]
We introduce a novel end-to-end pipeline for model pruning via the frequency domain. We have achieved state-of-the-art results on CIFAR-10 with GoogLeNet reaching an accuracy of 95.25%, that is, +0.2% from the original model. In addition to notable performances, models produced via CFDP exhibit robustness to a variety of configurations.
arXiv Detail & Related papers (2023-06-07T04:49:26Z)
Simple Pooling Front-ends For Efficient Audio Classification [56.59107110017436]
We show that eliminating the temporal redundancy in the input audio features could be an effective approach for efficient audio classification. We propose a family of simple pooling front-ends (SimPFs) which use simple non-parametric pooling operations to reduce the redundant information. SimPFs can achieve a reduction in more than half of the number of floating point operations for off-the-shelf audio neural networks.
arXiv Detail & Related papers (2022-10-03T14:00:41Z)
DEEPF0: End-To-End Fundamental Frequency Estimation for Music and Speech Signals [11.939409227407769]
We propose a novel pitch estimation technique called DeepF0. It leverages the available annotated data to directly learn from the raw audio in a data-driven manner.
arXiv Detail & Related papers (2021-02-11T23:11:22Z)
Learning based signal detection for MIMO systems with unknown noise statistics [84.02122699723536]
This paper aims to devise a generalized maximum likelihood (ML) estimator to robustly detect signals with unknown noise statistics. In practice, there is little or even no statistical knowledge on the system noise, which in many cases is non-Gaussian, impulsive and not analyzable. Our framework is driven by an unsupervised learning approach, where only the noise samples are required.
arXiv Detail & Related papers (2021-01-21T04:48:15Z)
Non-Parametric Adaptive Network Pruning [125.4414216272874]
We introduce non-parametric modeling to simplify the algorithm design. Inspired by the face recognition community, we use a message passing algorithm to obtain an adaptive number of exemplars. EPruner breaks the dependency on the training data in determining the "important" filters.
arXiv Detail & Related papers (2021-01-20T06:18:38Z)
Real Time Speech Enhancement in the Waveform Domain [99.02180506016721]
We present a causal speech enhancement model working on the raw waveform that runs in real-time on a laptop CPU. The proposed model is based on an encoder-decoder architecture with skip-connections. It is capable of removing various kinds of background noise including stationary and non-stationary noises.
arXiv Detail & Related papers (2020-06-23T09:19:13Z)

This list is automatically generated from the titles and abstracts of the papers on this site.