SwiftF0: Fast and Accurate Monophonic Pitch Detection
- URL: http://arxiv.org/abs/2508.18440v1
- Date: Mon, 25 Aug 2025 19:39:20 GMT
- Title: SwiftF0: Fast and Accurate Monophonic Pitch Detection
- Authors: Lars Nieradzik
- Abstract summary: We present SwiftF0, a novel, lightweight neural model that sets a new state-of-the-art for monophonic pitch estimation. SwiftF0 achieves robust generalization across acoustic domains while maintaining computational efficiency.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Accurate and real-time monophonic pitch estimation in noisy conditions, particularly on resource-constrained devices, remains an open challenge in audio processing. We present SwiftF0, a novel, lightweight neural model that sets a new state-of-the-art for monophonic pitch estimation. Through training on diverse speech, music, and synthetic datasets with extensive data augmentation, SwiftF0 achieves robust generalization across acoustic domains while maintaining computational efficiency. SwiftF0 achieves a 91.80% harmonic mean (HM) at 10 dB SNR, outperforming baselines like CREPE by over 12 percentage points and degrading by only 2.3 points from clean audio. SwiftF0 requires only 95,842 parameters and runs approximately 42x faster than CREPE on CPU, making it ideal for efficient, real-time deployment. To address the critical lack of perfectly accurate ground truth pitch in speech corpora (which typically rely on algorithmic estimators or laryngograph signals), we introduce SpeechSynth. This synthetic speech dataset, generated by a phoneme-level TTS model, provides exact, on-demand ground-truth pitch curves, enabling more robust model training and evaluation. Furthermore, we propose a unified metric, combining six complementary performance measures for comprehensive and reliable pitch evaluation, and release an open-source pitch benchmark suite. A live demo of SwiftF0 is available at https://swift-f0.github.io/, the source code at https://github.com/lars76/swift-f0, and the benchmark framework at https://github.com/lars76/pitch-benchmark.
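The pitch-accuracy measures referenced in the abstract can be illustrated with a small sketch. This is not the paper's implementation; the 50-cent correctness threshold and the harmonic-mean combination of per-metric scores are common conventions in pitch evaluation, assumed here purely for illustration:

```python
import numpy as np

def cents_error(f0_est, f0_ref):
    # Deviation in cents between estimated and reference pitch (Hz).
    return 1200.0 * np.abs(np.log2(f0_est / f0_ref))

def raw_pitch_accuracy(f0_est, f0_ref, threshold_cents=50.0):
    # Fraction of voiced reference frames (f0_ref > 0) whose estimate
    # falls within the cent threshold. Zero estimates count as misses.
    voiced = f0_ref > 0
    err = cents_error(np.maximum(f0_est[voiced], 1e-6), f0_ref[voiced])
    return float(np.mean(err <= threshold_cents))

def harmonic_mean(scores):
    # Harmonic mean of several accuracy scores in (0, 1]; it is pulled
    # toward the weakest score, rewarding balanced performance.
    scores = np.asarray(scores, dtype=float)
    return float(len(scores) / np.sum(1.0 / scores))
```

The harmonic mean penalizes a model that excels on one measure but fails another, which is presumably why it is used to aggregate complementary metrics.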
Related papers
- Voxtral Realtime [134.66962524291424]
Voxtral Realtime is a streaming automatic speech recognition model. It matches offline transcription quality at sub-second latency. We release the model weights under the Apache 2.0 license.
arXiv Detail & Related papers (2026-02-11T19:17:10Z)
- FCPE: A Fast Context-based Pitch Estimation Model [10.788664167503676]
We propose a fast context-based pitch estimation model that captures mel spectrogram features while maintaining low computational cost and robust noise tolerance. Experiments show that our method achieves 96.79% Raw Pitch Accuracy (RPA) on the MIR-1K dataset, on par with the state-of-the-art methods.
arXiv Detail & Related papers (2025-09-18T16:50:09Z)
- MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows [13.130255838403002]
MeanAudio is a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). We demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation.
arXiv Detail & Related papers (2025-08-08T07:49:59Z)
- Lina-Speech: Gated Linear Attention is a Fast and Parameter-Efficient Learner for text-to-speech synthesis [7.2129341612013285]
We introduce Lina-Speech, a model that replaces traditional self-attention mechanisms with emerging recurrent architectures like Gated Linear Attention (GLA).
This approach is fast, easy to deploy, and achieves performance comparable to fine-tuned baselines when the dataset size ranges from 3 to 15 minutes.
arXiv Detail & Related papers (2024-10-30T04:50:40Z)
- Exploring WavLM Back-ends for Speech Spoofing and Deepfake Detection [0.0]
ASVspoof 5 Challenge Track 1: Speech Deepfake Detection - Open Condition consists of a stand-alone speech deepfake (bonafide vs spoof) detection task.
We leverage a pre-trained WavLM as a front-end model and pool its representations with different back-end techniques.
Our fused system achieves 0.0937 minDCF, 3.42% EER, 0.1927 Cllr, and 0.1375 actDCF.
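The equal error rate (EER) cited above is the operating point where the false-acceptance and false-rejection rates of a bonafide-vs-spoof detector coincide. A minimal threshold-sweep sketch, not the challenge's official scoring tool:

```python
import numpy as np

def equal_error_rate(bonafide_scores, spoof_scores):
    # Higher score = more likely bonafide. Sweep thresholds over all
    # observed scores and return the rate where FAR and FRR cross.
    thresholds = np.sort(np.concatenate([bonafide_scores, spoof_scores]))
    best_gap, eer = 1.0, 0.0
    for t in thresholds:
        far = np.mean(spoof_scores >= t)    # spoof accepted as bonafide
        frr = np.mean(bonafide_scores < t)  # bonafide rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return float(eer)
```

Production scoring tools interpolate between thresholds for an exact crossing point; this discrete sweep is only meant to show the idea.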
arXiv Detail & Related papers (2024-09-08T08:54:36Z)
- VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment [101.2489492032816]
VALL-E R is a robust and efficient zero-shot Text-to-Speech system.
This research has the potential to be applied to meaningful projects, including the creation of speech for those affected by aphasia.
arXiv Detail & Related papers (2024-06-12T04:09:44Z)
- F-COREF: Fast, Accurate and Easy to Use Coreference Resolution [48.05751101475403]
We introduce fastcoref, a Python package for fast, accurate, and easy-to-use English coreference resolution.
The model can process 2.8K OntoNotes documents in 25 seconds on a V100 GPU.
arXiv Detail & Related papers (2022-09-09T12:52:28Z)
- Fast DCTTS: Efficient Deep Convolutional Text-to-Speech [8.276202368107006]
We propose an end-to-end speech synthesizer, Fast DCTTS, that synthesizes speech in real time on a single CPU thread.
The proposed model is composed of a carefully-tuned lightweight network designed by applying multiple network reduction and fidelity improvement techniques.
arXiv Detail & Related papers (2021-04-01T17:08:01Z)
- DEEPF0: End-To-End Fundamental Frequency Estimation for Music and Speech Signals [11.939409227407769]
We propose a novel pitch estimation technique called DeepF0.
It leverages the available annotated data to directly learn from the raw audio in a data-driven manner.
arXiv Detail & Related papers (2021-02-11T23:11:22Z)
- FBWave: Efficient and Scalable Neural Vocoders for Streaming Text-To-Speech on the Edge [49.85380252780985]
We propose FBWave, a family of efficient and scalable neural vocoders.
FBWave is a hybrid flow-based generative model that combines the advantages of autoregressive and non-autoregressive models.
Our experiments show that FBWave can achieve similar audio quality to WaveRNN while reducing MACs by 40x.
arXiv Detail & Related papers (2020-11-25T19:09:49Z)
- Real-Time Execution of Large-scale Language Models on Mobile [49.32610509282623]
We search for the best BERT model structure within a given computation budget to match specific devices.
Our framework can guarantee the identified model to meet both resource and real-time specifications of mobile devices.
Specifically, our model is 5.2x faster on CPU and 4.1x faster on GPU with 0.5-2% accuracy loss compared with BERT-base.
arXiv Detail & Related papers (2020-09-15T01:59:17Z)
- ContextNet: Improving Convolutional Neural Networks for Automatic Speech Recognition with Global Context [58.40112382877868]
We propose a novel CNN-RNN-transducer architecture, which we call ContextNet.
ContextNet features a fully convolutional encoder that incorporates global context information into convolution layers by adding squeeze-and-excitation modules.
We demonstrate that ContextNet achieves a word error rate (WER) of 2.1%/4.6% without external language model (LM), 1.9%/4.1% with LM and 2.9%/7.0% with only 10M parameters on the clean/noisy LibriSpeech test sets.
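The word error rate (WER) figures above are word-level edit distances normalized by reference length. A minimal sketch of the standard computation; this is an illustrative implementation, not ContextNet's evaluation code:

```python
def word_error_rate(reference, hypothesis):
    # WER = (substitutions + insertions + deletions) / reference length,
    # computed with word-level Levenshtein dynamic programming.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # delete all remaining reference words
    for j in range(len(hyp) + 1):
        d[0][j] = j  # insert all remaining hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution/match
    return d[len(ref)][len(hyp)] / len(ref)
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, since the denominator is the reference length only.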
arXiv Detail & Related papers (2020-05-07T01:03:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.