Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music
- URL: http://arxiv.org/abs/2601.11768v1
- Date: Fri, 16 Jan 2026 20:46:33 GMT
- Title: Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music
- Authors: Venkat Suprabath Bitra, Homayoon Beigi
- Abstract summary: We propose a lightweight, fully self-supervised framework for joint F0 estimation and voicing inference. Our method achieves competitive cross-corpus performance (RPA 95.84, RCA 96.24) and demonstrates cross-instrument generalization.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reliable fundamental frequency (F0) and voicing estimation is essential for neural synthesis, yet many pitch extractors depend on large labeled corpora and degrade under realistic recording artifacts. We propose a lightweight, fully self-supervised framework for joint F0 estimation and voicing inference, designed for rapid single-instrument training from limited audio. Using transposition-equivariant learning on CQT features, we introduce an EM-style iterative reweighting scheme that uses Shift Cross-Entropy (SCE) consistency as a reliability signal to suppress uninformative noisy/unvoiced frames. The resulting weights provide confidence scores that enable pseudo-labeling for a separate lightweight voicing classifier without manual annotations. Trained on MedleyDB and evaluated on MDB-stem-synth ground truth, our method achieves competitive cross-corpus performance (RPA 95.84, RCA 96.24) and demonstrates cross-instrument generalization.
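The abstract reports RPA (raw pitch accuracy) and RCA (raw chroma accuracy) without defining them. As a rough illustration, the sketch below computes both under the commonly used convention that an estimate counts as correct when it lies within 50 cents of the reference on a voiced frame, with RCA additionally forgiving octave errors. The function names, the 50-cent tolerance, and the use of a non-positive frequency to mark unvoiced frames are assumptions made for this example, not details taken from the paper.

```python
import math


def hz_to_cents(f_hz, ref_hz=10.0):
    """Convert a frequency in Hz to cents relative to ref_hz (arbitrary reference)."""
    return 1200.0 * math.log2(f_hz / ref_hz)


def raw_pitch_and_chroma_accuracy(ref_hz, est_hz, tol_cents=50.0):
    """Return (RPA, RCA) as fractions over frames voiced in the reference.

    Assumptions for this sketch: a reference value <= 0 marks an unvoiced
    frame (skipped), and an estimate <= 0 on a voiced frame is an error.
    """
    voiced = hits_pitch = hits_chroma = 0
    for r, e in zip(ref_hz, est_hz):
        if r <= 0:
            continue  # unvoiced in the reference: not scored
        voiced += 1
        if e <= 0:
            continue  # no pitch estimated on a voiced frame: miss
        diff = hz_to_cents(e) - hz_to_cents(r)
        if abs(diff) <= tol_cents:
            hits_pitch += 1
        # Fold the error into [-600, 600) cents so octave errors are ignored.
        folded = (diff + 600.0) % 1200.0 - 600.0
        if abs(folded) <= tol_cents:
            hits_chroma += 1
    if voiced == 0:
        return 0.0, 0.0
    return hits_pitch / voiced, hits_chroma / voiced


# Toy frames: a near-exact hit, an octave error, an unvoiced frame, a missed frame.
ref = [440.0, 440.0, 0.0, 220.0]
est = [441.0, 880.0, 300.0, 0.0]
rpa, rca = raw_pitch_and_chroma_accuracy(ref, est)
print(f"RPA={rpa:.3f} RCA={rca:.3f}")
```

In the toy input, the octave error on the second frame lowers RPA but not RCA, which is why RCA is never smaller than RPA for the same estimates.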
Related papers
- Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition [61.39209522608919]
Unified Speech Recognition has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition. We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets. Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously.
arXiv Detail & Related papers (2026-02-22T19:38:21Z) - Learning to Separate RF Signals Under Uncertainty: Detect-Then-Separate vs. Unified Joint Models [53.79667447811139]
We show that a single deep neural architecture learns to jointly detect and separate when applied directly to the received signal. These findings highlight UJM as a scalable and practical alternative to DTS, while opening new directions for unified separation under broader estimation.
arXiv Detail & Related papers (2026-02-04T15:25:02Z) - Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion [0.0]
In bioacoustic classification, species identity may be inferred both from the acoustic signal and from context such as location and season. We introduce FINCH, an adaptive log-linear evidence fusion framework that integrates a pre-trained audio classifier with a structured spatiotemporal predictor. FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs.
arXiv Detail & Related papers (2026-02-03T18:21:13Z) - Domain-Incremental Continual Learning for Robust and Efficient Keyword Spotting in Resource Constrained Systems [0.0]
Keyword Spotting systems with small-footprint models deployed on edge devices face significant accuracy and robustness challenges. We propose a comprehensive framework for continual learning designed to adapt to new domains while maintaining computational efficiency. The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features.
arXiv Detail & Related papers (2026-01-22T17:59:31Z) - Noise-Adaptive Regularization for Robust Multi-Label Remote Sensing Image Classification [5.658568324275769]
We propose NAR, a noise-adaptive regularization method that distinguishes between additive and subtractive noise. NAR consistently improves robustness compared with existing methods. Performance improvements are most pronounced under subtractive and mixed noise.
arXiv Detail & Related papers (2026-01-13T11:16:45Z) - Exploiting Radio Frequency Fingerprints for Device Identification: Tackling Cross-receiver Challenges in the Source-data-free Scenario [17.211137756661955]
We present a source-data-free cross-receiver RFFI problem, where a model pretrained on labeled signals from a source receiver must adapt to unlabeled signals from a target receiver. We propose Momentum Soft pseudo-label Source Hypothesis Transfer (MS-SHOT), a new method for SCRFFI that incorporates momentum-center-guided soft pseudo-labeling and enforces global structural constraints. MS-SHOT consistently outperforms existing approaches in both accuracy and robustness, offering a practical and scalable solution for source-data-free cross-receiver adaptation in RFFI.
arXiv Detail & Related papers (2025-12-18T15:20:33Z) - Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment [5.380078543698624]
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization. We introduce a method that combines cross-attention mechanisms with confidence-weighted scoring to improve multi-channel audio synchronization. Our method achieved first place in the BioDCASE 2025 Task 1 challenge with 0.30 MSE average across test datasets, compared to 0.58 for the deep learning baseline.
arXiv Detail & Related papers (2025-09-21T05:14:06Z) - Reproducible Machine Learning-based Voice Pathology Detection: Introducing the Pitch Difference Feature [1.7779568951268254]
We introduce a novel methodology for voice pathology detection using the publicly available Saarbrücken Voice Database. We evaluate six machine learning (ML) algorithms: support vector machine, k-nearest neighbors, naive Bayes, decision tree, random forest, and AdaBoost. Our approach achieves 85.61%, 84.69%, and 85.22% unweighted average recall (UAR) for females, males, and combined results, respectively.
arXiv Detail & Related papers (2024-10-14T14:17:52Z) - SSP-RACL: Classification of Noisy Fundus Images with Self-Supervised Pretraining and Robust Adaptive Credal Loss [3.8739860035485143]
Fundus image classification is crucial in the computer aided diagnosis tasks, but label noise significantly impairs the performance of deep neural networks.
We propose a robust framework, Self-Supervised Pre-training with Robust Adaptive Credal Loss (SSP-RACL), for handling label noise in fundus image datasets.
arXiv Detail & Related papers (2024-09-25T02:41:58Z) - Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z) - Confidence-aware Training of Smoothed Classifiers for Certified Robustness [75.95332266383417]
We use "accuracy under Gaussian noise" as an easy-to-compute proxy of adversarial robustness for an input.
Our experiments show that the proposed method consistently exhibits improved certified robustness upon state-of-the-art training methods.
arXiv Detail & Related papers (2022-12-18T03:57:12Z) - Disentangled Representation Learning for RF Fingerprint Extraction under Unknown Channel Statistics [77.13542705329328]
We propose a framework of disentangled representation learning (DRL) that first learns to factor the input signals into a device-relevant component and a device-irrelevant component via adversarial learning.
The implicit data augmentation in the proposed framework imposes a regularization on the RFF extractor to avoid the possible overfitting of device-irrelevant channel statistics.
Experiments validate that the proposed approach, referred to as DR-RFF, outperforms conventional methods in terms of generalizability to unknown complicated propagation environments.
arXiv Detail & Related papers (2022-08-04T15:46:48Z) - Model-based Deep Learning Receiver Design for Rate-Splitting Multiple Access [65.21117658030235]
This work proposes a novel design for a practical RSMA receiver based on model-based deep learning (MBDL) methods.
The MBDL receiver is evaluated in terms of uncoded Symbol Error Rate (SER), throughput performance through Link-Level Simulations (LLS) and average training overhead.
Results reveal that the MBDL receiver outperforms the SIC receiver with imperfect CSIR by a significant margin.
arXiv Detail & Related papers (2022-05-02T12:23:55Z) - S3: Supervised Self-supervised Learning under Label Noise [53.02249460567745]
In this paper we address the problem of classification in the presence of label noise.
At the heart of our method is a sample selection mechanism that relies on the consistency between the annotated label of a sample and the distribution of the labels in its neighborhood in the feature space.
Our method significantly surpasses previous methods on both CIFAR10 and CIFAR100 with artificial noise and real-world noisy datasets such as WebVision and ANIMAL-10N.
arXiv Detail & Related papers (2021-11-22T15:49:20Z)
This list is automatically generated from the titles and abstracts of the papers in this site.