Related papers: Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music

Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music

URL: http://arxiv.org/abs/2601.11768v1
Date: Fri, 16 Jan 2026 20:46:33 GMT
Title: Lightweight Self-Supervised Detection of Fundamental Frequency and Accurate Probability of Voicing in Monophonic Music
Authors: Venkat Suprabath Bitra, Homayoon Beigi,
Abstract summary: We propose a lightweight, fully self-supervised framework for joint F 0 estimation and voicing inference.<n>Our method achieves competitive cross-corpus performance (RPA 95.84, RCA 96.24) and demonstrates cross-instrument generalization.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reliable fundamental frequency (F 0) and voicing estimation is essential for neural synthesis, yet many pitch extractors depend on large labeled corpora and degrade under realistic recording artifacts. We propose a lightweight, fully self-supervised framework for joint F 0 estimation and voicing inference, designed for rapid single-instrument training from limited audio. Using transposition-equivariant learning on CQT features, we introduce an EM-style iterative reweighting scheme that uses Shift Cross-Entropy (SCE) consistency as a reliability signal to suppress uninformative noisy/unvoiced frames. The resulting weights provide confidence scores that enable pseudo-labeling for a separate lightweight voicing classifier without manual annotations. Trained on MedleyDB and evaluated on MDB-stem-synth ground truth, our method achieves competitive cross-corpus performance (RPA 95.84, RCA 96.24) and demonstrates cross-instrument generalization.

Related papers

Pay Attention to CTC: Fast and Robust Pseudo-Labelling for Unified Speech Recognition [61.39209522608919]
Unified Speech Recognition has emerged as a semi-supervised framework for training a single model for audio, visual, and audiovisual speech recognition.<n>We propose CTC-driven teacher forcing, where greedily decoded CTC pseudo-labels are fed into the decoder to generate attention targets.<n>Because CTC and CTC-driven attention pseudo-labels have the same length, the decoder can predict both simultaneously.
arXiv Detail & Related papers (2026-02-22T19:38:21Z)
Learning to Separate RF Signals Under Uncertainty: Detect-Then-Separate vs. Unified Joint Models [53.79667447811139]
We show that a single deep neural architecture learns to jointly detect and separate when applied directly to the received signal.<n>These findings highlight UJM as a scalable and practical alternative to DTS, while opening new directions for unified separation under broader estimation.
arXiv Detail & Related papers (2026-02-04T15:25:02Z)
Adaptive Evidence Weighting for Audio-Spatiotemporal Fusion [0.0]
In bioacoustic classification, species identity may be inferred both from the acoustic signal and from context as location and season.<n>We introduce FINCH, an adaptive log-linear evidence fusion framework that integrates a pre-trainedtext audio classifier with a structuredtemporal predictor.<n>FINCH consistently outperforms fixed-weight fusion and audio-only baselines, improving robustness and error trade-offs.
arXiv Detail & Related papers (2026-02-03T18:21:13Z)
Domain-Incremental Continual Learning for Robust and Efficient Keyword Spotting in Resource Constrained Systems [0.0]
Keywords Spotting systems with small footprint models deployed on edge devices face significant accuracy and robustness challenges.<n>We propose a comprehensive framework for continual learning designed to adapt to new domains while maintaining computational efficiency.<n>The proposed pipeline integrates a dual-input Convolutional Neural Network, utilizing both Mel Frequency Cepstral Coefficients (MFCC) and Mel-spectrogram features.
arXiv Detail & Related papers (2026-01-22T17:59:31Z)
Noise-Adaptive Regularization for Robust Multi-Label Remote Sensing Image Classification [5.658568324275769]
We propose NAR, a noise-adaptive regularization method that distinguishes between additive and subtractive noise.<n> NAR consistently improves robustness compared with existing methods.<n>Performance improvements are most pronounced under subtractive and mixed noise.
arXiv Detail & Related papers (2026-01-13T11:16:45Z)
Exploiting Radio Frequency Fingerprints for Device Identification: Tackling Cross-receiver Challenges in the Source-data-free Scenario [17.211137756661955]
We present a source-data-free cross-receiver RFFI problem, where a model pretrained on labeled signals from a source receiver must adapt to unlabeled signals from a target receiver.<n>We propose Momentum Soft pseudo-label Source Hypothesis Transfer (MS-SHOT), a new method for SCRFFI that incorporates momentum-center-guided soft pseudo-labeling and enforces global structural constraints.<n>MS-SHOT consistently outperforms existing approaches in both accuracy and robustness, offering a practical and scalable solution for source-data-free cross-receiver adaptation in RFFI.
arXiv Detail & Related papers (2025-12-18T15:20:33Z)
Cross-Attention with Confidence Weighting for Multi-Channel Audio Alignment [5.380078543698624]
Multi-channel audio alignment is a key requirement in bioacoustic monitoring, spatial audio systems, and acoustic localization.<n>We introduce a method that combines cross-attention mechanisms with confidence-weighted scoring to improve multi-channel audio synchronization.<n>Our method achieved first place in the BioDCASE 2025 Task 1 challenge with 0.30 MSE average across test datasets, compared to 0.58 for the deep learning baseline.
arXiv Detail & Related papers (2025-09-21T05:14:06Z)
Reproducible Machine Learning-based Voice Pathology Detection: Introducing the Pitch Difference Feature [1.7779568951268254]
We introduce a novel methodology for voice pathology detection using the publicly available Saarbr"ucken Voice Database.<n>We evaluate six machine learning (ML) algorithms -- support vector machine, k-nearest neighbors, naive Bayes, decision tree, random forest, and AdaBoost.<n>Our approach 85.61%, 84.69% and 85.22% unweighted average recall (UAR) for females, males and combined results respectively.
arXiv Detail & Related papers (2024-10-14T14:17:52Z)
SSP-RACL: Classification of Noisy Fundus Images with Self-Supervised Pretraining and Robust Adaptive Credal Loss [3.8739860035485143]
Fundus image classification is crucial in the computer aided diagnosis tasks, but label noise significantly impairs the performance of deep neural networks. We propose a robust framework, Self-Supervised Pre-training with Robust Adaptive Credal Loss (SSP-RACL), for handling label noise in fundus image datasets.
arXiv Detail & Related papers (2024-09-25T02:41:58Z)
Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious in practice. We introduce a novel noisy correspondence learning framework, namely textbfSelf-textbfReinforcing textbfErrors textbfMitigation (SREM)
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
Confidence-aware Training of Smoothed Classifiers for Certified Robustness [75.95332266383417]
We use "accuracy under Gaussian noise" as an easy-to-compute proxy of adversarial robustness for an input. Our experiments show that the proposed method consistently exhibits improved certified robustness upon state-of-the-art training methods.
arXiv Detail & Related papers (2022-12-18T03:57:12Z)
Disentangled Representation Learning for RF Fingerprint Extraction under Unknown Channel Statistics [77.13542705329328]
We propose a framework of disentangled representation learning(DRL) that first learns to factor the input signals into a device-relevant component and a device-irrelevant component via adversarial learning. The implicit data augmentation in the proposed framework imposes a regularization on the RFF extractor to avoid the possible overfitting of device-irrelevant channel statistics. Experiments validate that the proposed approach, referred to as DR-RFF, outperforms conventional methods in terms of generalizability to unknown complicated propagation environments.
arXiv Detail & Related papers (2022-08-04T15:46:48Z)
Model-based Deep Learning Receiver Design for Rate-Splitting Multiple Access [65.21117658030235]
This work proposes a novel design for a practical RSMA receiver based on model-based deep learning (MBDL) methods. The MBDL receiver is evaluated in terms of uncoded Symbol Error Rate (SER), throughput performance through Link-Level Simulations (LLS) and average training overhead. Results reveal that the MBDL outperforms by a significant margin the SIC receiver with imperfect CSIR.
arXiv Detail & Related papers (2022-05-02T12:23:55Z)
S3: Supervised Self-supervised Learning under Label Noise [53.02249460567745]
In this paper we address the problem of classification in the presence of label noise. In the heart of our method is a sample selection mechanism that relies on the consistency between the annotated label of a sample and the distribution of the labels in its neighborhood in the feature space. Our method significantly surpasses previous methods on both CIFARCIFAR100 with artificial noise and real-world noisy datasets such as WebVision and ANIMAL-10N.
arXiv Detail & Related papers (2021-11-22T15:49:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.