BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference
- URL: http://arxiv.org/abs/2511.20006v1
- Date: Tue, 25 Nov 2025 07:16:49 GMT
- Title: BERT-APC: A Reference-free Framework for Automatic Pitch Correction via Musical Context Inference
- Authors: Sungjae Kim, Kihyun Na, Jinyoung Choi, Injung Kim,
- Abstract summary: BERT-APC is a novel reference-free Automatic Pitch Correction framework.<n>It corrects pitch errors while maintaining the natural expressiveness of vocal performances.<n>BERT-APC demonstrated superior performance in note pitch prediction, outperforming the second-best model, ROSVOT, by 10.49%p on highly detuned samples.
- Score: 6.7611107349018456
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic Pitch Correction (APC) enhances vocal recordings by aligning pitch deviations with the intended musical notes. However, existing APC systems either rely on reference pitches, which limits their practical applicability, or employ simple pitch estimation algorithms that often fail to preserve expressiveness and naturalness. We propose BERT-APC, a novel reference-free APC framework that corrects pitch errors while maintaining the natural expressiveness of vocal performances. In BERT-APC, a novel stationary pitch predictor first estimates the perceived pitch of each note from the detuned singing voice. A context-aware note pitch predictor estimates the intended pitch sequence by leveraging a music language model repurposed to incorporate musical context. Finally, a note-level correction algorithm fixes pitch errors while preserving intentional pitch deviations for emotional expression. In addition, we introduce a learnable data augmentation strategy that improves the robustness of the music language model by simulating realistic detuning patterns. Compared to two recent singing voice transcription models, BERT-APC demonstrated superior performance in note pitch prediction, outperforming the second-best model, ROSVOT, by 10.49%p on highly detuned samples in terms of the raw pitch accuracy. In the MOS test, BERT-APC achieved the highest score of $4.32 \pm 0.15$, which is significantly higher than those of the widely-used commercial APC tools, AutoTune ($3.22 \pm 0.18$) and Melodyne ($3.08 \pm 0.18$), while maintaining a comparable ability to preserve expressive nuances. To the best of our knowledge, this is the first APC model that leverages a music language model to achieve reference-free pitch correction with symbolic musical context. The corrected audio samples of BERT-APC are available online.
Related papers
- EditGen: Harnessing Cross-Attention Control for Instruction-Based Auto-Regressive Audio Editing [54.10773655199149]
We investigate leveraging cross-attention control for efficient audio editing within auto-regressive models.<n>Inspired by image editing methodologies, we develop a Prompt-to-Prompt-like approach that guides edits through cross and self-attention mechanisms.
arXiv Detail & Related papers (2025-07-15T08:44:11Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of the single-speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z) - AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z) - FretNet: Continuous-Valued Pitch Contour Streaming for Polyphonic Guitar
Tablature Transcription [0.34376560669160383]
In certain applications, such as Guitar Tablature Transcription (GTT), it is more meaningful to estimate continuous-valued pitch contours.
We present a GTT formulation that estimates continuous-valued pitch contours, grouping them according to their string and fret of origin.
We demonstrate that for this task, the proposed method significantly improves the resolution of MPE and simultaneously yields tablature estimation results competitive with baseline models.
arXiv Detail & Related papers (2022-12-06T14:51:27Z) - Unaligned Supervision For Automatic Music Transcription in The Wild [1.2183405753834562]
NoteEM is a method for simultaneously training a transcriber and aligning the scores to their corresponding performances.
We report SOTA note-level accuracy of the MAPS dataset, and large favorable margins on cross-dataset evaluations.
arXiv Detail & Related papers (2022-04-28T17:31:43Z) - Enhancement of Pitch Controllability using Timbre-Preserving Pitch
Augmentation in FastPitch [3.858078488714278]
We propose two algorithms to improve the robustness of FastPitch.
First, we propose a novel timbre-preserving pitch-shifting algorithm for natural pitch augmentation.
The experimental results demonstrate that the proposed algorithms improve the pitch controllability of FastPitch.
arXiv Detail & Related papers (2022-04-12T12:48:06Z) - Learning the Beauty in Songs: Neural Singing Voice Beautifier [69.21263011242907]
We are interested in a novel task, singing voice beautifying (SVB)
Given the singing voice of an amateur singer, SVB aims to improve the intonation and vocal tone of the voice, while keeping the content and vocal timbre.
We introduce Neural Singing Voice Beautifier (NSVB), the first generative model to solve the SVB task.
arXiv Detail & Related papers (2022-02-27T03:10:12Z) - Generating Music with a Self-Correcting Non-Chronological Autoregressive
Model [6.289267097017553]
We describe a novel approach for generating music using a self-correcting, non-chronological, autoregressive model.
We represent music as a sequence of edit events, each of which denotes either the addition or removal of a note.
During inference, we generate one edit event at a time using direct ancestral sampling.
arXiv Detail & Related papers (2020-08-18T20:36:47Z) - Unsupervised Cross-Domain Singing Voice Conversion [105.1021715879586]
We present a wav-to-wav generative model for the task of singing voice conversion from any identity.
Our method utilizes both an acoustic model, trained for the task of automatic speech recognition, together with melody extracted features to drive a waveform-based generator.
arXiv Detail & Related papers (2020-08-06T18:29:11Z) - Deep Autotuner: a Pitch Correcting Network for Singing Performances [26.019582802302033]
We introduce a data-driven approach to automatic pitch correction of solo singing performances.
We train our neural network model using a dataset of 4,702 amateur karaoke performances selected for good intonation.
The proposed deep neural network with gated recurrent units on top of convolutional layers shows promising performance on the real-world score-free singing pitch correction task of autotuning.
arXiv Detail & Related papers (2020-02-12T01:33:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.