A Real-Time Lyrics Alignment System Using Chroma And Phonetic Features
For Classical Vocal Performance
- URL: http://arxiv.org/abs/2401.09200v1
- Date: Wed, 17 Jan 2024 13:25:32 GMT
- Title: A Real-Time Lyrics Alignment System Using Chroma And Phonetic Features
For Classical Vocal Performance
- Authors: Jiyun Park, Sangeon Yong, Taegyun Kwon, and Juhan Nam
- Abstract summary: The goal of real-time lyrics alignment is to take live singing audio as input and pinpoint the exact position within given lyrics on the fly.
The task can benefit real-world applications such as the automatic subtitling of live concerts or operas.
This paper presents a real-time lyrics alignment system for classical vocal performances with two contributions.
- Score: 7.488651253072641
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The goal of real-time lyrics alignment is to take live singing audio as input
and to pinpoint the exact position within given lyrics on the fly. The task can
benefit real-world applications such as the automatic subtitling of live
concerts or operas. However, designing a real-time model poses a great
challenge due to the constraints of only using past input and operating within
a minimal latency. Furthermore, due to the lack of datasets for real-time
models for lyrics alignment, previous studies have mostly evaluated with
private in-house datasets, resulting in a lack of standard evaluation methods.
This paper presents a real-time lyrics alignment system for classical vocal
performances with two contributions. First, we improve the lyrics alignment
algorithm by finding an optimal combination of chromagram and phonetic
posteriorgram (PPG) that capture melodic and phonetics features of the singing
voice, respectively. Second, we recast the Schubert Winterreise Dataset (SWD)
which contains multiple performance renditions of the same pieces as an
evaluation set for the real-time lyrics alignment.
Related papers
- Automatic Estimation of Singing Voice Musical Dynamics [9.343063100314687]
We propose a methodology for dataset curation.
We compile a dataset comprising 509 musical dynamics annotated singing voice performances, aligned with 163 score files.
We train a CNN model with varying window sizes to evaluate the effectiveness of estimating musical dynamics.
We conclude through our experiments that bark-scale based features outperform log-Mel-features for the task of singing voice dynamics prediction.
arXiv Detail & Related papers (2024-10-27T18:15:18Z) - Lyrics Transcription for Humans: A Readability-Aware Benchmark [1.2499537119440243]
We introduce Jam-ALT, a comprehensive lyrics transcription benchmark.
The benchmark features a complete revision of the JamendoLyrics dataset, along with evaluation metrics designed to capture and assess the lyric-specific nuances.
We apply the benchmark to recent transcription systems and present additional error analysis, as well as an experimental comparison with a classical music dataset.
arXiv Detail & Related papers (2024-07-30T14:20:09Z) - Cluster and Separate: a GNN Approach to Voice and Staff Prediction for Score Engraving [5.572472212662453]
This paper approaches the problem of separating the notes from a quantized symbolic music piece (e.g., a MIDI file) into multiple voices and staves.
We propose an end-to-end system based on graph neural networks that notes that belong to the same chord and connect them with edges if they are part of a voice.
arXiv Detail & Related papers (2024-07-15T14:36:13Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of the single-speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - Unsupervised Melody-to-Lyric Generation [91.29447272400826]
We propose a method for generating high-quality lyrics without training on any aligned melody-lyric data.
We leverage the segmentation and rhythm alignment between melody and lyrics to compile the given melody into decoding constraints.
Our model can generate high-quality lyrics that are more on-topic, singable, intelligible, and coherent than strong baselines.
arXiv Detail & Related papers (2023-05-30T17:20:25Z) - RMSSinger: Realistic-Music-Score based Singing Voice Synthesis [56.51475521778443]
RMS-SVS aims to generate high-quality singing voices given realistic music scores with different note types.
We propose RMSSinger, the first RMS-SVS method, which takes realistic music scores as input.
In RMSSinger, we introduce word-level modeling to avoid the time-consuming phoneme duration annotation and the complicated phoneme-level mel-note alignment.
arXiv Detail & Related papers (2023-05-18T03:57:51Z) - Unsupervised Melody-Guided Lyrics Generation [84.22469652275714]
We propose to generate pleasantly listenable lyrics without training on melody-lyric aligned data.
We leverage the crucial alignments between melody and lyrics and compile the given melody into constraints to guide the generation process.
arXiv Detail & Related papers (2023-05-12T20:57:20Z) - AlignSTS: Speech-to-Singing Conversion via Cross-Modal Alignment [67.10208647482109]
The speech-to-singing (STS) voice conversion task aims to generate singing samples corresponding to speech recordings.
This paper proposes AlignSTS, an STS model based on explicit cross-modal alignment.
Experiments show that AlignSTS achieves superior performance in terms of both objective and subjective metrics.
arXiv Detail & Related papers (2023-05-08T06:02:10Z) - A Melody-Unsupervision Model for Singing Voice Synthesis [9.137554315375919]
We propose a melody-unsupervision model that requires only audio-and-lyrics pairs without temporal alignment in training time.
We show that the proposed model is capable of being trained with speech audio and text labels but can generate singing voice in inference time.
arXiv Detail & Related papers (2021-10-13T07:42:35Z) - SongMASS: Automatic Song Writing with Pre-training and Alignment
Constraint [54.012194728496155]
SongMASS is proposed to overcome the challenges of lyric-to-melody generation and melody-to-lyric generation.
It leverages masked sequence to sequence (MASS) pre-training and attention based alignment modeling.
We show that SongMASS generates lyric and melody with significantly better quality than the baseline method.
arXiv Detail & Related papers (2020-12-09T16:56:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.