LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning
- URL: http://arxiv.org/abs/2507.04966v1
- Date: Mon, 07 Jul 2025 13:09:36 GMT
- Title: LAPS-Diff: A Diffusion-Based Framework for Singing Voice Synthesis With Language Aware Prosody-Style Guided Learning
- Authors: Sandipan Dhar, Mayank Gupta, Preeti Rao
- Abstract summary: We propose LAPS-Diff, a diffusion model integrated with language-aware embeddings and a vocal-style guided learning mechanism. We curate a Hindi SVS dataset and leverage pre-trained language models to extract word and phone-level embeddings for an enriched lyrics representation. We demonstrate that LAPS-Diff significantly improves the quality of the generated samples compared to the considered state-of-the-art (SOTA) model for our constrained dataset.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The field of Singing Voice Synthesis (SVS) has seen significant advancements in recent years due to the rapid progress of diffusion-based approaches. However, capturing vocal style, genre-specific pitch inflections, and language-dependent characteristics remains challenging, particularly in low-resource scenarios. To address this, we propose LAPS-Diff, a diffusion model integrated with language-aware embeddings and a vocal-style guided learning mechanism, specifically designed for Bollywood Hindi singing style. We curate a Hindi SVS dataset and leverage pre-trained language models to extract word and phone-level embeddings for an enriched lyrics representation. Additionally, we incorporate a style encoder and a pitch extraction model to compute style and pitch losses, capturing features essential to the naturalness and expressiveness of the synthesized singing, particularly in terms of vocal style and pitch variations. Furthermore, we utilize MERT and IndicWav2Vec models to extract musical and contextual embeddings, serving as conditional priors to further refine the acoustic feature generation process. Based on objective and subjective evaluations, we demonstrate that LAPS-Diff significantly improves the quality of the generated samples compared to the considered state-of-the-art (SOTA) model on our constrained dataset, which is typical of low-resource scenarios.
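The abstract describes a composite training objective: the usual diffusion denoising loss augmented with a style loss (from a style encoder) and a pitch loss (from a pitch extraction model). Below is a minimal sketch of how such an objective could be assembled; every module name, signature, and loss weight here is a hypothetical placeholder inferred from the abstract, not the authors' implementation.

```python
# Hypothetical sketch of the composite objective described in the abstract:
# diffusion reconstruction loss + style loss + pitch loss. All names and
# weights are assumptions; the paper's actual formulation may differ.
import torch
import torch.nn.functional as F

def laps_style_pitch_loss(denoiser, style_encoder, pitch_extractor,
                          mel_gt, mel_noisy, t, cond,
                          w_style=0.1, w_pitch=0.1):
    # 1) Denoising objective: reconstruct the clean mel-spectrogram from the
    #    noisy input, conditioned on lyric/music embeddings (the priors).
    mel_pred = denoiser(mel_noisy, t, cond)
    l_diff = F.l1_loss(mel_pred, mel_gt)

    # 2) Style loss: match style-encoder embeddings of the prediction and
    #    the ground truth, steering the model toward the target vocal style.
    with torch.no_grad():
        style_gt = style_encoder(mel_gt)
    l_style = F.mse_loss(style_encoder(mel_pred), style_gt)

    # 3) Pitch loss: match F0 contours from a pitch extraction model to
    #    capture genre-specific pitch inflections.
    l_pitch = F.l1_loss(pitch_extractor(mel_pred), pitch_extractor(mel_gt))

    return l_diff + w_style * l_style + w_pitch * l_pitch
```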
Related papers
- YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance
Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment. We propose a melody-driven SVS framework capable of synthesizing arbitrary lyrics following any reference melody. Our method builds on a Diffusion Transformer (DiT) architecture, enhanced with a dedicated melody extraction module.
arXiv Detail & Related papers (2025-12-04T13:25:33Z)
- Generative Multi-modal Feedback for Singing Voice Synthesis Evaluation
We propose a generative feedback framework that provides multi-dimensional language and audio feedback for singing voice synthesis assessment. Our approach leverages an audio-language model to generate text and audio critiques covering aspects such as melody, content, and auditory quality. The framework produces musically accurate and interpretable evaluations suitable for guiding generative model improvement.
arXiv Detail & Related papers (2025-12-02T08:32:09Z)
- SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture
SmoothSinger is a conditional diffusion model designed to synthesize high-quality, natural singing voices. It refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines. Experiments on the Opencpop dataset, a large-scale Chinese singing corpus, demonstrate that SmoothSinger achieves state-of-the-art results.
arXiv Detail & Related papers (2025-06-26T17:07:45Z)
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation. Experiments show that our model achieves favorable controllability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- On the Semantic Latent Space of Diffusion-Based Text-to-Speech Models
We explore the latent space of frozen TTS models, which is composed of the latent bottleneck activations of the DDM's denoiser.
We identify that this space contains rich semantic information, and outline several novel methods for finding semantic directions within it, both supervised and unsupervised.
We demonstrate how these enable off-the-shelf audio editing, without any further training, architectural changes or data requirements.
arXiv Detail & Related papers (2024-02-19T16:22:21Z)
- StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis
The expressive quality of synthesized speech for audiobooks is limited by a generalized model architecture and an unbalanced style distribution.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z)
- StyleSinger: Style Transfer for Out-of-Domain Singing Voice Synthesis
Style transfer for out-of-domain singing voice synthesis (SVS) focuses on generating high-quality singing voices with unseen styles. StyleSinger is the first singing voice synthesis model for zero-shot style transfer of out-of-domain reference singing voice samples. Our evaluations in zero-shot style transfer show that StyleSinger outperforms baseline models in both audio quality and similarity to the reference singing voice samples.
arXiv Detail & Related papers (2023-12-17T15:26:16Z)
- Enhancing the vocal range of single-speaker singing voice synthesis with melody-unsupervised pre-training
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z)
- Towards Improving the Expressiveness of Singing Voice Synthesis with BERT Derived Semantic Information
This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses semantic embeddings derived from Bidirectional Encoder Representations from Transformers (BERT).
The proposed SVS system produces higher-quality singing voices, outperforming VISinger.
arXiv Detail & Related papers (2023-08-31T16:12:01Z)
- Karaoker: Alignment-free singing voice synthesis with speech training data
Karaoker is a multispeaker Tacotron-based model conditioned on voice characteristic features.
The model is jointly conditioned with a single deep convolutional encoder on continuous data.
We extend the text-to-speech training objective with feature reconstruction, classification and speaker identification tasks.
arXiv Detail & Related papers (2022-04-08T15:33:59Z)
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
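As background for the DiffSinger entry above: the "parameterized Markov chain" it describes is the standard DDPM reverse process, in which Gaussian noise is iteratively refined into a mel-spectrogram under music-score conditioning. The loop below is a generic, illustrative DDPM sampler, not DiffSinger's actual code; the denoiser interface and noise schedule are assumptions.

```python
# Generic DDPM-style reverse sampling loop (illustrative only): noise is
# iteratively converted into a mel-spectrogram conditioned on the music score.
import torch

@torch.no_grad()
def sample_mel(denoiser, score_cond, shape, betas):
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(shape)  # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        # Predict the noise component present in x at step t.
        eps = denoiser(x, torch.tensor([t]), score_cond)
        # DDPM posterior mean: remove the predicted noise, then rescale.
        mean = (x - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) \
               / torch.sqrt(alphas[t])
        # Add fresh noise at every step except the last (t == 0).
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x  # the sampled mel-spectrogram
```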