DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
- URL: http://arxiv.org/abs/2510.09016v1
- Date: Fri, 10 Oct 2025 05:39:45 GMT
- Title: DiTSinger: Scaling Singing Voice Synthesis with Diffusion Transformer and Implicit Alignment
- Authors: Zongcai Du, Guilin Deng, Xiaofeng Guo, Xin Gao, Linke Li, Kaichang Cheng, Fubo Han, Siyu Yang, Peng Liu, Pan Zhong, Qiang Fu,
- Abstract summary: We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse lyrics, and melody-specific models are trained to synthesize over 500 hours of Chinese singing data.<n>We propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity.
- Score: 13.149605745750245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent progress in diffusion-based Singing Voice Synthesis (SVS) demonstrates strong expressiveness but remains limited by data scarcity and model scalability. We introduce a two-stage pipeline: a compact seed set of human-sung recordings is constructed by pairing fixed melodies with diverse LLM-generated lyrics, and melody-specific models are trained to synthesize over 500 hours of high-quality Chinese singing data. Building on this corpus, we propose DiTSinger, a Diffusion Transformer with RoPE and qk-norm, systematically scaled in depth, width, and resolution for enhanced fidelity. Furthermore, we design an implicit alignment mechanism that obviates phoneme-level duration labels by constraining phoneme-to-acoustic attention within character-level spans, thereby improving robustness under noisy or uncertain alignments. Extensive experiments validate that our approach enables scalable, alignment-free, and high-fidelity SVS.
Related papers
- YingMusic-Singer: Zero-shot Singing Voice Synthesis and Editing with Annotation-free Melody Guidance [16.462715982402884]
Singing Voice Synthesis (SVS) remains constrained in practical deployment due to its strong dependence on accurate phoneme-level alignment.<n>We propose a melody-driven SVS framework capable of synthesizing arbitrary lyrics following any reference melody.<n>Our method builds on a Diffusion Transformer (DiT) architecture, enhanced with a dedicated melody extraction module.
arXiv Detail & Related papers (2025-12-04T13:25:33Z) - SmoothSinger: A Conditional Diffusion Model for Singing Voice Synthesis with Multi-Resolution Architecture [3.7937714754535503]
SmoothSinger is a conditional diffusion model designed to synthesize high quality and natural singing voices.<n>It refines low-quality synthesized audio directly in a unified framework, mitigating the degradation associated with two-stage pipelines.<n> Experiments on the Opencpop dataset, a large-scale Chinese singing corpus, demonstrate that SmoothSinger achieves state-of-the-art results.
arXiv Detail & Related papers (2025-06-26T17:07:45Z) - MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis [56.25862714128288]
This paper introduces textitMegaTTS 3, a zero-shot text-to-speech (TTS) system featuring an innovative sparse alignment algorithm.<n>Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space.<n>Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity.
arXiv Detail & Related papers (2025-02-26T08:22:00Z) - MakeSinger: A Semi-Supervised Training Method for Data-Efficient Singing Voice Synthesis via Classifier-free Diffusion Guidance [14.22941848955693]
MakeSinger is a semi-supervised training method for singing voice synthesis.
Our novel dual guiding mechanism gives text and pitch guidance on the reverse diffusion step.
We demonstrate that by adding Text-to-Speech (TTS) data in training, the model can synthesize the singing voices of TTS speakers even without their singing voices.
arXiv Detail & Related papers (2024-06-10T01:47:52Z) - Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language.<n>We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.<n>Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and
Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN.
We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of the single-speaker.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - Towards Improving the Expressiveness of Singing Voice Synthesis with
BERT Derived Semantic Information [51.02264447897833]
This paper presents an end-to-end high-quality singing voice synthesis (SVS) system that uses bidirectional encoder representation from Transformers (BERT) derived semantic embeddings.
The proposed SVS system can produce singing voice with higher-quality outperforming VISinger.
arXiv Detail & Related papers (2023-08-31T16:12:01Z) - BigVGAN: A Universal Neural Vocoder with Large-Scale Training [49.16254684584935]
We present BigVGAN, a universal vocoder that generalizes well under various unseen conditions in zero-shot setting.
We introduce periodic nonlinearities and anti-aliased representation into the generator, which brings the desired inductive bias for waveform.
We train our GAN vocoder at the largest scale up to 112M parameters, which is unprecedented in the literature.
arXiv Detail & Related papers (2022-06-09T17:56:10Z) - WeSinger: Data-augmented Singing Voice Synthesis with Auxiliary Losses [13.178747366560534]
We develop a new multi-singer Chinese neural singing voice synthesis system named WeSinger.
quantitative and qualitative evaluation results demonstrate the effectiveness of WeSinger in terms of accuracy and naturalness.
arXiv Detail & Related papers (2022-03-21T06:42:44Z) - DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts the noise into mel-spectrogram conditioned on the music score.
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.