Related papers: Latent Flow Matching for Expressive Singing Voice Synthesis

Latent Flow Matching for Expressive Singing Voice Synthesis

URL: http://arxiv.org/abs/2601.00217v1
Date: Thu, 01 Jan 2026 05:41:41 GMT
Title: Latent Flow Matching for Expressive Singing Voice Synthesis
Authors: Minhyeok Yun, Yong-Hoon Choi,
Abstract summary: Conditional variational autoencoder (cVAE)-based singing voice synthesis provides efficient inference and strong audio quality.<n>We propose FM-Singer, which introduces conditional flow matching (CFM) in latent space.<n>Experiments on Korean and Chinese singing datasets demonstrate consistent improvements over strong baselines.
Score: 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Conditional variational autoencoder (cVAE)-based singing voice synthesis provides efficient inference and strong audio quality by learning a score-conditioned prior and a recording-conditioned posterior latent space. However, because synthesis relies on prior samples while training uses posterior latents inferred from real recordings, imperfect distribution matching can cause a prior-posterior mismatch that degrades fine-grained expressiveness such as vibrato and micro-prosody. We propose FM-Singer, which introduces conditional flow matching (CFM) in latent space to learn a continuous vector field transporting prior latents toward posterior latents along an optimal-transport-inspired path. At inference time, the learned latent flow refines a prior sample by solving an ordinary differential equation (ODE) before waveform generation, improving expressiveness while preserving the efficiency of parallel decoding. Experiments on Korean and Chinese singing datasets demonstrate consistent improvements over strong baselines, including lower mel-cepstral distortion and fundamental-frequency error and higher perceptual scores on the Korean dataset. Code, pretrained checkpoints, and audio demos are available at https://github.com/alsgur9368/FM-Singer

Related papers

On the Shape of Latent Variables in a Denoising VAE-MoG: A Posterior Sampling-Based Study [51.56484100374058]
We explore the latent space of a denoising variational autoencoder with a mixture-of-Gaussians prior (VAE-MoG)<n>To evaluate how well the model captures the underlying structure, we use Hamiltonian Monte Carlo (HMC) to draw posterior samples conditioned on clean inputs, and compare them to the encoder's outputs from noisy data.<n>Although the model reconstructs signals accurately, statistical comparisons reveal a clear mismatch in the latent space.
arXiv Detail & Related papers (2025-09-29T18:33:09Z)
WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching [1.6385815610837167]
WaveFM is a flow matching model for mel-spectrogram conditioned speech synthesis.<n>Our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders.
arXiv Detail & Related papers (2025-03-20T20:17:17Z)
DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis [12.310318928818546]
We introduce DMOSpeech, a distilled diffusion-based TTS model that achieves both faster inference and superior performance compared to its teacher model.<n>Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude.<n>This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization.
arXiv Detail & Related papers (2024-10-14T21:17:58Z)
SpecDiff-GAN: A Spectrally-Shaped Noise Diffusion GAN for Speech and Music Synthesis [0.0]
We introduce SpecDiff-GAN, a neural vocoder based on HiFi-GAN. We show the merits of our proposed model for speech and music synthesis on several datasets.
arXiv Detail & Related papers (2024-01-30T09:17:57Z)
DiffuSeq-v2: Bridging Discrete and Continuous Text Spaces for Accelerated Seq2Seq Diffusion Models [58.450152413700586]
We introduce a soft absorbing state that facilitates the diffusion model in learning to reconstruct discrete mutations based on the underlying Gaussian space. We employ state-of-the-art ODE solvers within the continuous space to expedite the sampling process. Our proposed method effectively accelerates the training convergence by 4x and generates samples of similar quality 800x faster.
arXiv Detail & Related papers (2023-10-09T15:29:10Z)
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality. To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches. Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram. It is processed in the time-frequency domain to keep the computational cost almost the same as the conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z)
PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Driven Adaptive Prior [103.00403682863427]
We propose PriorGrad to improve the efficiency of the conditional diffusion model. We show that PriorGrad achieves a faster convergence leading to data and parameter efficiency and improved quality.
arXiv Detail & Related papers (2021-06-11T14:04:03Z)
DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain which iteratively converts the noise into mel-spectrogram conditioned on the music score. The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work with a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.