MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows
- URL: http://arxiv.org/abs/2602.18104v1
- Date: Fri, 20 Feb 2026 09:48:23 GMT
- Title: MeanVoiceFlow: One-step Nonparallel Voice Conversion with Mean Flows
- Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo
- Abstract summary: MeanVoiceFlow is a one-step nonparallel VC model based on mean flows. MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models.
- Score: 42.55959060773461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In voice conversion (VC) applications, diffusion and flow-matching models have exhibited exceptional speech quality and speaker similarity performances. However, they are limited by slow conversion owing to their iterative inference. Consequently, we propose MeanVoiceFlow, a novel one-step nonparallel VC model based on mean flows, which can be trained from scratch without requiring pretraining or distillation. Unlike conventional flow matching that uses instantaneous velocity, mean flows employ average velocity to more accurately compute the time integral along the inference path in a single step. However, training the average velocity requires its derivative to compute the target velocity, which can cause instability. Therefore, we introduce a structural margin reconstruction loss as a zero-input constraint, which moderately regularizes the input-output behavior of the model without harmful statistical averaging. Furthermore, we propose conditional diffused-input training in which a mixture of noise and source data is used as input to the model during both training and inference. This enables the model to effectively leverage source information while maintaining consistency between training and inference. Experimental results validate the effectiveness of these techniques and demonstrate that MeanVoiceFlow achieves performance comparable to that of previous multi-step and distillation-based models, even when trained from scratch. Audio samples are available at https://www.kecl.ntt.co.jp/people/kaneko.takuhiro/projects/meanvoiceflow/.
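The abstract's contrast between instantaneous and average velocity can be illustrated with a toy example. The sketch below is not the paper's model: it assumes a hypothetical 1-D ODE dz/dt = v(t) with v(t) = 2t, for which the average velocity over [r, t] has the closed form u(r, t) = t + r. All function names are illustrative. It shows that a single step with the average velocity recovers the exact endpoint while a single Euler step with the instantaneous velocity does not, and it numerically checks the identity u(r, t) = v(t) - (t - r) * du/dt from the Mean Flows formulation, which is why training needs a derivative of the average velocity.

```python
# Toy illustration only: a 1-D ODE dz/dt = v(t) with v(t) = 2t.
# Every name here is hypothetical and not taken from the paper.


def instantaneous_velocity(t: float) -> float:
    """v(t) = 2t, the velocity at a single time point."""
    return 2.0 * t


def average_velocity(r: float, t: float) -> float:
    """u(r, t) = (1 / (t - r)) * integral of v(s) ds over [r, t] = t + r."""
    return t + r


def euler_one_step(z: float, r: float, t: float) -> float:
    """One Euler step from time r to time t using the velocity at r."""
    return z + (t - r) * instantaneous_velocity(r)


def mean_flow_one_step(z: float, r: float, t: float) -> float:
    """One step from time r to time t using the average velocity."""
    return z + (t - r) * average_velocity(r, t)


def identity_rhs(r: float, t: float, eps: float = 1e-6) -> float:
    """v(t) - (t - r) * du/dt, with du/dt from a central difference."""
    du_dt = (average_velocity(r, t + eps) - average_velocity(r, t - eps)) / (2 * eps)
    return instantaneous_velocity(t) - (t - r) * du_dt


z_start = 0.5
exact_end = z_start + 1.0  # the integral of 2t over [0, 1] equals 1
euler_end = euler_one_step(z_start, 0.0, 1.0)
mean_end = mean_flow_one_step(z_start, 0.0, 1.0)

print("exact:", exact_end)
print("one Euler step:", euler_end)
print("one average-velocity step:", mean_end)
print("identity check:", identity_rhs(0.2, 0.9), "vs", average_velocity(0.2, 0.9))
```

In a learned model the average velocity is a neural network that also depends on the current state, so the derivative term is not available in closed form and is typically obtained by automatic differentiation rather than the finite difference used here.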
Related papers
- FastFlow: Accelerating The Generative Flow Matching Models with Bandit Inference [10.34801095627052]
Flow-matching models deliver state-of-the-art fidelity in image and video generation, but the inherent sequential denoising process renders them slower. We propose FastFlow, a plug-and-play adaptive inference framework that accelerates generation in flow matching models. Experiments demonstrate a speedup of over 2.6x while maintaining high-quality outputs.
arXiv Detail & Related papers (2026-02-11T18:21:11Z)
- Compose Yourself: Average-Velocity Flow Matching for One-Step Speech Enhancement [46.23750572308065]
COSE is a one-step FM framework tailored for speech enhancement. We introduce a velocity composition identity to compute average velocity efficiently. Experiments show that COSE delivers up to 5x faster sampling and reduces training cost by 40%.
arXiv Detail & Related papers (2025-09-19T13:07:39Z)
- MeanFlowSE: one-step generative speech enhancement via conditional mean flow [13.437825847370442]
MeanFlowSE is a conditional generative model that learns the average velocity over finite intervals along a trajectory. On VoiceBank-DEMAND, the single-step model achieves strong intelligibility, fidelity, and perceptual quality with substantially lower computational cost than multistep baselines.
arXiv Detail & Related papers (2025-09-18T11:24:47Z)
- Contrastive Flow Matching [61.60002028726023]
We introduce Contrastive Flow Matching, an extension to the flow matching objective that explicitly enforces uniqueness across all conditional flows. Our approach adds a contrastive objective that maximizes dissimilarities between predicted flows from arbitrary sample pairs. We find that training models with Contrastive Flow Matching (1) improves training speed by a factor of up to 9x, (2) requires up to 5x fewer de-noising steps and (3) lowers FID by up to 8.9 compared to training the same models with flow matching.
arXiv Detail & Related papers (2025-06-05T17:59:58Z)
- AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion [23.250409921931492]
Rectified flow enhances inference speed by learning straight-line ordinary differential equation paths. This approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. We propose AudioTurbo, which learns first-order ODE paths from deterministic noise sample pairs generated by a pre-trained TTA model.
arXiv Detail & Related papers (2025-05-28T08:33:58Z)
- Mean Flows for One-step Generative Modeling [64.4997821467102]
We propose a principled and effective framework for one-step generative modeling. A well-defined identity between average and instantaneous velocities is derived and used to guide neural network training. Our method, termed the MeanFlow model, is self-contained and requires no pre-training, distillation, or curriculum learning.
arXiv Detail & Related papers (2025-05-19T17:59:42Z)
- Fast constrained sampling in pre-trained diffusion models [80.99262780028015]
We propose an algorithm that enables fast, high-quality generation under arbitrary constraints. Our approach produces results that rival or surpass the state-of-the-art training-free inference methods.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
- Improving Consistency Models with Generator-Augmented Flows [16.049476783301724]
Consistency models imitate the multi-step sampling of score-based diffusion in a single forward pass of a neural network. They can be learned in two ways: consistency distillation and consistency training. We propose a novel flow that transports noisy data towards their corresponding outputs derived from a consistency model.
arXiv Detail & Related papers (2024-06-13T20:22:38Z)
- Language Rectified Flow: Advancing Diffusion Language Generation with Probabilistic Flows [53.31856123113228]
This paper proposes Language Rectified Flow.
Our method is based on the reformulation of the standard probabilistic flow models.
Experiments and ablation studies demonstrate that our method can be general, effective, and beneficial for many NLP tasks.
arXiv Detail & Related papers (2024-03-25T17:58:22Z)
- Guided Flows for Generative Modeling and Decision Making [55.42634941614435]
We show that Guided Flows significantly improves the sample quality in conditional image generation and zero-shot text-to-speech synthesis.
Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, with a speedup compared to diffusion models.
arXiv Detail & Related papers (2023-11-22T15:07:59Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z)