DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis
- URL: http://arxiv.org/abs/2602.09041v1
- Date: Tue, 03 Feb 2026 03:57:12 GMT
- Title: DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis
- Authors: Bin Lin, Peng Yang, Chao Yan, Xiaochen Liu, Wei Wang, Boyong Wu, Pengfei Tan, Xuerui Yang,
- Abstract summary: Flow-matching models have enabled high-quality text-to-speech synthesis, but their iterative sampling process during inference incurs substantial computational cost. We introduce DSFlow, a modular distillation framework for few-step and one-step synthesis.
- Score: 11.529725810139281
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flow-matching models have enabled high-quality text-to-speech synthesis, but their iterative sampling process during inference incurs substantial computational cost. Although distillation is widely used to reduce the number of inference steps, existing methods often suffer from process variance due to endpoint error accumulation. Moreover, directly reusing continuous-time architectures for discrete, fixed-step generation introduces structural parameter inefficiencies. To address these challenges, we introduce DSFlow, a modular distillation framework for few-step and one-step synthesis. DSFlow reformulates generation as a discrete prediction task and explicitly adapts the student model to the target inference regime. It improves training stability through a dual supervision strategy that combines endpoint matching with deterministic mean-velocity alignment, enforcing consistent generation trajectories across inference steps. In addition, DSFlow improves parameter efficiency by replacing continuous-time timestep conditioning with lightweight step-aware tokens, aligning model capacity with the significantly reduced timestep space of the discrete task. Extensive experiments across diverse flow-based text-to-speech architectures demonstrate that DSFlow consistently outperforms standard distillation approaches, achieving strong few-step and one-step synthesis quality while reducing model parameters and inference cost.
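The dual supervision strategy described in the abstract can be sketched as a toy objective that combines an endpoint-matching term with a mean-velocity alignment term. The linear (rectified-flow-style) path, the loss weights, and the closed-form mean velocity below are assumptions for illustration, not the paper's exact formulation:

```python
def dual_supervision_loss(u_pred, x_t, x1, t, w_end=1.0, w_vel=1.0):
    """Toy dual supervision objective (assumed form, not DSFlow's exact
    losses): endpoint matching plus mean-velocity alignment for a
    one-step student that predicts a velocity u at state (x_t, t)."""
    # One-step endpoint estimate: integrate the predicted velocity from t to 1.
    x1_hat = x_t + (1.0 - t) * u_pred
    endpoint_loss = (x1_hat - x1) ** 2
    # Deterministic mean velocity over [t, 1] along the current path.
    v_mean = (x1 - x_t) / (1.0 - t)
    velocity_loss = (u_pred - v_mean) ** 2
    return w_end * endpoint_loss + w_vel * velocity_loss

# Toy check on a linear path x_t = (1 - t) * x0 + t * x1, where the
# mean velocity over [t, 1] is exactly x1 - x0.
x0, x1, t = -0.5, 2.0, 0.3
x_t = (1 - t) * x0 + t * x1
perfect_u = x1 - x0  # the ideal prediction on this path
assert dual_supervision_loss(perfect_u, x_t, x1, t) < 1e-12
```

On this simplified path the two terms agree at the optimum, which is the point of the strategy: a student whose predicted velocity is consistent with the mean velocity lands on the correct endpoint in a single step.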
Related papers
- FlowConsist: Make Your Flow Consistent with Real Trajectory [99.22869983378062]
We argue that current fast-flow training paradigms suffer from two fundamental issues. Conditional velocities constructed from randomly paired noise-data samples introduce systematic trajectory drift. We propose FlowConsist, a training framework designed to enforce trajectory consistency in fast flows.
arXiv Detail & Related papers (2026-02-06T03:24:23Z)
- Temporal Pair Consistency for Variance-Reduced Flow Matching [13.328987133593154]
Temporal Pair Consistency (TPC) is a lightweight variance-reduction principle that couples velocity predictions at paired timesteps along the same probability path. Instantiated within flow matching, TPC improves sample quality and efficiency across CIFAR-10 and ImageNet at multiple resolutions.
arXiv Detail & Related papers (2026-02-04T00:05:21Z)
- Edit-Based Flow Matching for Temporal Point Processes [51.33476564706644]
Temporal point processes (TPPs) are a fundamental tool for modeling event sequences in continuous time. Recent non-autoregressive, diffusion-style models mitigate these issues by jointly interpolating between noise and data. We introduce an Edit Flow process for TPPs that transports noise to data via insert, delete, and substitute edit operations.
arXiv Detail & Related papers (2025-10-07T15:44:12Z)
- Shortcut Flow Matching for Speech Enhancement: Step-Invariant flows via single stage training [20.071957855504206]
Diffusion-based generative models have achieved state-of-the-art performance for perceptual quality in speech enhancement. We introduce Shortcut Flow Matching for Speech Enhancement (SFMSE), a novel approach that trains a single, step-invariant model. Our results demonstrate that a single-step SFMSE inference achieves a real-time factor (RTF) of 0.013 on a consumer GPU.
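For context, the real-time factor (RTF) quoted above is simply wall-clock processing time divided by the duration of the generated audio; values below 1 mean faster-than-real-time synthesis. A minimal sketch (the numbers are illustrative, not from the paper):

```python
def real_time_factor(processing_seconds, audio_seconds):
    """RTF = wall-clock processing time / duration of generated audio.
    RTF < 1 means the system runs faster than real time."""
    return processing_seconds / audio_seconds

# Illustrative: taking 0.13 s to synthesize 10 s of audio gives RTF = 0.013.
assert abs(real_time_factor(0.13, 10.0) - 0.013) < 1e-12
```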
arXiv Detail & Related papers (2025-09-25T20:09:05Z)
- MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation [12.665130073406651]
A key challenge in synthesizing audio from silent videos is the inherent trade-off between synthesis quality and inference efficiency. We introduce a MeanFlow-accelerated model that characterizes flow fields using average velocity. We demonstrate that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality.
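The average-velocity idea behind MeanFlow can be illustrated numerically: a single jump using the time-averaged velocity over [r, t] reproduces what many small Euler steps of the instantaneous velocity achieve. The toy velocity field below depends only on time (an assumption that makes the average exact), so this is a sketch of the principle, not the paper's model:

```python
def v(z, s):
    # Toy time-dependent velocity field (illustrative, not from the paper).
    return 1.0 + 2.0 * s

def average_velocity(z, r, t, n=10_000):
    # u(z, r, t) = 1/(t - r) * integral_r^t v(z, s) ds, via the midpoint rule.
    h = (t - r) / n
    total = sum(v(z, r + (i + 0.5) * h) for i in range(n))
    return total * h / (t - r)

z0, r, t = 0.0, 0.0, 1.0
u = average_velocity(z0, r, t)
one_step = z0 + (t - r) * u          # single jump using the average velocity
# Reference: fine-grained Euler integration of the instantaneous velocity.
z, steps = z0, 10_000
for i in range(steps):
    s = r + (t - r) * i / steps
    z += (t - r) / steps * v(z, s)
assert abs(one_step - z) < 1e-3
```

Predicting the average velocity directly is what lets such models replace an iterative ODE solve with a single evaluation.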
arXiv Detail & Related papers (2025-09-08T07:15:21Z)
- SwiftVideo: A Unified Framework for Few-Step Video Generation through Trajectory-Distribution Alignment [76.60024640625478]
Diffusion-based or flow-based models have achieved significant progress in video synthesis but require multiple iterative sampling steps. We propose a unified and stable distillation framework that combines the advantages of trajectory-preserving and distribution-matching strategies. Our method maintains high-quality video generation while substantially reducing the number of inference steps.
arXiv Detail & Related papers (2025-08-08T07:26:34Z)
- ReDi: Rectified Discrete Flow [17.72385262464804]
We analyze the factorization approximation error using Conditional Total Correlation (TC). We propose Rectified Discrete Flow (ReDi), a novel iterative method that reduces the underlying factorization error. Empirically, ReDi significantly reduces Conditional TC and enables few-step generation.
arXiv Detail & Related papers (2025-07-21T01:18:44Z)
- Lightweight Task-Oriented Semantic Communication Empowered by Large-Scale AI Models [66.57755931421285]
Large-scale artificial intelligence (LAI) models pose significant challenges for real-time communication scenarios. This paper proposes utilizing knowledge distillation (KD) techniques to extract and condense knowledge from LAI models. We propose a fast distillation method featuring a pre-stored compression mechanism that eliminates the need for repetitive inference.
arXiv Detail & Related papers (2025-06-16T08:42:16Z)
- Nesterov Method for Asynchronous Pipeline Parallel Optimization [59.79227116582264]
We introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in Pipeline Parallelism. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients.
arXiv Detail & Related papers (2025-05-02T08:23:29Z)
- Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening [10.23957420290553]
We propose the Optimal Transport Flow Matching framework to achieve one-step, high-quality pansharpening. The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints.
arXiv Detail & Related papers (2025-03-19T08:10:49Z)
- Deep Equilibrium Optical Flow Estimation [80.80992684796566]
Recent state-of-the-art (SOTA) optical flow models use finite-step recurrent update operations to emulate traditional algorithms.
These RNNs impose large computation and memory overheads, and are not directly trained to model such stable estimation.
We propose deep equilibrium (DEQ) flow estimators, an approach that directly solves for the flow as the infinite-level fixed point of an implicit layer.
arXiv Detail & Related papers (2022-04-18T17:53:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality or accuracy of the listed information and is not responsible for any consequences arising from its use.