RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
- URL: http://arxiv.org/abs/2506.16741v1
- Date: Fri, 20 Jun 2025 04:19:29 GMT
- Title: RapFlow-TTS: Rapid and High-Fidelity Text-to-Speech with Improved Consistency Flow Matching
- Authors: Hyun Joon Park, Jeongmin Liu, Jin Sob Kim, Jeong Yeol Yang, Sung Won Han, Eunwoo Song
- Abstract summary: RapFlow-TTS is a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. We show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold reductions in synthesis steps compared with conventional FM- and score-based approaches, respectively.
- Score: 9.197146332563461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce RapFlow-TTS, a rapid and high-fidelity TTS acoustic model that leverages velocity consistency constraints in flow matching (FM) training. Although ordinary differential equation (ODE)-based TTS generation achieves natural-quality speech, it typically requires a large number of generation steps, resulting in a trade-off between quality and inference speed. To address this challenge, RapFlow-TTS enforces consistency in the velocity field along the FM-straightened ODE trajectory, enabling consistent synthetic quality with fewer generation steps. Additionally, we introduce techniques such as time interval scheduling and adversarial learning to further enhance the quality of the few-step synthesis. Experimental results show that RapFlow-TTS achieves high-fidelity speech synthesis with 5- and 10-fold reductions in synthesis steps compared with conventional FM- and score-based approaches, respectively.
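To make the core idea concrete, below is a minimal PyTorch sketch of a flow-matching loss with an added velocity-consistency term, in the spirit of the abstract; the toy network, the straight-path construction, and the unit loss weighting are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def consistency_fm_loss(v, x0, x1, delta=0.01):
    """Flow-matching regression plus a velocity-consistency penalty (sketch).
    v(x, t) is a velocity network; x0 is noise, x1 is data (e.g. mel frames).
    The straight path x_t = (1 - t) x0 + t x1 has constant velocity x1 - x0."""
    b = x0.shape[0]
    t = torch.rand(b, 1) * (1.0 - delta)          # t in [0, 1 - delta]
    xt = (1 - t) * x0 + t * x1
    with torch.no_grad():                          # step along the trajectory
        xt_next = xt + delta * v(xt, t)
    fm = ((v(xt, t) - (x1 - x0)) ** 2).mean()      # standard FM target
    cons = ((v(xt, t) - v(xt_next, t + delta)) ** 2).mean()  # velocities agree nearby
    return fm + cons

# Toy usage on 2-D data:
net = nn.Sequential(nn.Linear(3, 64), nn.SiLU(), nn.Linear(64, 2))
v = lambda x, t: net(torch.cat([x, t], dim=-1))
loss = consistency_fm_loss(v, torch.randn(8, 2), torch.randn(8, 2))
loss.backward()
```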
Related papers
- DSFlow: Dual Supervision and Step-Aware Architecture for One-Step Flow Matching Speech Synthesis [11.529725810139281]
Flow-matching models have enabled high-quality text-to-speech synthesis, but their iterative sampling process during inference incurs substantial computational cost. We introduce DSFlow, a modular distillation framework for few-step and one-step synthesis.
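The cost the abstract refers to comes from integrating the learned ODE over many steps; a generic Euler sampler (not DSFlow's actual distilled architecture) makes the step count explicit, with steps=1 corresponding to one-step synthesis.

```python
import torch

@torch.no_grad()
def sample_fm(v, x0, steps):
    """Euler integration of dx/dt = v(x, t) from t=0 (noise) to t=1 (data).
    Each step costs one network call, so `steps` directly sets inference cost."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0], 1), i * dt)
        x = x + dt * v(x, t)
    return x
```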
arXiv Detail & Related papers (2026-02-03T03:57:12Z)
- Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech [2.5964779217812057]
Flamed-TTS is a novel zero-shot Text-to-Speech framework that emphasizes low computational cost, low latency, and high speech fidelity alongside rich temporal diversity. We show that Flamed-TTS surpasses state-of-the-art models in terms of intelligibility, naturalness, speaker similarity, acoustic characteristics preservation, and dynamic pace.
arXiv Detail & Related papers (2025-10-03T09:36:55Z)
- Shortcut Flow Matching for Speech Enhancement: Step-Invariant flows via single stage training [20.071957855504206]
Diffusion-based generative models have achieved state-of-the-art performance for perceptual quality in speech enhancement. We introduce Shortcut Flow Matching for Speech Enhancement (SFMSE), a novel approach that trains a single, step-invariant model. Our results demonstrate that a single-step SFMSE inference achieves a real-time factor (RTF) of 0.013 on a consumer GPU.
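An RTF of 0.013 means one second of audio takes about 13 ms to generate, roughly 77x faster than real time. The sketch below shows the step-invariance ingredient as it is usually realized in shortcut models: the velocity network is conditioned on the intended step size, so one set of weights serves both one-step and multi-step inference; the layer sizes and input shapes are illustrative assumptions, not SFMSE's architecture.

```python
import torch
import torch.nn as nn

class StepAwareVelocity(nn.Module):
    """Velocity field conditioned on time t and step size d (sketch).
    Conditioning on d lets one model be queried with a large one-shot
    step or with small refinement steps, without retraining."""
    def __init__(self, dim=80, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 2, hidden), nn.SiLU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x, t, d):
        # x: (B, dim) noisy features; t, d: (B, 1) scalars per example.
        return self.net(torch.cat([x, t, d], dim=-1))
```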
arXiv Detail & Related papers (2025-09-25T20:09:05Z)
- MeanFlow-Accelerated Multimodal Video-to-Audio Synthesis via One-Step Generation [12.665130073406651]
A key challenge in synthesizing audio from silent videos is the inherent trade-off between synthesis quality and inference efficiency. We introduce a MeanFlow-accelerated model that characterizes flow fields using average velocity. We demonstrate that incorporating MeanFlow into the network significantly improves inference speed without compromising perceptual quality.
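The "average velocity" the abstract mentions can be written explicitly; the formulation below follows the general MeanFlow literature (displacement over an interval divided by its length, plus the identity relating it to the instantaneous velocity), not necessarily this paper's exact notation:

\[
u(z_t, r, t) = \frac{1}{t - r} \int_r^t v(z_s, s)\, ds,
\qquad
v(z_t, t) = u(z_t, r, t) + (t - r)\,\frac{d}{dt}\, u(z_t, r, t).
\]

One-step generation then reads the whole displacement off at once, \( z_0 = z_1 - u(z_1, 0, 1) \), with \( z_1 \) the noise sample.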
arXiv Detail & Related papers (2025-09-08T07:15:21Z)
- Zero-Shot Streaming Text to Speech Synthesis with Transducer and Auto-Regressive Modeling [76.23539797803681]
Existing methods primarily use a look-ahead mechanism, relying on future text to achieve natural streaming speech synthesis. We propose LE, a streaming framework for generating high-quality speech frame-by-frame. Experimental results suggest that LE outperforms current streaming TTS methods and achieves performance comparable to sentence-level TTS systems.
arXiv Detail & Related papers (2025-05-26T08:25:01Z)
- FELLE: Autoregressive Speech Synthesis with Token-Wise Coarse-to-Fine Flow Matching [51.32059240975148]
FELLE is an autoregressive model that integrates language modeling with token-wise flow matching. For each continuous-valued token, FELLE modifies the general prior distribution in flow matching by incorporating information from the previous step. FELLE generates continuous-valued tokens hierarchically, conditioned on the language model's output.
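A minimal way to read "modifies the general prior distribution by incorporating information from the previous step" is to start the flow near the preceding token instead of at pure Gaussian noise; the sketch below is that loose reading, with the sigma value and shapes as placeholder assumptions, not FELLE's actual scheme.

```python
import torch

def informed_prior(prev_token, sigma=0.5):
    """Sample the flow-matching starting point around the previous
    continuous-valued token rather than from N(0, I) (illustrative)."""
    return prev_token + sigma * torch.randn_like(prev_token)
```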
arXiv Detail & Related papers (2025-02-16T13:54:32Z)
- Consistency Flow Matching: Defining Straight Flows with Velocity Consistency [97.28511135503176]
We introduce Consistency Flow Matching (Consistency-FM), a novel FM method that explicitly enforces self-consistency in the velocity field.
Preliminary experiments demonstrate that our Consistency-FM significantly improves training efficiency by converging 4.4x faster than consistency models.
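One way to see the self-consistency constraint: if the velocity field defines straight flows, a single Euler jump from any point on a trajectory should land on the same endpoint. The helper below (a sketch, not the paper's code) computes that jump; Consistency-FM-style training penalizes its variation across t.

```python
def predicted_endpoint(v, x_t, t):
    """Single Euler jump to t=1; for a perfectly straight flow this hits
    the data point regardless of which t on the trajectory we start from."""
    return x_t + (1.0 - t) * v(x_t, t)
```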
arXiv Detail & Related papers (2024-07-02T16:15:37Z)
- CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models [30.68516200579894]
We introduce CM-TTS, a novel architecture grounded in consistency models (CMs).
CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies.
We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations.
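The "weighted samplers" in the title suggest drawing training timesteps non-uniformly; a generic version (the weighting scheme here is an assumption, not CM-TTS's actual choice) looks like this:

```python
import torch

def weighted_timesteps(batch_size, weights):
    """Draw training timestep indices with probability proportional to
    `weights` instead of uniformly; emphasizes the steps that matter more."""
    probs = weights / weights.sum()
    return torch.multinomial(probs, batch_size, replacement=True)

# e.g. linearly up-weight the later (noisier) steps of an 18-step schedule:
idx = weighted_timesteps(32, torch.linspace(1.0, 2.0, 18))
```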
arXiv Detail & Related papers (2024-03-31T05:38:08Z)
- Consistency Trajectory Models: Learning Probability Flow ODE Trajectory of Diffusion [56.38386580040991]
Consistency Trajectory Model (CTM) is a generalization of Consistency Models (CM).
CTM enables the efficient combination of adversarial training and denoising score matching loss to enhance performance.
Unlike CM, CTM's access to the score function can streamline the adoption of established controllable/conditional generation methods.
arXiv Detail & Related papers (2023-10-01T05:07:17Z)
- ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation [21.335983674309475]
Diffusion models suffer from slow inference due to an excessive number of queries to the underlying denoising network per generation.
We introduce ConsistencyTTA, a framework requiring only a single non-autoregressive network query.
We achieve this by proposing a "CFG-aware latent consistency model," which adapts consistency generation to a latent space.
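For context, classifier-free guidance (CFG) normally requires two network queries per step in the diffusion teacher; a "CFG-aware" student can instead take the guidance weight w as an input and reproduce the guided output in one query. The formula below is standard CFG; folding w into the student is the paper's contribution and is not shown here.

```python
def cfg_prediction(pred_cond, pred_uncond, w):
    """Standard classifier-free guidance: extrapolate from the
    unconditional prediction toward the conditional one by weight w."""
    return pred_uncond + w * (pred_cond - pred_uncond)
```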
arXiv Detail & Related papers (2023-09-19T16:36:33Z)
- Matcha-TTS: A fast TTS architecture with conditional flow matching [13.973500393046235]
We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling.
It is trained using optimal-transport conditional flow matching (OT-CFM).
This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching.
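OT-CFM as it is usually written regresses the network onto the constant velocity of a nearly straight noise-to-data path; a minimal sketch of the training-pair construction, with sigma_min as the customary small constant:

```python
import torch

def ot_cfm_batch(x0, x1, sigma_min=1e-4):
    """Sample a point on the OT conditional path and its velocity target.
    x0 ~ N(0, I) noise, x1 data; the path is almost a straight line."""
    t = torch.rand(x0.shape[0], 1)
    xt = (1 - (1 - sigma_min) * t) * x0 + t * x1
    target = x1 - (1 - sigma_min) * x0   # constant along the path
    return xt, t, target
```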
arXiv Detail & Related papers (2023-09-06T17:59:57Z)
- Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously achieve fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
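The patch-based processing the abstract describes amounts to reshaping a long sequence into short patches so each layer operates over far fewer positions; a minimal sketch (the patch size and the tail handling are illustrative choices):

```python
import torch

def patchify(x, patch_size):
    """(B, T) -> (B, T // patch_size, patch_size); drops any ragged tail.
    Quadratic-cost layers then attend over T // patch_size positions."""
    b, t = x.shape
    t = t - t % patch_size
    return x[:, :t].reshape(b, t // patch_size, patch_size)
```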
arXiv Detail & Related papers (2023-06-09T07:02:43Z)
- Time-triggered Federated Learning over Wireless Networks [48.389824560183776]
We present a time-triggered FL algorithm (TT-Fed) over wireless networks.
Our proposed TT-Fed algorithm improves the converged test accuracy by up to 12.5% and 5% over baseline FL schemes.
arXiv Detail & Related papers (2022-04-26T16:37:29Z)
- FastDiff: A Fast Conditional Diffusion Model for High-Quality Speech Synthesis [90.3069686272524]
This paper proposes FastDiff, a fast conditional diffusion model for high-quality speech synthesis.
FastDiff employs a stack of time-aware location-variable convolutions of diverse receptive field patterns to efficiently model long-term time dependencies.
Based on FastDiff, we design an end-to-end text-to-speech synthesizer, FastDiff-TTS, which generates high-fidelity speech waveforms.
arXiv Detail & Related papers (2022-04-21T07:49:09Z)
- Differentiable Duration Modeling for End-to-End Text-to-Speech [6.571447892202893]
Parallel text-to-speech (TTS) models have recently enabled fast and highly natural speech synthesis.
We propose a differentiable duration method for learning monotonic alignments between input and output sequences.
Our model learns to perform high-fidelity synthesis through a combination of adversarial training and matching the total ground-truth duration.
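The "matching the total ground-truth duration" part has a particularly simple form: supervise only the sum of predicted durations, which stays differentiable without hard per-token alignment. A sketch, with names and the mean reduction as assumptions:

```python
import torch

def total_duration_loss(pred_durations, target_total_frames):
    """pred_durations: (B, N) soft per-token durations in frames;
    target_total_frames: (B,) ground-truth utterance lengths in frames.
    Only the total is matched, so per-token values stay soft."""
    return (pred_durations.sum(dim=-1) - target_total_frames).abs().mean()
```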
arXiv Detail & Related papers (2022-03-21T15:14:44Z)
- Diff-TTS: A Denoising Diffusion Model for Text-to-Speech [14.231478930274058]
We propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis.
Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps.
We verify that Diff-TTS generates speech 28 times faster than real time on a single NVIDIA 2080Ti GPU.
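Each of the diffusion time steps the abstract mentions is one reverse denoising update; the generic DDPM step below (standard schedule quantities, not Diff-TTS's code, with the text conditioning omitted) shows why the step count dominates inference time. The reported speed corresponds to an RTF of about 1/28 ≈ 0.036.

```python
import torch

@torch.no_grad()
def ddpm_reverse_step(eps_model, x_t, t, alpha_t, alpha_bar_t, sigma_t):
    """One reverse step x_t -> x_{t-1} of a DDPM-style sampler.
    eps_model predicts the noise; each step costs one network call."""
    eps = eps_model(x_t, t)
    mean = (x_t - (1 - alpha_t) / (1 - alpha_bar_t) ** 0.5 * eps) / alpha_t ** 0.5
    return mean + sigma_t * torch.randn_like(x_t)
```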
arXiv Detail & Related papers (2021-04-03T13:53:19Z)