SlimSpeech: Lightweight and Efficient Text-to-Speech with Slim Rectified Flow
- URL: http://arxiv.org/abs/2504.07776v1
- Date: Thu, 10 Apr 2025 14:15:18 GMT
- Authors: Kaidi Wang, Wenhao Guan, Shenghui Lu, Jianglong Yao, Lin Li, Qingyang Hong
- Abstract summary: We introduce SlimSpeech, a lightweight and efficient speech synthesis system based on rectified flow. Experimental results demonstrate that our proposed method achieves comparable performance to larger models through one-step sampling.
- Score: 12.634298353225455
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, flow-matching-based speech synthesis has significantly enhanced the quality of synthesized speech while reducing the number of inference steps. In this paper, we introduce SlimSpeech, a lightweight and efficient speech synthesis system based on rectified flow. We build upon an existing speech synthesis method that uses the rectified flow model, modifying its structure to reduce parameters and serve as a teacher model. By refining the reflow operation, we directly derive a smaller model with a straighter sampling trajectory from the larger model, while utilizing distillation techniques to further enhance the model performance. Experimental results demonstrate that our proposed method, with significantly reduced model parameters, achieves comparable performance to larger models through one-step sampling.
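The link between a straight sampling trajectory and one-step sampling can be illustrated with a toy example. The sketch below is not the authors' implementation: the function names, the 1-D values, and both velocity fields are hypothetical stand-ins for a learned model. It integrates dx/dt = v(x, t) from noise (t = 0) to data (t = 1) with fixed-step Euler and compares a constant-velocity (straight) field against a time-varying one.

```python
def euler_sample(velocity, x0, num_steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = x0
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity(x, t)
    return x


# Hypothetical 1-D pair: a noise sample x0 and its target data point x1.
x0, x1 = -1.0, 2.0


def straight_velocity(x, t):
    # Ideal rectified-flow case: the velocity is the constant x1 - x0,
    # so the path x_t = (1 - t) * x0 + t * x1 is a straight line.
    return x1 - x0


def curved_velocity(x, t):
    # Time-varying velocity with the same total displacement over [0, 1]
    # (the integral of 2 * t over [0, 1] is 1), used here for contrast.
    return 2.0 * t * (x1 - x0)


one_step_straight = euler_sample(straight_velocity, x0, 1)
many_step_straight = euler_sample(straight_velocity, x0, 100)
one_step_curved = euler_sample(curved_velocity, x0, 1)
many_step_curved = euler_sample(curved_velocity, x0, 100)

print(one_step_straight, many_step_straight)
print(one_step_curved, many_step_curved)
```

With the constant velocity, a single Euler step already lands on the target, whereas the time-varying field needs many steps to get close. This is the intuition behind straightening trajectories before reducing the step count.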
Related papers
- ECTSpeech: Enhancing Efficient Speech Synthesis via Easy Consistency Tuning [37.55301116117562]
We propose ECTSpeech, a simple and effective one-step synthesis framework. ECTSpeech incorporates the Easy Consistency Tuning (ECT) strategy into speech synthesis. We show that ECTSpeech achieves audio quality comparable to state-of-the-art methods under single-step sampling.
arXiv Detail & Related papers (2025-10-07T14:44:05Z) - AudioTurbo: Fast Text-to-Audio Generation with Rectified Diffusion [23.250409921931492]
Rectified flow enhances inference speed by learning straight-line ordinary differential equation paths. This approach requires training a flow-matching model from scratch and tends to perform suboptimally, or even poorly, at low step counts. We propose AudioTurbo, which learns first-order ODE paths from deterministic noise sample pairs generated by a pre-trained TTA model.
arXiv Detail & Related papers (2025-05-28T08:33:58Z) - Energy-Based Diffusion Language Models for Text Generation [126.23425882687195]
Energy-based Diffusion Language Model (EDLM) is an energy-based model operating at the full sequence level for each diffusion step. Our framework offers a 1.3$\times$ sampling speedup over existing diffusion models.
arXiv Detail & Related papers (2024-10-28T17:25:56Z) - DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis [12.310318928818546]
We introduce DMOSpeech, a distilled diffusion-based TTS model that achieves both faster inference and superior performance compared to its teacher model. Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude. This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization.
arXiv Detail & Related papers (2024-10-14T21:17:58Z) - ReTok: Replacing Tokenizer to Enhance Representation Efficiency in Large Language Model [9.1108256816605]
We propose a method to improve model representation and processing efficiency by replacing the tokenizers of large language models (LLMs).
Our method can maintain the performance of the model after replacing the tokenizer, while significantly improving the decoding speed for long texts.
arXiv Detail & Related papers (2024-10-06T03:01:07Z) - VoiceFlow: Efficient Text-to-Speech with Rectified Flow Matching [14.7974342537458]
VoiceFlow is an acoustic model that utilizes a rectified flow matching algorithm to achieve high synthesis quality with a limited number of sampling steps.
Subjective and objective evaluations on both single and multi-speaker corpora showed the superior synthesis quality of VoiceFlow compared to the diffusion counterpart.
arXiv Detail & Related papers (2023-09-10T13:47:39Z) - Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion [85.54515118077825]
This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality.
To reduce computational complexity, LinDiff employs a patch-based processing approach that partitions the input signal into small patches.
Our model can synthesize speech of a quality comparable to that of autoregressive models with faster synthesis speed.
arXiv Detail & Related papers (2023-06-09T07:02:43Z) - Delta Tuning: A Comprehensive Study of Parameter Efficient Methods for Pre-trained Language Models [90.24999406296867]
In contrast with the standard fine-tuning, delta tuning only fine-tunes a small portion of the model parameters while keeping the rest untouched.
Recent studies have demonstrated that a series of delta tuning methods with distinct tuned parameter selection could achieve performance on a par with full-parameter fine-tuning.
arXiv Detail & Related papers (2022-03-14T07:56:32Z) - A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement model (DiffuSE) that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z) - Noise Estimation for Generative Diffusion Models [91.22679787578438]
In this work, we present a simple and versatile learning scheme that can adjust the noise parameters for any given number of steps.
Our approach comes at a negligible computation cost.
arXiv Detail & Related papers (2021-04-06T15:46:16Z) - Efficient End-to-End Speech Recognition Using Performers in Conformers [74.71219757585841]
We propose to reduce the complexity of model architectures in addition to model sizes.
The proposed model yields competitive performance on the LibriSpeech corpus with 10 million parameters and linear complexity.
arXiv Detail & Related papers (2020-11-09T05:22:57Z)
This list is automatically generated from the titles and abstracts of the papers on this site.