FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
- URL: http://arxiv.org/abs/2508.17868v1
- Date: Mon, 25 Aug 2025 10:23:24 GMT
- Title: FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
- Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo,
- Abstract summary: A diffusion-based voice conversion model (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity.<n>FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model.<n>It still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion.
- Score: 42.55959060773461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves competitive VC performance compared to FastVoiceGrad, with 6.6-6.9 and 1.8 times faster speed on a GPU and CPU, respectively.
Related papers
- READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework.<n>Our approach first learns highly compressed video latent space via a VAE, significantly reducing the token count to speech generation.<n>We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z) - One-Way Ticket:Time-Independent Unified Encoder for Distilling Text-to-Image Diffusion Models [65.96186414865747]
Text-to-Image (T2I) diffusion models face a trade-off between inference speed and image quality.<n>We introduce the first Time-independent Unified TiUE for the student model UNet architecture.<n>Using a one-pass scheme, TiUE shares encoder features across multiple decoder time steps, enabling parallel sampling.
arXiv Detail & Related papers (2025-05-28T04:23:22Z) - FastVoiceGrad: One-step Diffusion-Based Voice Conversion with Adversarial Conditional Diffusion Distillation [28.847324588324152]
We propose FastVoiceGrad, a one-step diffusion-based VC that reduces the number of iterations from dozens to one.
FastVoiceGrad achieves superior to or comparable to that of previous multi-step diffusion-based VC while enhancing the inference speed.
arXiv Detail & Related papers (2024-09-03T19:19:48Z) - Frieren: Efficient Video-to-Audio Generation Network with Rectified Flow Matching [51.70360630470263]
Video-to-audio (V2A) generation aims to synthesize content-matching audio from silent video.<n>We propose Frieren, a V2A model based on rectified flow matching.<n>Experiments indicate that Frieren achieves state-of-the-art performance in both generation quality and temporal alignment.
arXiv Detail & Related papers (2024-06-01T06:40:22Z) - CoMoSVC: Consistency Model-based Singing Voice Conversion [40.08004069518143]
We propose CoMoSVC, a consistency model-based Singing Voice Conversion method.
CoMoSVC aims to achieve both high-quality generation and high-speed sampling.
Experiments on a single NVIDIA GTX4090 GPU reveal that CoMoSVC has a significantly faster inference speed than the state-of-the-art (SOTA) diffusion-based SVC system.
arXiv Detail & Related papers (2024-01-03T15:47:17Z) - Faster Diffusion: Rethinking the Role of the Encoder for Diffusion Model Inference [95.42299246592756]
We study the UNet encoder and empirically analyze the encoder features.
We find that encoder features change minimally, whereas the decoder features exhibit substantial variations across different time-steps.
We validate our approach on other tasks: text-to-video, personalized generation and reference-guided generation.
arXiv Detail & Related papers (2023-12-15T08:46:43Z) - CoMoSpeech: One-Step Speech and Singing Voice Synthesis via Consistency
Model [41.21042900853639]
We propose a "Co"nsistency "Mo"del-based "Speech" synthesis method, CoMoSpeech, which achieve speech synthesis through a single diffusion sampling step.
By generating audio recordings by a single sampling step, the CoMoSpeech achieves an inference speed more than 150 times faster than real-time.
arXiv Detail & Related papers (2023-05-11T15:51:46Z) - DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes destroyed mel spectrogram and its corresponding step information as input to predict the added Gaussian noise.
Experiments show that DiffSVC can achieve superior conversion performance in terms of naturalness and voice similarity to current state-of-the-art SVC approaches.
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.