Vocoder-Projected Feature Discriminator
- URL: http://arxiv.org/abs/2508.17874v2
- Date: Wed, 27 Aug 2025 02:31:12 GMT
- Title: Vocoder-Projected Feature Discriminator
- Authors: Takuhiro Kaneko, Hirokazu Kameoka, Kou Tanaka, Yuto Kondo
- Abstract summary: In text-to-speech (TTS) and voice conversion (VC), acoustic features, such as mel spectrograms, are typically used as synthesis or conversion targets. We propose a vocoder-projected feature discriminator (VPFD), which uses vocoder features for adversarial training. Experiments on diffusion-based VC distillation demonstrated that a pretrained and frozen vocoder feature extractor is necessary and sufficient to achieve a VC performance comparable to that of waveform discriminators.
- Score: 42.55959060773461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In text-to-speech (TTS) and voice conversion (VC), acoustic features, such as mel spectrograms, are typically used as synthesis or conversion targets owing to their compactness and ease of learning. However, because the ultimate goal is to generate high-quality waveforms, employing a vocoder to convert these features into waveforms and applying adversarial training in the time domain is reasonable. Nevertheless, upsampling the waveform introduces significant time and memory overheads. To address this issue, we propose a vocoder-projected feature discriminator (VPFD), which uses vocoder features for adversarial training. Experiments on diffusion-based VC distillation demonstrated that a pretrained and frozen vocoder feature extractor with a single upsampling step is necessary and sufficient to achieve a VC performance comparable to that of waveform discriminators while reducing the training time and memory consumption by 9.6 and 11.4 times, respectively.
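To make the idea concrete, below is a minimal PyTorch sketch of how a vocoder-projected feature discriminator could be assembled from a pretrained HiFi-GAN-style vocoder: only the vocoder's front end up to the first upsampling step is kept, frozen, and used as a feature extractor, and a small convolutional discriminator operates on those features instead of on synthesized waveforms. The attribute names (`conv_pre`, `ups`), the discriminator layout, and the omission of the generator's residual blocks are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VocoderFeatureExtractor(nn.Module):
    """Frozen front end of a pretrained neural vocoder.

    Keeps the input convolution and only the first few upsampling blocks of a
    HiFi-GAN-style generator, so vocoder features are obtained without
    synthesizing a full waveform. Attribute names follow the common HiFi-GAN
    layout (`conv_pre`, `ups`); the generator's residual blocks are omitted
    for brevity. Adapt to the vocoder actually used.
    """

    def __init__(self, pretrained_vocoder: nn.Module, n_upsamples: int = 1):
        super().__init__()
        self.conv_pre = pretrained_vocoder.conv_pre
        self.ups = nn.ModuleList(pretrained_vocoder.ups[:n_upsamples])
        for p in self.parameters():
            p.requires_grad_(False)  # pretrained and frozen, as in the abstract

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = self.conv_pre(mel)                  # (B, C, T_mel)
        for up in self.ups:
            x = up(F.leaky_relu(x, 0.1))        # a single (or few) upsampling step(s)
        return x                                # vocoder features, not a waveform


class VPFD(nn.Module):
    """Discriminator that operates on vocoder-projected features."""

    def __init__(self, feature_extractor: VocoderFeatureExtractor, in_channels: int):
        super().__init__()
        self.feature_extractor = feature_extractor
        self.disc = nn.Sequential(
            nn.Conv1d(in_channels, 256, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(256, 256, kernel_size=5, stride=2, padding=2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(256, 1, kernel_size=3, padding=1),  # per-frame real/fake logits
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        feats = self.feature_extractor(mel)  # cheap projection instead of full synthesis
        return self.disc(feats)
```

In adversarial training, both the target mel spectrograms and those produced by the distilled VC model would pass through the same frozen projection before the discriminator, so neither full waveform synthesis nor a waveform-domain discriminator is needed during training, which is where the reported reductions in training time and memory come from.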
Related papers
- Latent-Mark: An Audio Watermark Robust to Neural Resynthesis [62.09761127079914]
Latent-Mark is the first zero-bit audio watermarking framework designed to survive semantic compression. Our key insight is that robustness to the encode-decode process requires embedding the watermark within the invariant latent space. Our work inspires future research into universal watermarking frameworks capable of maintaining integrity across increasingly complex and diverse generative distortions.
arXiv Detail & Related papers (2026-03-05T15:51:09Z) - Audio Signal Processing Using Time Domain Mel-Frequency Wavelet Coefficient [0.0]
We propose a method to extract Mel-scale features in the time domain by combining them with the concept of the wavelet transform. Our proposed Time-Domain Mel-Frequency Wavelet Coefficient (TMFWC) technique, together with the reservoir computing methodology, significantly improves the efficiency of audio signal processing.
arXiv Detail & Related papers (2025-10-28T15:31:52Z) - UniverSR: Unified and Versatile Audio Super-Resolution via Vocoder-Free Flow Matching [20.92242470770289]
We present a framework for audio super-resolution that employs a flow matching generative model to capture the conditional distribution of complex-valued spectral coefficients. Experiments show that our model consistently produces high-fidelity 48 kHz audio across diverse upsampling factors.
arXiv Detail & Related papers (2025-10-01T11:04:53Z) - WaveFM: A High-Fidelity and Efficient Vocoder Based on Flow Matching [1.6385815610837167]
WaveFM is a flow matching model for mel-spectrogram conditioned speech synthesis. Our model achieves superior performance in both quality and efficiency compared to previous diffusion vocoders.
arXiv Detail & Related papers (2025-03-20T20:17:17Z) - Wavetable Synthesis Using CVAE for Timbre Control Based on Semantic Label [2.0124254762298794]
This research introduces a method of timbre control in wavetable synthesis that is intuitive and sensible.
Using a conditional variational autoencoder (CVAE), users can select a wavetable and define the timbre with labels such as bright, warm, and rich.
arXiv Detail & Related papers (2024-10-24T10:37:54Z) - Hold Me Tight: Stable Encoder-Decoder Design for Speech Enhancement [1.4037575966075835]
1-D filters on raw audio are hard to train and often suffer from instabilities.
We address these problems with hybrid solutions, combining theory-driven and data-driven approaches.
arXiv Detail & Related papers (2024-08-30T15:49:31Z) - PeriodGrad: Towards Pitch-Controllable Neural Vocoder Based on a
Diffusion Probabilistic Model [12.292092677396349]
This paper presents a neural vocoder based on a denoising diffusion probabilistic model (DDPM).
Our model aims to accurately capture the periodic structure of speech waveforms by incorporating explicit periodic signals.
Experimental results show that our model improves sound quality and provides better pitch control than conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2024-02-22T16:47:15Z) - Diffsound: Discrete Diffusion Model for Text-to-sound Generation [78.4128796899781]
We propose a novel text-to-sound generation framework that consists of a text encoder, a Vector Quantized Variational Autoencoder (VQ-VAE), a decoder, and a vocoder.
The framework first uses the decoder to transform the text features extracted by the text encoder into a mel-spectrogram with the help of the VQ-VAE, and then the vocoder transforms the generated mel-spectrogram into a waveform.
arXiv Detail & Related papers (2022-07-20T15:41:47Z) - SpecGrad: Diffusion Probabilistic Model based Neural Vocoder with Adaptive Noise Spectral Shaping [51.698273019061645]
SpecGrad adapts the diffusion noise so that its time-varying spectral envelope becomes close to the conditioning log-mel spectrogram.
It is processed in the time-frequency domain to keep the computational cost almost the same as that of conventional DDPM-based neural vocoders.
arXiv Detail & Related papers (2022-03-31T02:08:27Z) - iSTFTNet: Fast and Lightweight Mel-Spectrogram Vocoder Incorporating Inverse Short-Time Fourier Transform [38.271530231451834]
A mel-spectrogram vocoder must solve three inverse problems: recovery of the original-scale magnitude spectrogram, phase reconstruction, and frequency-to-time conversion.
A typical convolutional mel-spectrogram vocoder solves these problems jointly and implicitly using a convolutional neural network.
We propose iSTFTNet, which replaces some output-side layers of the mel-spectrogram vocoder with the inverse short-time Fourier transform; a minimal sketch of such an output head follows the list below.
arXiv Detail & Related papers (2022-03-04T16:05:48Z) - VAW-GAN for Singing Voice Conversion with Non-parallel Training Data [81.79070894458322]
We propose a singing voice conversion framework based on VAW-GAN.
We train an encoder to disentangle singer identity and singing prosody (F0) from phonetic content.
By conditioning on singer identity and F0, the decoder generates output spectral features with unseen target singer identity.
arXiv Detail & Related papers (2020-08-10T09:44:10Z) - Transforming Spectrum and Prosody for Emotional Voice Conversion with Non-Parallel Training Data [91.92456020841438]
Many studies require parallel speech data between different emotional patterns, which is not practical in real life.
We propose a CycleGAN network to find an optimal pseudo pair from non-parallel training data.
We also study the use of the continuous wavelet transform (CWT) to decompose F0 into ten temporal scales, which describe speech prosody at different time resolutions.
arXiv Detail & Related papers (2020-02-01T12:36:55Z)
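The iSTFTNet entry above replaces the output-side layers of a mel-spectrogram vocoder with an inverse short-time Fourier transform, so the network only has to predict a magnitude and phase spectrogram at reduced temporal resolution. The sketch below illustrates such an output head in PyTorch; the FFT size, hop length, phase parameterization, and layer shapes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn


class ISTFTHead(nn.Module):
    """Output head in the spirit of iSTFTNet: predict the magnitude and phase of a
    small-FFT spectrogram, then let the inverse STFT perform the final
    frequency-to-time conversion (hyperparameters here are assumptions)."""

    def __init__(self, in_channels: int, n_fft: int = 16, hop_length: int = 4):
        super().__init__()
        self.n_fft = n_fft
        self.hop_length = hop_length
        n_bins = n_fft // 2 + 1                      # one-sided spectrum
        self.to_mag = nn.Conv1d(in_channels, n_bins, kernel_size=7, padding=3)
        self.to_phase = nn.Conv1d(in_channels, n_bins, kernel_size=7, padding=3)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        mag = torch.exp(self.to_mag(hidden))         # positive magnitudes, (B, n_bins, T)
        phase = self.to_phase(hidden)                # phase in radians (one simple choice)
        spec = torch.polar(mag, phase)               # complex spectrogram
        window = torch.hann_window(self.n_fft, device=hidden.device)
        return torch.istft(spec, n_fft=self.n_fft, hop_length=self.hop_length,
                           window=window)            # (B, samples)
```

Because the inverse STFT handles the last stage of upsampling analytically, the convolutional stack before this head can be shallower and operate at a lower temporal resolution, which is what makes the vocoder fast and lightweight.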