A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis
- URL: http://arxiv.org/abs/2406.12164v2
- Date: Tue, 9 Jul 2024 18:21:48 GMT
- Title: A Mel Spectrogram Enhancement Paradigm Based on CWT in Speech Synthesis
- Authors: Guoqiang Hu, Huaning Tan, Ruilai Li
- Abstract summary: We propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT).
This paradigm introduces an additional task: predicting a more detailed wavelet spectrogram which, like the post-processing network, takes the Mel spectrogram output by the decoder as its input.
The experimental results demonstrate that speech synthesised with the enhancement paradigm achieves a higher MOS, improving on the Tacotron2 and FastSpeech2 baselines by 0.14 and 0.09, respectively.
- Score: 3.9940425551415597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Acoustic features play an important role in improving the quality of synthesised speech. Currently, the Mel spectrogram is a widely employed acoustic feature in most acoustic models. However, because fine-grained detail is lost in its Fourier transform process, the clarity of speech synthesised from Mel spectrograms is compromised wherever the signal changes abruptly. In order to obtain a more detailed Mel spectrogram, we propose a Mel spectrogram enhancement paradigm based on the continuous wavelet transform (CWT). This paradigm introduces an additional task: predicting a more detailed wavelet spectrogram which, like the post-processing network, takes the Mel spectrogram output by the decoder as its input. We choose Tacotron2 and FastSpeech2 for experimental validation in order to test autoregressive (AR) and non-autoregressive (NAR) speech systems, respectively. The experimental results demonstrate that speech synthesised using the models with the Mel spectrogram enhancement paradigm exhibits a higher MOS, with improvements of 0.14 and 0.09 over the respective baseline models. These findings provide some validation for the universality of the enhancement paradigm, as they demonstrate its success across different architectures.
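The core mechanism is simple to picture: decompose the (log-)Mel spectrogram with the CWT to expose detail that a Fourier-only pipeline smooths away. Below is a minimal, illustrative sketch using PyWavelets; the Morlet wavelet, the scale range, and the Mel analysis settings are all assumptions for illustration, not the authors' configuration.

```python
import numpy as np
import librosa
import pywt

# A synthetic test signal stands in for real speech here.
sr = 22050
y = librosa.chirp(fmin=100, fmax=4000, sr=sr, duration=1.0)

# Standard log-Mel spectrogram (80 bins is a common TTS choice; the paper's
# exact analysis settings may differ).
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
log_mel = np.log(np.clip(mel, 1e-5, None))

# Derive a "wavelet spectrogram" by running the CWT along the time axis of
# each Mel channel. The Morlet wavelet and 10 scales are illustrative choices.
scales = np.arange(1, 11)
wavelet_spec = np.stack(
    [pywt.cwt(log_mel[band], scales, "morl")[0] for band in range(log_mel.shape[0])]
)
print(wavelet_spec.shape)  # (n_mels, n_scales, n_frames) = (80, 10, 87)
```

In the enhancement paradigm, a prediction of such a wavelet spectrogram would serve as the additional training target alongside the usual Mel reconstruction loss.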
Related papers
- Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition (a toy version of this loop is sketched below).
arXiv Detail & Related papers (2024-07-11T14:36:53Z)
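To make "autoregressive generation of continuous frames" concrete, here is a toy PyTorch loop in which each mel frame is predicted from the previous one. The decoder below is a hypothetical stand-in: MELLE's actual model adds text conditioning, a latent sampling module, and learned stop prediction.

```python
import torch
import torch.nn as nn

class TinyARDecoder(nn.Module):
    """Hypothetical stand-in for an AR acoustic decoder; not MELLE's architecture."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.rnn = nn.GRUCell(n_mels, d_model)
        self.proj = nn.Linear(d_model, n_mels)

    def forward(self, prev_frame, state):
        state = self.rnn(prev_frame, state)
        return self.proj(state), state

# Autoregressive generation: each continuous mel frame is conditioned on the
# previous one (text conditioning and stop prediction omitted for brevity).
dec = TinyARDecoder()
frame = torch.zeros(1, 80)          # all-zero "go" frame
state = torch.zeros(1, 256)
frames = []
for _ in range(200):                # fixed length; MELLE learns when to stop
    frame, state = dec(frame, state)
    frames.append(frame)
mel = torch.stack(frames, dim=1)    # (batch, n_frames, n_mels)
```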
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Towards Improving Harmonic Sensitivity and Prediction Stability for Singing Melody Extraction [36.45127093978295]
We propose an input feature modification and a training objective modification based on two assumptions.
To enhance the model's sensitivity to the trailing harmonics, we modify the Combined Frequency and Periodicity representation using the discrete z-transform.
We apply these modifications to several models, including MSNet, FTANet, and a newly introduced model, PianoNet, modified from a piano transcription network.
arXiv Detail & Related papers (2023-08-04T21:59:40Z)
- Towards Robust FastSpeech 2 by Modelling Residual Multimodality [4.4904382374090765]
State-of-the-art non-autoregressive text-to-speech models based on FastSpeech 2 can efficiently synthesise high-fidelity and natural speech.
We observe characteristic audio distortions in expressive speech datasets.
TVC-GMM, a trivariate-chain Gaussian mixture model over spectrogram bins, reduces spectrogram over-smoothing and improves perceptual audio quality, in particular for expressive datasets (the mixture-density idea is sketched below).
arXiv Detail & Related papers (2023-06-02T11:03:26Z)
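The idea behind TVC-GMM can be illustrated with a plain mixture-density head: instead of regressing a single value per spectrogram bin (which averages over multimodal targets and over-smooths), the model predicts mixture parameters and is trained by negative log-likelihood. This sketch uses a univariate mixture per mel bin for brevity; the paper's trivariate-chain structure over adjacent time-frequency bins is not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureSpectrogramHead(nn.Module):
    """Predicts a K-component univariate Gaussian mixture per mel bin.
    A simplification: TVC-GMM models trivariate chains of adjacent bins."""
    def __init__(self, d_model=256, n_mels=80, k=5):
        super().__init__()
        self.k = k
        self.out = nn.Linear(d_model, n_mels * k * 3)  # weight, mean, log-std

    def nll(self, h, target):
        # h: (B, T, d_model), target: (B, T, n_mels)
        B, T, _ = h.shape
        params = self.out(h).view(B, T, -1, self.k, 3)
        logit_w, mu, log_sigma = params.unbind(-1)
        log_w = F.log_softmax(logit_w, dim=-1)
        dist = torch.distributions.Normal(mu, log_sigma.exp())
        log_prob = dist.log_prob(target.unsqueeze(-1))   # (B, T, n_mels, K)
        return -torch.logsumexp(log_w + log_prob, dim=-1).mean()
```

Sampling from the predicted mixture, rather than taking its mean, is what can restore the missing spectral detail.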
- MAST: Multiscale Audio Spectrogram Transformers [53.06337011259031]
We present the Multiscale Audio Spectrogram Transformer (MAST) for audio classification, which brings the concept of multiscale feature hierarchies to the Audio Spectrogram Transformer (AST).
In practice, MAST significantly outperforms AST, by an average of 3.4% in accuracy across 8 speech and non-speech tasks from the LAPE Benchmark.
arXiv Detail & Related papers (2022-11-02T23:34:12Z)
- Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features [51.924340387119415]
Experimental results on the ASVspoof 2019 LA dataset show that the proposed system is very effective for the audio deepfake detection task, achieving an equal error rate (EER) of 0.43%, which surpasses almost all systems (the spectrogram front end is sketched below).
arXiv Detail & Related papers (2022-08-02T02:46:16Z)
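The "real plus imaginary spectrogram" front end is straightforward to reproduce in spirit: keep both components of the complex STFT rather than the magnitude alone, so phase information is preserved. The settings below are illustrative, and the paper's F0 extraction branch is omitted.

```python
import torch

def real_imag_spectrogram(wav, n_fft=512, hop_length=128):
    """Stack the real and imaginary parts of the complex STFT as two channels,
    instead of discarding phase via a magnitude spectrogram."""
    spec = torch.stft(
        wav, n_fft=n_fft, hop_length=hop_length,
        window=torch.hann_window(n_fft), return_complex=True,
    )
    return torch.stack([spec.real, spec.imag], dim=1)  # (batch, 2, freq, frames)

feats = real_imag_spectrogram(torch.randn(4, 16000))
print(feats.shape)  # torch.Size([4, 2, 257, 126])
```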
- DiffSinger: Diffusion Acoustic Model for Singing Voice Synthesis [53.19363127760314]
DiffSinger is a parameterized Markov chain that iteratively converts noise into a mel-spectrogram conditioned on the music score (the diffusion steps are sketched after this entry).
The evaluations conducted on the Chinese singing dataset demonstrate that DiffSinger outperforms state-of-the-art SVS work by a notable margin.
arXiv Detail & Related papers (2021-05-06T05:21:42Z)
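DiffSinger's Markov chain follows the standard denoising-diffusion recipe: a fixed forward process corrupts the mel-spectrogram with Gaussian noise, and a learned network predicts that noise so the chain can be reversed. The sketch below shows the closed-form forward step and one reverse step; the linear beta schedule and step count are illustrative, and DiffSinger's score conditioning and shallow-diffusion mechanism are omitted.

```python
import torch

T = 100
betas = torch.linspace(1e-4, 0.06, T)       # illustrative schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def q_sample(x0, t, noise):
    """Forward process: noise a clean mel x0 to step t in closed form."""
    ab = alpha_bar[t]
    return ab.sqrt() * x0 + (1 - ab).sqrt() * noise

def p_step(xt, t, eps_pred):
    """One reverse step given the model's noise prediction eps_pred."""
    coef = betas[t] / (1 - alpha_bar[t]).sqrt()
    mean = (xt - coef * eps_pred) / alphas[t].sqrt()
    if t == 0:
        return mean
    return mean + betas[t].sqrt() * torch.randn_like(xt)
```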
- Universal MelGAN: A Robust Neural Vocoder for High-Fidelity Waveform Generation in Multiple Domains [1.8047694351309207]
We propose Universal MelGAN, a vocoder that synthesizes high-fidelity speech in multiple domains.
The MelGAN-based structure is trained on a dataset of hundreds of speakers.
We added multi-resolution spectrogram discriminators to sharpen the spectral resolution of the generated waveforms (the discriminator inputs are sketched below).
arXiv Detail & Related papers (2020-11-19T03:35:45Z)
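Multi-resolution spectrogram discriminators each see the same waveform analysed at a different STFT resolution, so artefacts are penalised at several time-frequency trade-offs. Here is a sketch of the input pipeline; the resolutions are common choices rather than necessarily the paper's, and the discriminator networks themselves are omitted.

```python
import torch

# Common multi-resolution STFT settings (illustrative; the paper's exact
# choices may differ): (n_fft, hop_length, win_length).
RESOLUTIONS = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]

def multi_resolution_specs(wav):
    """Magnitude spectrograms at several resolutions, one per discriminator."""
    specs = []
    for n_fft, hop, win in RESOLUTIONS:
        s = torch.stft(
            wav, n_fft=n_fft, hop_length=hop, win_length=win,
            window=torch.hann_window(win), return_complex=True,
        )
        specs.append(s.abs())
    return specs

for s in multi_resolution_specs(torch.randn(2, 22050)):
    print(s.shape)  # (2, 257, 173), (2, 513, 87), (2, 1025, 44)
```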
- Unsupervised Cross-Domain Speech-to-Speech Conversion with Time-Frequency Consistency [14.062850439230111]
We propose a condition encouraging spectrogram consistency during the adversarial training procedure (one possible reading is sketched after this entry).
Our experimental results on the Librispeech corpus show that the model trained with TF consistency provides perceptually better speech-to-speech conversion quality.
arXiv Detail & Related papers (2020-05-15T22:27:07Z)
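One plausible reading of the TF-consistency condition is a penalty that ties the generated waveform back to the spectrogram it was synthesised from. The sketch below implements that reading and should not be taken as the paper's exact formulation.

```python
import torch

def tf_consistency_loss(gen_wav, gen_spec, n_fft=1024, hop_length=256):
    """Penalise disagreement between the spectrogram the model produced and
    the STFT of the waveform synthesised from it (one reading of the paper's
    time-frequency consistency condition)."""
    stft_mag = torch.stft(
        gen_wav, n_fft=n_fft, hop_length=hop_length,
        window=torch.hann_window(n_fft), return_complex=True,
    ).abs()
    return torch.nn.functional.l1_loss(stft_mag, gen_spec)
```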
- VaPar Synth -- A Variational Parametric Model for Audio Synthesis [78.3405844354125]
We present VaPar Synth - a Variational Parametric Synthesizer which utilizes a conditional variational autoencoder (CVAE) trained on a suitable parametric representation.
We demonstrate our proposed model's capabilities via the reconstruction and generation of instrumental tones with flexible control over their pitch.
arXiv Detail & Related papers (2020-03-30T16:05:47Z)