Maximum Voiced Frequency Estimation: Exploiting Amplitude and Phase Spectra
- URL: http://arxiv.org/abs/2006.00521v1
- Date: Sun, 31 May 2020 13:40:46 GMT
- Title: Maximum Voiced Frequency Estimation: Exploiting Amplitude and Phase Spectra
- Authors: Thomas Drugman, Yannis Stylianou
- Abstract summary: This paper proposes a new approach for MVF estimation which exploits both amplitude and phase spectra.
It is shown that phase conveys relevant information about the harmonicity of the voice signal, and that it can be jointly used with features derived from the amplitude spectrum.
The proposed technique is compared to two state-of-the-art methods, and shows a superior performance in both objective and subjective evaluations.
- Score: 22.675699190161417
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Maximum Voiced Frequency (MVF) is used in various speech models as the
spectral boundary separating periodic and aperiodic components during the
production of voiced sounds. Recent studies have shown that its proper
estimation and modeling enhance the quality of statistical parametric speech
synthesizers. Contrastingly, these same methods of MVF estimation have been
reported to degrade the performance of singing voice synthesizers. This paper
proposes a new approach for MVF estimation which exploits both amplitude and
phase spectra. It is shown that phase conveys relevant information about the
harmonicity of the voice signal, and that it can be jointly used with features
derived from the amplitude spectrum. This information is further integrated
into a maximum likelihood criterion which provides a decision about the MVF
estimate. The proposed technique is compared to two state-of-the-art methods,
and shows a superior performance in both objective and subjective evaluations.
Perceptual tests indicate a drastic improvement in high-pitched voices.
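The pipeline described in the abstract (per-harmonic cues taken from both the amplitude and the phase spectrum, fused by a maximum-likelihood boundary decision) can be illustrated compactly. The Python sketch below is a hedged illustration under assumptions made here, not the authors' implementation: the two cues computed in harmonic_features (a peak-to-valley amplitude ratio and a local phase-roughness measure), the Gaussian class models, and all constants are hypothetical stand-ins.
```python
import numpy as np
from scipy.stats import multivariate_normal

def harmonic_features(frame, fs, f0, nfft=4096):
    """Per-harmonic amplitude and phase cues (illustrative stand-ins)."""
    spec = np.fft.rfft(frame * np.hanning(len(frame)), nfft)
    amp = np.abs(spec) + 1e-12
    phase = np.unwrap(np.angle(spec))
    spacing = f0 * nfft / fs                  # FFT bins between harmonics
    n_harm = int((fs / 2) // f0) - 1          # harmonics safely below Nyquist
    feats = []
    for k in range(1, n_harm + 1):
        b = int(round(k * spacing))           # bin nearest harmonic k
        v = min(b + int(round(spacing / 2)), len(amp) - 3)
        # Amplitude cue: harmonic peak vs. inter-harmonic valley (dB);
        # large for a true harmonic, near 0 dB in a noise-dominated band.
        a = 20.0 * np.log10(amp[b] / np.median(amp[v - 2:v + 3]))
        # Phase cue: roughness of the unwrapped phase around the peak;
        # roughly linear for a clean harmonic, erratic for noise.
        lo, hi = max(b - 8, 1), min(b + 8, len(amp) - 1)
        p = np.std(np.diff(phase[lo:hi]))
        feats.append((a, p))
    return np.array(feats)

def estimate_mvf(feats, f0, voiced_model, unvoiced_model):
    """Choose the boundary k* maximizing the cumulative log-likelihood
    ratio: harmonics 1..k scored as voiced, harmonics above as unvoiced."""
    llr = voiced_model.logpdf(feats) - unvoiced_model.logpdf(feats)
    scores = [llr[:k].sum() - llr[k:].sum() for k in range(1, len(llr) + 1)]
    return (int(np.argmax(scores)) + 1) * f0  # MVF in Hz

if __name__ == "__main__":
    fs, f0 = 16000, 200.0
    t = np.arange(int(0.032 * fs)) / fs
    # Toy frame: harmonics up to 3 kHz, white noise everywhere.
    frame = sum(np.sin(2 * np.pi * k * f0 * t) for k in range(1, 16))
    frame = frame + 0.3 * np.random.randn(len(t))
    # Hypothetical class models (hand-set means/covariances, not trained).
    voiced = multivariate_normal([20.0, 0.3], np.diag([25.0, 0.05]))
    unvoiced = multivariate_normal([3.0, 1.0], np.diag([25.0, 0.25]))
    feats = harmonic_features(frame, fs, f0)
    print("MVF estimate (Hz):", estimate_mvf(feats, f0, voiced, unvoiced))
```
In the actual method the voiced/unvoiced class distributions would be learned from labeled data; they are fixed by hand here only so the example runs end to end.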
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Adversarial Training of Denoising Diffusion Model Using Dual Discriminators for High-Fidelity Multi-Speaker TTS [0.0]
The diffusion model is capable of generating high-quality data through a probabilistic approach.
However, it suffers from slow generation speed because it requires a large number of time steps.
We propose a speech synthesis model with two discriminators: a diffusion discriminator for learning the distribution of the reverse process and a spectrogram discriminator for learning the distribution of the generated data.
arXiv Detail & Related papers (2023-08-03T07:22:04Z)
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to generate audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech signal using a noise detector, and assigns the two sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- A Study on Speech Enhancement Based on Diffusion Probabilistic Model [63.38586161802788]
We propose a diffusion probabilistic model-based speech enhancement (DiffuSE) model that aims to recover clean speech signals from noisy signals.
The experimental results show that DiffuSE yields performance that is comparable to related audio generative models on the standardized Voice Bank corpus task.
arXiv Detail & Related papers (2021-07-25T19:23:18Z)
- Glottal source estimation robustness: A comparison of sensitivity of voice source estimation techniques [11.97036509133719]
This paper addresses the problem of estimating the voice source directly from speech waveforms.
A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase.
arXiv Detail & Related papers (2020-05-24T08:13:47Z)
- Mutual Information Maximization for Effective Lip Reading [99.11600901751673]
We propose to introduce mutual information constraints at both the local feature level and the global sequence level.
By combining these two advantages, the proposed method is expected to be both discriminative and robust for effective lip reading.
arXiv Detail & Related papers (2020-03-13T18:47:42Z)
- The Deterministic plus Stochastic Model of the Residual Signal and its Applications [13.563526970105988]
This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal.
The applicability of the DSM in two fields of speech processing is then studied.
arXiv Detail & Related papers (2019-12-29T07:52:37Z)
- A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis [11.481208551940998]
We propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual.
The proposed residual model is integrated within a HMM-based speech synthesizer.
Results show a significant improvement for both male and female voices.
arXiv Detail & Related papers (2019-12-29T07:26:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.