The Deterministic plus Stochastic Model of the Residual Signal and its
Applications
- URL: http://arxiv.org/abs/2001.01000v1
- Date: Sun, 29 Dec 2019 07:52:37 GMT
- Title: The Deterministic plus Stochastic Model of the Residual Signal and its
Applications
- Authors: Thomas Drugman, Thierry Dutoit
- Abstract summary: This manuscript presents a Deterministic plus Stochastic Model (DSM) of the residual signal.
The applicability of the DSM in two fields of speech processing is then studied.
- Score: 13.563526970105988
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The modeling of speech production often relies on a source-filter approach.
Although methods parameterizing the filter have nowadays reached a certain
maturity, there is still a lot to be gained for several speech processing
applications in finding an appropriate excitation model. This manuscript
presents a Deterministic plus Stochastic Model (DSM) of the residual signal.
The DSM consists of two contributions acting in two distinct spectral bands
delimited by a maximum voiced frequency. Both components are extracted from an
analysis performed on a speaker-dependent dataset of pitch-synchronous residual
frames. The deterministic part models the low-frequency contents and arises
from an orthonormal decomposition of these frames. As for the stochastic
component, it is a high-frequency noise modulated both in time and frequency.
Some interesting phonetic and computational properties of the DSM are also
highlighted. The applicability of the DSM in two fields of speech processing is
then studied. First, it is shown that incorporating the DSM vocoder in
HMM-based speech synthesis enhances the delivered quality. The proposed
approach turns out to significantly outperform the traditional pulse excitation
and provides a quality equivalent to STRAIGHT. In a second application, the
potential of glottal signatures derived from the proposed DSM is investigated
for speaker identification purposes. Interestingly, these signatures are shown
to lead to better recognition rates than other glottal-based methods.
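To make the two-band model concrete, here is a minimal sketch of the analysis and synthesis steps in Python. It is an illustration under stated assumptions, not the paper's implementation: the function names are mine, the orthonormal decomposition is read as a PCA of the residual frames, the band split uses a Butterworth filter at the maximum voiced frequency Fm, and the stochastic component's spectral (autoregressive) envelope is omitted in favor of a crude triangular time envelope.

```python
import numpy as np
from scipy.signal import butter, lfilter

def eigenresidual(frames: np.ndarray) -> np.ndarray:
    """Deterministic component: first eigenvector of a matrix of
    pitch-synchronous, GCI-centred, length- and energy-normalised
    residual frames (reading the paper's orthonormal decomposition
    as a PCA)."""
    centred = frames - frames.mean(axis=0)
    # Leading right-singular vector = first principal component.
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return vt[0]

def synth_voiced_excitation(eig, period, n_periods, fs, fm=4000.0):
    """Voiced excitation: eigenresidual pulses below the maximum
    voiced frequency fm, plus time-modulated noise above it."""
    b_lo, a_lo = butter(4, fm / (fs / 2), btype="low")
    b_hi, a_hi = butter(4, fm / (fs / 2), btype="high")
    # Resample the eigenresidual to the target pitch period.
    pulse = np.interp(np.linspace(0.0, 1.0, period),
                      np.linspace(0.0, 1.0, len(eig)), eig)
    deterministic = lfilter(b_lo, a_lo, np.tile(pulse, n_periods))
    noise = lfilter(b_hi, a_hi, np.random.randn(period * n_periods))
    # Crude pitch-synchronous triangular energy envelope for the noise;
    # the paper also applies a spectral (autoregressive) envelope.
    envelope = np.tile(np.bartlett(period), n_periods)
    return deterministic + noise * envelope
```

In use, one eigenresidual would be estimated per speaker from the dataset of pitch-synchronous residual frames, and the synthesized excitation would then drive the usual synthesis filter of the source-filter model.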
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using
Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Dynamic Spectrum Mixer for Visual Recognition [17.180863898764194]
We propose a content-adaptive yet computationally efficient structure, dubbed the Dynamic Spectrum Mixer (DSM).
DSM represents token interactions in the frequency domain by employing the Cosine Transform.
It can learn long-term spatial dependencies with log-linear complexity.
arXiv Detail & Related papers (2023-09-13T04:51:15Z)
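As a rough sketch of the cosine-transform token mixing described in the Dynamic Spectrum Mixer entry above: the block below mixes tokens in the DCT domain along the token axis, which is what yields the log-linear complexity. The learned, content-adaptive weights of the actual model are replaced here by a fixed, hypothetical gain vector.

```python
import numpy as np
from scipy.fft import dct, idct

def spectrum_mix(tokens: np.ndarray, gains: np.ndarray) -> np.ndarray:
    """tokens: (n_tokens, dim); gains: (n_tokens,) per-frequency weights."""
    # DCT along the token axis costs O(N log N), hence the log-linear
    # complexity of frequency-domain token mixing.
    spec = dct(tokens, type=2, norm="ortho", axis=0)
    spec *= gains[:, None]              # reweight token-frequency components
    return idct(spec, type=2, norm="ortho", axis=0)

tokens = np.random.randn(196, 64)       # e.g. 14x14 patch tokens, 64-d
gains = np.linspace(1.0, 0.1, 196)      # hypothetical low-pass-like profile
mixed = spectrum_mix(tokens, gains)
```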
- From Discrete Tokens to High-Fidelity Audio Using Multi-Band Diffusion [84.138804145918]
Deep generative models can generate high-fidelity audio conditioned on various types of representations.
These models are prone to producing audible artifacts when the conditioning is flawed or imperfect.
We propose a high-fidelity multi-band diffusion-based framework that generates any type of audio modality from low-bitrate discrete representations.
arXiv Detail & Related papers (2023-08-02T22:14:29Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech achieves a significant improvement in inference latency, with speedups of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
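The "repeatedly masks and predicts" decoding that the TranSpeech entry above refers to can be sketched as follows; `model`, the mask index, and the linear unmasking schedule are all stand-in assumptions, not the paper's actual components.

```python
import numpy as np

MASK_ID = 0  # hypothetical index of the [MASK] unit

def mask_predict(model, length, n_iter=10):
    """Non-autoregressive decoding: predict all units in parallel, then
    repeatedly re-mask and re-predict the least confident positions."""
    units = np.full(length, MASK_ID)
    for t in range(1, n_iter + 1):
        probs = model(units)             # (length, vocab_size) probabilities
        units = probs.argmax(axis=-1)
        confidence = probs.max(axis=-1)
        n_mask = int(length * (1 - t / n_iter))
        if n_mask == 0:
            break
        # Re-mask the least confident positions for the next iteration.
        units[np.argsort(confidence)[:n_mask]] = MASK_ID
    return units
```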
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence has been predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Generalizing Face Forgery Detection with High-frequency Features [63.33397573649408]
Current CNN-based detectors tend to overfit to method-specific color textures and thus fail to generalize.
We propose to utilize high-frequency noise for face forgery detection, via two modules.
The first is a multi-scale high-frequency feature extraction module that extracts high-frequency noise at multiple scales.
The second is a residual-guided spatial attention module that guides the low-level RGB feature extractor to concentrate on forgery traces from a new perspective.
arXiv Detail & Related papers (2021-03-23T08:19:21Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Maximum Voiced Frequency Estimation: Exploiting Amplitude and Phase Spectra [22.675699190161417]
This paper proposes a new approach for MVF estimation which exploits both amplitude and phase spectra.
It is shown that phase conveys relevant information about the harmonicity of the voice signal, and that it can be jointly used with features derived from the amplitude spectrum.
The proposed technique is compared to two state-of-the-art methods, and shows a superior performance in both objective and subjective evaluations.
arXiv Detail & Related papers (2020-05-31T13:40:46Z)
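For intuition about the MVF estimation task in the entry above, here is a deliberately simplified amplitude-only estimator: it walks up the harmonics of f0 and stops once the local harmonic-to-floor ratio falls below a threshold. The proposed method additionally exploits phase-based harmonicity features and a more principled decision rule; the 10 dB threshold and windowing choices here are arbitrary assumptions.

```python
import numpy as np

def estimate_mvf(frame, fs, f0, threshold_db=10.0, n_fft=4096):
    """Return the highest harmonic frequency that still looks harmonic.
    Assumes the FFT bin spacing fs / n_fft is much smaller than f0."""
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame)), n=n_fft))
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    mvf = f0
    k = 1
    while (k + 0.5) * f0 < fs / 2:
        band = (freqs > (k - 0.5) * f0) & (freqs < (k + 0.5) * f0)
        peak = spectrum[band].max()                 # harmonic peak
        floor = np.median(spectrum[band]) + 1e-12   # inter-harmonic floor
        if 20.0 * np.log10(peak / floor) < threshold_db:
            break                                   # band no longer harmonic
        mvf = k * f0
        k += 1
    return mvf
```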
- Audio-Visual Decision Fusion for WFST-based and seq2seq Models [3.2771898634434997]
Under noisy conditions, speech recognition systems suffer from high Word Error Rates (WER).
We propose novel methods to fuse information from audio and visual modalities at inference time.
We show that our methods give significant improvements over acoustic-only WER.
arXiv Detail & Related papers (2020-01-29T13:45:08Z)
- A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis [11.481208551940998]
We propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual.
The proposed residual model is integrated within an HMM-based speech synthesizer.
Results show a significant improvement for both male and female voices.
arXiv Detail & Related papers (2019-12-29T07:26:47Z)