Diffusion-Based Speech Enhancement with Joint Generative and Predictive
Decoders
- URL: http://arxiv.org/abs/2305.10734v2
- Date: Wed, 28 Feb 2024 12:10:19 GMT
- Title: Diffusion-Based Speech Enhancement with Joint Generative and Predictive
Decoders
- Authors: Hao Shi, Kazuki Shimada, Masato Hirano, Takashi Shibuya, Yuichiro
Koyama, Zhi Zhong, Shusuke Takahashi, Tatsuya Kawahara, Yuki Mitsufuji
- Abstract summary: We propose a unified system that jointly uses generative and predictive decoders across two levels.
Experiments conducted on the Voice-Bank dataset demonstrate that incorporating predictive information leads to faster decoding and higher PESQ scores.
- Score: 38.78712921188612
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion-based generative speech enhancement (SE) has recently received
attention, but reverse diffusion remains time-consuming. One solution is to
initialize the reverse diffusion process with enhanced features estimated by a
predictive SE system. However, this pipeline structure does not allow for a
combined use of generative and predictive decoders. A predictive decoder lets
us exploit the complementarity between predictive and diffusion-based
generative SE. In this paper, we propose a unified system that jointly uses
generative and predictive decoders across two
levels. The encoder encodes both generative and predictive information at the
shared encoding level. At the decoded-feature level, we fuse the features
decoded by the generative and predictive decoders. Specifically, the two SE
modules are fused in the initial and final diffusion steps: the initial fusion
initializes the diffusion process with the predictive SE to improve
convergence, and the final fusion combines the two complementary SE outputs to
enhance SE performance. Experiments conducted on the Voice-Bank dataset
demonstrate that incorporating predictive information leads to faster decoding
and higher PESQ scores compared with other score-based diffusion SE (StoRM and
SGMSE+).
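The two fusion points described in the abstract can be sketched in code. This is a minimal illustrative sketch, not the paper's actual implementation: the decoder, the reverse diffusion step, the noise scale `sigma_init`, and the blend weight `alpha` are all toy assumptions standing in for the real SE modules.

```python
# Hypothetical sketch of the two fusion points: (1) initial fusion starts the
# reverse diffusion from the predictive estimate instead of pure noise, and
# (2) final fusion blends the generative and predictive outputs.
import numpy as np

rng = np.random.default_rng(0)

def predictive_decoder(noisy):
    """Stand-in for a predictive SE decoder (here: trivial attenuation)."""
    return 0.8 * noisy

def reverse_diffusion_step(x, t):
    """Stand-in for one score-based reverse diffusion update."""
    return x - 0.1 * t * x  # toy contraction; a real model uses a learned score

def enhance(noisy, num_steps=5, sigma_init=0.1, alpha=0.5):
    pred = predictive_decoder(noisy)
    # Initial fusion: initialize reverse diffusion from the predictive
    # estimate plus small noise, so fewer reverse steps are needed.
    x = pred + sigma_init * rng.standard_normal(noisy.shape)
    for t in range(num_steps, 0, -1):
        x = reverse_diffusion_step(x, t / num_steps)
    # Final fusion: combine the two complementary SE outputs.
    return alpha * x + (1.0 - alpha) * pred

noisy = rng.standard_normal(16)
enhanced = enhance(noisy)
print(enhanced.shape)  # (16,)
```

In this toy setup, warm-starting from `pred` plays the role of the initial fusion that speeds convergence, and the linear blend plays the role of the final fusion that combines complementary outputs.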
Related papers
- Take an Irregular Route: Enhance the Decoder of Time-Series Forecasting
Transformer [9.281993269355544]
We propose FPPformer to utilize bottom-up and top-down architectures in encoder and decoder to build the full and rational hierarchy.
Extensive experiments with six state-of-the-art benchmarks verify the promising performances of FPPformer.
arXiv Detail & Related papers (2023-12-10T06:50:56Z)
- Complexity Matters: Rethinking the Latent Space for Generative Modeling [65.64763873078114]
In generative modeling, numerous successful approaches leverage a low-dimensional latent space, e.g., Stable Diffusion.
In this study, we aim to shed light on this under-explored topic by rethinking the latent space from the perspective of model complexity.
arXiv Detail & Related papers (2023-07-17T07:12:29Z)
- Denoising Diffusion Autoencoders are Unified Self-supervised Learners [58.194184241363175]
This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners.
DDAE has already learned strongly linear-separable representations within its intermediate layers without auxiliary encoders.
Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet.
arXiv Detail & Related papers (2023-03-17T04:20:47Z)
- UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units [64.61596752343837]
We present a novel two-pass direct S2ST architecture, UnitY, which first generates textual representations and predicts discrete acoustic units.
We enhance the model performance by subword prediction in the first-pass decoder.
We show that the proposed methods boost the performance even when predicting spectrogram in the second pass.
arXiv Detail & Related papers (2022-12-15T18:58:28Z)
- Denoising Diffusion Error Correction Codes [92.10654749898927]
Recently, neural decoders have demonstrated their advantage over classical decoding techniques.
Recent state-of-the-art neural decoders suffer from high complexity and lack the important iterative scheme characteristic of many legacy decoders.
We propose to employ denoising diffusion models for the soft decoding of linear codes at arbitrary block lengths.
arXiv Detail & Related papers (2022-09-16T11:00:50Z)
- Efficient VVC Intra Prediction Based on Deep Feature Fusion and Probability Estimation [57.66773945887832]
We propose to optimize Versatile Video Coding (VVC) complexity at intra-frame prediction, with a two-stage framework of deep feature fusion and probability estimation.
Experimental results on standard database demonstrate the superiority of proposed method, especially for High Definition (HD) and Ultra-HD (UHD) video sequences.
arXiv Detail & Related papers (2022-05-07T08:01:32Z)
- End-to-end optimized image compression with competition of prior distributions [29.585370305561582]
We propose a compression scheme that uses a single convolutional autoencoder and multiple learned prior distributions.
Our method offers rate-distortion performance comparable to that obtained with a predicted parametrized prior.
arXiv Detail & Related papers (2021-11-17T15:04:01Z)
- End-to-end Neural Video Coding Using a Compound Spatiotemporal Representation [33.54844063875569]
We propose a hybrid motion compensation (HMC) method that adaptively combines the predictions generated by two approaches.
Specifically, we generate a compound spatiotemporal representation (CSTR) through a recurrent information aggregation (RIA) module.
We further design a one-to-many decoder pipeline to generate multiple predictions from the CSTR, including vector-based resampling, adaptive kernel-based resampling, compensation mode selection maps and texture enhancements.
arXiv Detail & Related papers (2021-08-05T19:43:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.