mdctGAN: Taming transformer-based GAN for speech super-resolution with
Modified DCT spectra
- URL: http://arxiv.org/abs/2305.11104v2
- Date: Fri, 19 May 2023 07:26:43 GMT
- Title: mdctGAN: Taming transformer-based GAN for speech super-resolution with
Modified DCT spectra
- Authors: Chenhao Shuai, Chaohua Shi, Lu Gan and Hongqing Liu
- Abstract summary: Speech super-resolution (SSR) aims to recover a high resolution (HR) speech from its corresponding low resolution (LR) counterpart.
Recent SSR methods focus more on the reconstruction of the magnitude spectrogram, ignoring the importance of phase reconstruction.
We propose mdctGAN, a novel SSR framework based on modified discrete cosine transform (MDCT)
- Score: 4.721572768262729
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech super-resolution (SSR) aims to recover a high resolution (HR) speech
from its corresponding low resolution (LR) counterpart. Recent SSR methods
focus more on the reconstruction of the magnitude spectrogram, ignoring the
importance of phase reconstruction, thereby limiting the recovery quality. To
address this issue, we propose mdctGAN, a novel SSR framework based on modified
discrete cosine transform (MDCT). By adversarial learning in the MDCT domain,
our method reconstructs HR speeches in a phase-aware manner without vocoders or
additional post-processing. Furthermore, by learning frequency consistent
features with self-attentive mechanism, mdctGAN guarantees a high quality
speech reconstruction. For VCTK corpus dataset, the experiment results show
that our model produces natural auditory quality with high MOS and PESQ scores.
It also achieves the state-of-the-art log-spectral-distance (LSD) performance
on 48 kHz target resolution from various input rates. Code is available from
https://github.com/neoncloud/mdctGAN
Related papers
- Wave-U-Mamba: An End-To-End Framework For High-Quality And Efficient Speech Super Resolution [4.495657539150699]
Speech Super-Resolution (SSR) is a task of enhancing low-resolution speech signals by restoring missing high-frequency components.
Conventional approaches typically reconstruct log-mel features, followed by a vocoder that generates high-resolution speech in the waveform domain.
We propose a method, referred to as Wave-U-Mamba, that directly performs SSR in time domain.
arXiv Detail & Related papers (2024-09-14T06:52:00Z) - TC-KANRecon: High-Quality and Accelerated MRI Reconstruction via Adaptive KAN Mechanisms and Intelligent Feature Scaling [7.281993256973667]
This study presents an innovative conditional guided diffusion model, named as TC-KANRecon.
It incorporates the Multi-Free U-KAN (MF-UKAN) module and a dynamic clipping strategy.
Experimental results demonstrate that the proposed method outperforms other MRI reconstruction methods in both qualitative and quantitative evaluations.
arXiv Detail & Related papers (2024-08-11T06:31:56Z) - Speech enhancement with frequency domain auto-regressive modeling [34.55703785405481]
Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation.
We propose a unified framework of speech dereverberation for improving the speech quality and the automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2023-09-24T03:25:51Z) - Learning Detail-Structure Alternative Optimization for Blind
Super-Resolution [69.11604249813304]
We propose an effective and kernel-free network, namely DSSR, which enables recurrent detail-structure alternative optimization without blur kernel prior incorporation for blind SR.
In our DSSR, a detail-structure modulation module (DSMM) is built to exploit the interaction and collaboration of image details and structures.
Our method achieves the state-of-the-art against existing methods.
arXiv Detail & Related papers (2022-12-03T14:44:17Z) - Towards Improved Room Impulse Response Estimation for Speech Recognition [53.04440557465013]
We propose a novel approach for blind room impulse response (RIR) estimation systems in the context of far-field automatic speech recognition (ASR)
We first draw the connection between improved RIR estimation and improved ASR performance, as a means of evaluating neural RIR estimators.
We then propose a generative adversarial network (GAN) based architecture that encodes RIR features from reverberant speech and constructs an RIR from the encoded features.
arXiv Detail & Related papers (2022-11-08T00:40:27Z) - CMGAN: Conformer-based Metric GAN for Speech Enhancement [6.480967714783858]
We propose a conformer-based metric generative adversarial network (CMGAN) for time-frequency domain.
In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information.
The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech.
arXiv Detail & Related papers (2022-03-28T23:53:34Z) - Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolution and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z) - HDNet: High-resolution Dual-domain Learning for Spectral Compressive
Imaging [138.04956118993934]
We propose a high-resolution dual-domain learning network (HDNet) for HSI reconstruction.
On the one hand, the proposed HR spatial-spectral attention module with its efficient feature fusion provides continuous and fine pixel-level features.
On the other hand, frequency domain learning (FDL) is introduced for HSI reconstruction to narrow the frequency domain discrepancy.
arXiv Detail & Related papers (2022-03-04T06:37:45Z) - ReconFormer: Accelerated MRI Reconstruction Using Recurrent Transformer [60.27951773998535]
We propose a recurrent transformer model, namely textbfReconFormer, for MRI reconstruction.
It can iteratively reconstruct high fertility magnetic resonance images from highly under-sampled k-space data.
We show that it achieves significant improvements over the state-of-the-art methods with better parameter efficiency.
arXiv Detail & Related papers (2022-01-23T21:58:19Z) - Discretization and Re-synthesis: an alternative method to solve the
Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
By utilizing the synthesis model with the input of discrete symbols, after the prediction of discrete symbol sequence, each target speech could be re-synthesized.
arXiv Detail & Related papers (2021-12-17T08:35:40Z) - FreqNet: A Frequency-domain Image Super-Resolution Network with Dicrete
Cosine Transform [16.439669339293747]
Single image super-resolution(SISR) is an ill-posed problem that aims to obtain high-resolution (HR) output from low-resolution (LR) input.
Despite the high peak signal-to-noise ratios(PSNR) results, it is difficult to determine whether the model correctly adds desired high-frequency details.
We propose FreqNet, an intuitive pipeline from the frequency domain perspective, to solve this problem.
arXiv Detail & Related papers (2021-11-21T11:49:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.