HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech
Deep Features in Adversarial Networks
- URL: http://arxiv.org/abs/2006.05694v2
- Date: Mon, 21 Sep 2020 20:37:02 GMT
- Title: HiFi-GAN: High-Fidelity Denoising and Dereverberation Based on Speech
Deep Features in Adversarial Networks
- Authors: Jiaqi Su, Zeyu Jin, Adam Finkelstein
- Abstract summary: HiFi-GAN transforms recorded speech to sound as though it had been recorded in a studio.
It relies on the deep feature matching losses of the discriminators to improve the perceptual quality of enhanced speech.
It significantly outperforms state-of-the-art baseline methods in both objective and subjective experiments.
- Score: 29.821666380496637
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world audio recordings are often degraded by factors such as noise,
reverberation, and equalization distortion. This paper introduces HiFi-GAN, a
deep learning method to transform recorded speech to sound as though it had
been recorded in a studio. We use an end-to-end feed-forward WaveNet
architecture, trained with multi-scale adversarial discriminators in both the
time domain and the time-frequency domain. It relies on the deep feature
matching losses of the discriminators to improve the perceptual quality of
enhanced speech. The proposed model generalizes well to new speakers, new
speech content, and new environments. It significantly outperforms
state-of-the-art baseline methods in both objective and subjective experiments.
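For readers who want to see what a deep feature matching loss looks like in practice, below is a minimal sketch in PyTorch. It uses a single toy time-domain discriminator for brevity (the abstract describes multiple discriminators at several scales in both the time and time-frequency domains); the module sizes, L1 distance, and averaging are illustrative assumptions, not the exact HiFi-GAN configuration.

```python
# Minimal sketch of a deep feature matching loss, assuming a PyTorch
# discriminator that exposes its intermediate feature maps. Layer sizes,
# the L1 distance, and the averaging are illustrative placeholders, not
# the exact HiFi-GAN configuration.
import torch
import torch.nn as nn


class WaveDiscriminator(nn.Module):
    """Toy time-domain discriminator that returns per-layer feature maps."""

    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv1d(1, 16, kernel_size=15, stride=2, padding=7),
            nn.Conv1d(16, 32, kernel_size=15, stride=2, padding=7),
            nn.Conv1d(32, 1, kernel_size=3, padding=1),
        ])

    def forward(self, x):
        features = []
        for layer in self.layers[:-1]:
            x = torch.relu(layer(x))
            features.append(x)
        score = self.layers[-1](x)  # real/fake score map
        return score, features


def feature_matching_loss(disc, enhanced, clean):
    """L1 distance between discriminator activations on enhanced vs. clean speech."""
    _, feats_enh = disc(enhanced)
    with torch.no_grad():  # clean-speech features serve as fixed targets
        _, feats_ref = disc(clean)
    return sum(torch.mean(torch.abs(fe - fr))
               for fe, fr in zip(feats_enh, feats_ref)) / len(feats_enh)


if __name__ == "__main__":
    disc = WaveDiscriminator()
    enhanced = torch.randn(4, 1, 16000)  # batch of 1-second waveforms at 16 kHz
    clean = torch.randn(4, 1, 16000)
    print(feature_matching_loss(disc, enhanced, clean))
```

In a full adversarial setup this term would be summed over all discriminators and added to the generator's adversarial loss.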
Related papers
- FINALLY: fast and universal speech enhancement with studio-like quality [7.207284147264852]
We address the challenge of speech enhancement in real-world recordings, which often contain various forms of distortion.
We study various feature extractors for perceptual loss to facilitate the stability of adversarial training.
We integrate WavLM-based perceptual loss into MS-STFT adversarial training pipeline, creating an effective and stable training procedure for the speech enhancement model.
arXiv Detail & Related papers (2024-10-08T11:16:03Z)
- Audio-Visual Speech Enhancement with Score-Based Generative Models [22.559617939136505]
This paper introduces an audio-visual speech enhancement system that leverages score-based generative models.
We exploit audio-visual embeddings obtained from a self-supervised learning model that has been fine-tuned on lipreading.
Experimental evaluations show that the proposed audio-visual speech enhancement system yields improved speech quality.
arXiv Detail & Related papers (2023-06-02T10:43:42Z)
- End-to-End Binaural Speech Synthesis [71.1869877389535]
We present an end-to-end speech synthesis system that combines a low-bitrate audio system with a powerful decoder.
We demonstrate the capability of the adversarial loss in capturing environment effects needed to create an authentic auditory scene.
arXiv Detail & Related papers (2022-07-08T05:18:36Z)
- Universal Speech Enhancement with Score-based Diffusion [21.294665965300922]
We present a universal speech enhancement system that tackles 55 different distortions at the same time.
Our approach consists of a generative model that employs score-based diffusion, together with a multi-resolution conditioning network.
We show that this approach significantly outperforms the state of the art in a subjective test performed by expert listeners.
arXiv Detail & Related papers (2022-06-07T07:32:32Z)
- Improving Distortion Robustness of Self-supervised Speech Processing Tasks with Domain Adaptation [60.26511271597065]
Speech distortions are a long-standing problem that degrades the performance of speech processing models trained with supervision.
Improving the robustness of these models is therefore necessary for them to maintain good performance when encountering distorted speech.
arXiv Detail & Related papers (2022-03-30T07:25:52Z)
- Learning Audio-Visual Dereverberation [87.52880019747435]
Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition.
Our idea is to learn to dereverberate speech from audio-visual observations.
We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene.
arXiv Detail & Related papers (2021-06-14T20:01:24Z)
- Cross-domain Adaptation with Discrepancy Minimization for Text-independent Forensic Speaker Verification [61.54074498090374]
This study introduces a CRSS-Forensics audio dataset collected in multiple acoustic environments.
We pre-train a CNN-based network using the VoxCeleb data, followed by an approach which fine-tunes part of the high-level network layers with clean speech from CRSS-Forensics.
arXiv Detail & Related papers (2020-09-05T02:54:33Z)
- Knowing What to Listen to: Early Attention for Deep Speech Representation Learning [25.71206255965502]
We propose the novel Fine-grained Early Frequency Attention (FEFA) for speech signals.
This model is capable of focusing on information items as small as frequency bins.
We evaluate the proposed model on two popular tasks of speaker recognition and speech emotion recognition.
arXiv Detail & Related papers (2020-09-03T17:40:27Z)
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at achieving two goals: a) improve the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reduce the system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture; a generic sketch of a direction-informed feature follows this list.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
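As a concrete illustration of what "direction informed" can mean for multi-channel target speech extraction, the sketch below compares the observed inter-channel phase difference (IPD) against the phase delay expected for a target direction in a two-microphone far-field setup. The array geometry, STFT settings, and cosine-similarity form are generic assumptions for illustration, not necessarily the exact feature set used in the paper above.

```python
# Hedged sketch of a direction-informed input feature: compare the observed
# inter-channel phase difference (IPD) with the phase delay expected from the
# target direction. Geometry and FFT settings are illustrative assumptions.
import numpy as np

SR = 16000          # sample rate (Hz)
N_FFT = 512         # FFT size
C = 343.0           # speed of sound (m/s)
MIC_SPACING = 0.05  # distance between the two microphones (m)


def directional_feature(stft_ch0, stft_ch1, target_angle_rad):
    """Cosine similarity between observed IPD and the target-direction phase delay.

    stft_ch0, stft_ch1: complex STFTs of shape (freq_bins, frames).
    Returns a (freq_bins, frames) map close to 1 where a time-frequency bin is
    dominated by sound arriving from the target direction.
    """
    freqs = np.fft.rfftfreq(N_FFT, d=1.0 / SR)               # (freq_bins,)
    tdoa = MIC_SPACING * np.cos(target_angle_rad) / C         # time difference of arrival
    target_phase = 2 * np.pi * freqs * tdoa                   # expected phase delay per bin
    observed_ipd = np.angle(stft_ch0) - np.angle(stft_ch1)    # (freq_bins, frames)
    return np.cos(observed_ipd - target_phase[:, None])


if __name__ == "__main__":
    bins, frames = N_FFT // 2 + 1, 100
    x0 = np.random.randn(bins, frames) + 1j * np.random.randn(bins, frames)
    x1 = np.random.randn(bins, frames) + 1j * np.random.randn(bins, frames)
    print(directional_feature(x0, x1, np.deg2rad(60)).shape)  # (257, 100)
```

Such a feature map is typically stacked with spectral features and fed to the separation network so that the model knows which speaker to extract.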
This list is automatically generated from the titles and abstracts of the papers in this site.