On the Use of Audio Fingerprinting Features for Speech Enhancement with
Generative Adversarial Network
- URL: http://arxiv.org/abs/2007.13258v1
- Date: Mon, 27 Jul 2020 00:44:16 GMT
- Title: On the Use of Audio Fingerprinting Features for Speech Enhancement with
Generative Adversarial Network
- Authors: Farnood Faraji, Yazid Attabi, Benoit Champagne and Wei-Ping Zhu
- Abstract summary: Time-frequency domain features, such as the Short-Term Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCC) are preferred in many approaches.
While the MFCC provide for a compact representation, they ignore the dynamics and distribution of energy in each mel-scale subband.
In this work, a speech enhancement system based on a Generative Adversarial Network (GAN) is implemented and tested with a combination of Audio FingerPrinting (AFP) features and the Normalized Spectral Subband Centroids (NSSC).
- Score: 24.287237963000745
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The advent of learning-based methods in speech enhancement has revived the
need for robust and reliable training features that can compactly represent
speech signals while preserving their vital information. Time-frequency domain
features, such as the Short-Term Fourier Transform (STFT) and Mel-Frequency
Cepstral Coefficients (MFCC), are preferred in many approaches. While the MFCC
provide for a compact representation, they ignore the dynamics and distribution
of energy in each mel-scale subband. In this work, a speech enhancement system
based on Generative Adversarial Network (GAN) is implemented and tested with a
combination of Audio FingerPrinting (AFP) features obtained from the MFCC and
the Normalized Spectral Subband Centroids (NSSC). The NSSC capture the
locations of speech formants and complement the MFCC in a crucial way. In
experiments with diverse speakers and noise types, GAN-based speech enhancement
with the proposed AFP feature combination achieves the best objective
performance while reducing memory requirements and training time.
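As a rough illustration of the feature combination described in the abstract, the sketch below computes MFCCs together with normalized spectral subband centroids from the same mel filterbank and stacks them into one feature matrix. It is a minimal sketch, not the authors' implementation: the use of librosa, the frame and filterbank parameters, and the centroid normalization by the Nyquist frequency are all assumptions.

```python
import numpy as np
import librosa  # assumption: any STFT/mel toolkit would serve the same purpose

def afp_features(wav, sr=16000, n_fft=512, hop=256, n_mels=32, n_mfcc=13):
    """Stack MFCCs with Normalized Spectral Subband Centroids (NSSC).

    Illustrative parameter values only; not the paper's configuration.
    Returns an array of shape (n_mfcc + n_mels, n_frames).
    """
    # Power spectrogram and mel filterbank shared by both feature sets
    S = np.abs(librosa.stft(wav, n_fft=n_fft, hop_length=hop)) ** 2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)

    # MFCCs from the log mel-band energies
    mel_energy = mel_fb @ S
    mfcc = librosa.feature.mfcc(S=librosa.power_to_db(mel_energy), n_mfcc=n_mfcc)

    # Subband centroid: energy-weighted mean frequency within each mel band,
    # normalized by the Nyquist frequency so values lie in [0, 1]
    num = (mel_fb * freqs[None, :]) @ S
    nssc = (num / (mel_energy + 1e-10)) / (sr / 2)

    return np.vstack([mfcc, nssc])
```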
Related papers
- Advanced Clustering Techniques for Speech Signal Enhancement: A Review and Metanalysis of Fuzzy C-Means, K-Means, and Kernel Fuzzy C-Means Methods [0.6530047924748276]
Speech signal processing is tasked with improving the clarity and comprehensibility of audio data in noisy environments.
The quality of speech recognition directly impacts user experience and accessibility in technology-driven communication.
This review paper explores advanced clustering techniques, particularly focusing on the Kernel Fuzzy C-Means (KFCM) method.
arXiv Detail & Related papers (2024-09-28T20:21:05Z) - SpeechGPT-Gen: Scaling Chain-of-Information Speech Generation [56.913182262166316]
Chain-of-Information Generation (CoIG) is a method for decoupling semantic and perceptual information in large-scale speech generation.
SpeechGPT-Gen is efficient in semantic and perceptual information modeling.
It markedly excels in zero-shot text-to-speech, zero-shot voice conversion, and speech-to-speech dialogue.
arXiv Detail & Related papers (2024-01-24T15:25:01Z) - RTFS-Net: Recurrent Time-Frequency Modelling for Efficient Audio-Visual Speech Separation [18.93255531121519]
We present a novel time-frequency domain audio-visual speech separation method.
RTFS-Net applies its algorithms on the complex time-frequency bins yielded by the Short-Time Fourier Transform.
This is the first time-frequency domain audio-visual speech separation method to outperform all contemporary time-domain counterparts.
arXiv Detail & Related papers (2023-09-29T12:38:00Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Frequency-centroid features for word recognition of non-native English
speakers [1.9249287163937974]
The aim of this work is to investigate complementary features which can aid the quintessential Mel frequency cepstral coefficients (MFCCs).
FCs encapsulate the spectral centres of the different bands of the speech spectrum, with the bands defined by the Mel filterbank.
A two-stage Convolutional Neural Network (CNN) is used to model the features of the English words uttered with Arabic, French and Spanish accents.
arXiv Detail & Related papers (2022-06-14T21:19:49Z) - CMGAN: Conformer-based Metric GAN for Speech Enhancement [6.480967714783858]
We propose a conformer-based metric generative adversarial network (CMGAN) for speech enhancement in the time-frequency domain.
In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information.
The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech.
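A minimal sketch of the decoupled-then-joint reconstruction idea summarized above, assuming the decoder produces a real-valued magnitude mask and a residual complex spectrogram; the function name and combination rule are illustrative, not CMGAN's exact formulation.

```python
import numpy as np

def reconstruct_enhanced_stft(noisy_stft, mag_mask, complex_residual):
    """Combine decoupled magnitude and complex estimates into one STFT.

    noisy_stft, complex_residual: complex arrays of shape (freq, frames);
    mag_mask: real-valued array of the same shape (illustrative only).
    """
    # Magnitude branch: mask the noisy magnitude and reuse the noisy phase
    masked = mag_mask * np.abs(noisy_stft) * np.exp(1j * np.angle(noisy_stft))
    # Joint incorporation: add the complex-spectrogram refinement
    return masked + complex_residual
```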
arXiv Detail & Related papers (2022-03-28T23:53:34Z) - Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition.
The two proposed systems achieve word error rates (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task.
Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves a relative WER reduction of 15.20% and 33.53%, respectively.
arXiv Detail & Related papers (2022-03-25T15:04:51Z) - MFA: TDNN with Multi-scale Frequency-channel Attention for
Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z) - Time-domain Speech Enhancement with Generative Adversarial Learning [53.74228907273269]
This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN).
TSEGAN is an extension of the generative adversarial network (GAN) in the time domain, with metric evaluation to mitigate the scaling problem.
In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
arXiv Detail & Related papers (2021-03-30T08:09:49Z) - Data Fusion for Audiovisual Speaker Localization: Extending Dynamic
Stream Weights to the Spatial Domain [103.3388198420822]
Estimating the positions of multiple speakers can be helpful for tasks like automatic speech recognition or speaker diarization.
This paper proposes a novel audiovisual data fusion framework for speaker localization by assigning individual dynamic stream weights to specific regions.
A performance evaluation using audiovisual recordings yields promising results, with the proposed fusion approach outperforming all baseline models.
arXiv Detail & Related papers (2021-02-23T09:59:31Z) - Gated Recurrent Fusion with Joint Training Framework for Robust
End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with a joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
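A minimal sketch of the gating idea behind dynamically combining the noisy and enhanced feature streams mentioned above; the element-wise sigmoid gate and the placeholder weights W_n, W_e, b are illustrative and do not reproduce the paper's gated recurrent fusion cell.

```python
import numpy as np

def gated_fusion(noisy_feat, enhanced_feat, W_n, W_e, b):
    """Mix noisy and enhanced feature frames with a learned sigmoid gate.

    noisy_feat, enhanced_feat: arrays of shape (frames, dim);
    W_n, W_e: (dim, dim) weight matrices, b: (dim,) bias -- placeholders
    for parameters that would be learned jointly with the ASR model.
    """
    gate = 1.0 / (1.0 + np.exp(-(noisy_feat @ W_n + enhanced_feat @ W_e + b)))
    # A gate value near 1 keeps the noisy stream; near 0 keeps the enhanced one
    return gate * noisy_feat + (1.0 - gate) * enhanced_feat
```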
arXiv Detail & Related papers (2020-11-09T08:52:05Z)