Phase Aware Speech Enhancement using Realisation of Complex-valued LSTM
- URL: http://arxiv.org/abs/2010.14122v1
- Date: Tue, 27 Oct 2020 08:16:58 GMT
- Title: Phase Aware Speech Enhancement using Realisation of Complex-valued LSTM
- Authors: Raktim Gautam Goswami, Sivaganesh Andhavarapu and K Sri Rama Murty
- Abstract summary: We propose a realisation of complex-valued long short-term memory (RCLSTM) network to estimate the complex ratio mask.
The proposed RCLSTM is designed to process complex-valued sequences using complex arithmetic.
Compared with real-value-based masking methods, the proposed RCLSTM improves on several objective measures, including perceptual evaluation of speech quality.
- Score: 4.047123840446361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most of the deep learning based speech enhancement (SE) methods rely on
estimating the magnitude spectrum of the clean speech signal from the observed
noisy speech signal, either by magnitude spectral masking or regression. These
methods reuse the noisy phase while synthesizing the time-domain waveform from
the estimated magnitude spectrum. However, there have been recent works
highlighting the importance of phase in SE. There was an attempt to estimate
the complex ratio mask taking phase into account using complex-valued
feed-forward neural network (FFNN). But FFNNs cannot capture the sequential
information essential for phase estimation. In this work, we propose a
realisation of complex-valued long short-term memory (RCLSTM) network to
estimate the complex ratio mask (CRM) using sequential information along time.
The proposed RCLSTM is designed to process the complex-valued sequences using
complex arithmetic, and hence it preserves the dependencies between the real
and imaginary parts of CRM and thereby the phase. The proposed method is
evaluated on the noisy speech mixtures formed from the Voice-Bank corpus and
DEMAND database. Compared with real-value-based masking methods, the
proposed RCLSTM improves on several objective measures, including
perceptual evaluation of speech quality (PESQ), where it improves by over
4.3%.
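The core ideas above can be sketched in a few lines: the complex ratio mask is applied to the noisy STFT by element-wise complex multiplication (so the mask alters both magnitude and phase), and a complex-valued layer is realised with real-valued operations by expanding the complex product into its real and imaginary parts. A minimal numpy sketch, with illustrative names and shapes (not the authors' code):

```python
import numpy as np

def complex_matmul(Wr, Wi, xr, xi):
    """Realise (Wr + i*Wi) @ (xr + i*xi) with real-valued matmuls:
    real part = Wr@xr - Wi@xi, imaginary part = Wr@xi + Wi@xr.
    This coupling between parts is what preserves phase information."""
    return Wr @ xr - Wi @ xi, Wr @ xi + Wi @ xr

def apply_crm(mask, noisy_stft):
    """Enhanced spectrum = complex ratio mask * noisy spectrum.
    The complex product modifies magnitude and phase jointly."""
    return mask * noisy_stft

# Toy example: a complex mask scales the magnitude and rotates the phase.
Y = np.array([1.0 + 1.0j])   # one noisy STFT bin
M = np.array([0.5 - 0.5j])   # one estimated CRM bin
S_hat = apply_crm(M, Y)      # (0.5 - 0.5j) * (1 + 1j) = 1 + 0j
```

A purely real-valued mask could only scale `|Y|` while reusing the noisy phase; the complex product is what lets the estimate correct the phase as well.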
Related papers
- Deep Reinforcement Learning for IRS Phase Shift Design in Spatiotemporally Correlated Environments [93.30657979626858]
We propose a deep actor-critic algorithm that accounts for channel correlations and destination motion.
We show that, when channels are temporally correlated, the inclusion of the SNR in the state representation interacts with function approximation in ways that inhibit convergence.
arXiv Detail & Related papers (2022-11-02T22:07:36Z)
- Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement [0.0]
This paper proposes a novel monaural speech enhancement system, consisting of a Feature Extraction Block (FEB), a Compensation Enhancement Block (ComEB) and a Mask Block (MB).
Experiments are conducted on the Librispeech dataset and results show that the proposed model obtains better performance than recent models in terms of ESTOI and PESQ scores.
arXiv Detail & Related papers (2022-10-26T06:42:19Z)
- CMGAN: Conformer-based Metric GAN for Speech Enhancement [6.480967714783858]
We propose a conformer-based metric generative adversarial network (CMGAN) for speech enhancement in the time-frequency domain.
In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information.
The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech.
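The decoupled-then-joint reconstruction described above can be sketched as follows: one decoder branch predicts a magnitude estimate (retaining the noisy phase), another predicts a complex term, and the two are combined before inverting to the waveform. This is a hedged sketch; the function names and the additive combination are assumptions for illustration, not CMGAN's exact formulation:

```python
import numpy as np

def reconstruct(noisy_stft, mag_mask, complex_refine):
    """Combine a masked-magnitude estimate (noisy phase reused) with a
    separately estimated complex term to form the enhanced spectrum."""
    # Magnitude branch: scale |Y| by the mask, keep the noisy phase.
    mag_est = mag_mask * np.abs(noisy_stft) * np.exp(1j * np.angle(noisy_stft))
    # Joint incorporation: the complex branch can correct the phase.
    return mag_est + complex_refine

# Toy example on a single STFT bin.
Y = np.array([2.0 * np.exp(1j * np.pi / 4)])
S_hat = reconstruct(Y, mag_mask=np.array([0.5]),
                    complex_refine=np.array([0.1 + 0.0j]))
```

Decoupling the two estimates lets the magnitude branch be supervised directly on spectral magnitude while the complex branch handles the residual, including phase error.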
arXiv Detail & Related papers (2022-03-28T23:53:34Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks [14.942060304734497]
Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
arXiv Detail & Related papers (2020-12-02T22:35:00Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean error azimuth of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
- Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals [7.219077740523682]
We introduce the multi-time-scale (MTS) method to create flexibility towards temporal variations when analyzing audio data.
We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using 4 datasets of different sizes.
arXiv Detail & Related papers (2020-03-06T12:28:04Z)
- Co-VeGAN: Complex-Valued Generative Adversarial Network for Compressive Sensing MR Image Reconstruction [8.856953486775716]
We propose a novel framework based on a complex-valued generative adversarial network (Co-VeGAN) to process complex-valued input, which enables high-quality reconstruction of CS-MR images.
arXiv Detail & Related papers (2020-02-24T20:28:49Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.