Phase Aware Speech Enhancement using Realisation of Complex-valued LSTM
- URL: http://arxiv.org/abs/2010.14122v1
- Date: Tue, 27 Oct 2020 08:16:58 GMT
- Title: Phase Aware Speech Enhancement using Realisation of Complex-valued LSTM
- Authors: Raktim Gautam Goswami, Sivaganesh Andhavarapu and K Sri Rama Murty
- Abstract summary: We propose a realisation of complex-valued long short-term memory (RCLSTM) network to estimate the complex ratio mask.
The proposed RCLSTM is designed to process complex-valued sequences using complex arithmetic.
Compared with real-value-based masking methods, the proposed RCLSTM improves on several objective measures, including perceptual evaluation of speech quality.
- Score: 4.047123840446361
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most of the deep learning based speech enhancement (SE) methods rely on
estimating the magnitude spectrum of the clean speech signal from the observed
noisy speech signal, either by magnitude spectral masking or regression. These
methods reuse the noisy phase while synthesizing the time-domain waveform from
the estimated magnitude spectrum. However, there have been recent works
highlighting the importance of phase in SE. There was an attempt to estimate
the complex ratio mask taking phase into account using complex-valued
feed-forward neural network (FFNN). But FFNNs cannot capture the sequential
information essential for phase estimation. In this work, we propose a
realisation of complex-valued long short-term memory (RCLSTM) network to
estimate the complex ratio mask (CRM) using sequential information along time.
The proposed RCLSTM is designed to process the complex-valued sequences using
complex arithmetic, and hence it preserves the dependencies between the real
and imaginary parts of CRM and thereby the phase. The proposed method is
evaluated on the noisy speech mixtures formed from the Voice-Bank corpus and
DEMAND database. Compared with real-value-based masking methods, the
proposed RCLSTM improves on several objective measures, including
perceptual evaluation of speech quality (PESQ), where it improves by over
4.3%.
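The core ideas above can be sketched in a few lines: the complex ratio mask is applied to the noisy STFT by element-wise complex multiplication (so the mask alters both magnitude and phase), and a complex-valued layer is realised with real-valued operations by expanding the complex product into its real and imaginary parts. A minimal numpy sketch, with illustrative names and shapes (not the authors' code):

```python
import numpy as np

def complex_matmul(Wr, Wi, xr, xi):
    """Realise (Wr + i*Wi) @ (xr + i*xi) with real-valued matmuls:
    real part = Wr@xr - Wi@xi, imaginary part = Wr@xi + Wi@xr.
    This coupling between parts is what preserves phase information."""
    return Wr @ xr - Wi @ xi, Wr @ xi + Wi @ xr

def apply_crm(mask, noisy_stft):
    """Enhanced spectrum = complex ratio mask * noisy spectrum.
    The complex product modifies magnitude and phase jointly."""
    return mask * noisy_stft

# Toy example: a complex mask scales the magnitude and rotates the phase.
Y = np.array([1.0 + 1.0j])   # one noisy STFT bin
M = np.array([0.5 - 0.5j])   # one estimated CRM bin
S_hat = apply_crm(M, Y)      # (0.5 - 0.5j) * (1 + 1j) = 1 + 0j
```

A purely real-valued mask could only scale `|Y|` while reusing the noisy phase; the complex product is what lets the estimate correct the phase as well.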
Related papers
- Deep Reinforcement Learning for IRS Phase Shift Design in Spatiotemporally Correlated Environments [93.30657979626858]
We propose a deep actor-critic algorithm that accounts for channel correlations and destination motion.
We show that, when channels are temporally correlated, the inclusion of the SNR in the state representation interacts with function approximation in ways that inhibit convergence.
arXiv Detail & Related papers (2022-11-02T22:07:36Z)
- Parallel Gated Neural Network With Attention Mechanism For Speech Enhancement [0.0]
This paper proposes a novel monaural speech enhancement system, consisting of a Feature Extraction Block (FEB), a Compensation Enhancement Block (ComEB) and a Mask Block (MB).
Experiments are conducted on the Librispeech dataset and results show that the proposed model obtains better performance than recent models in terms of ESTOI and PESQ scores.
arXiv Detail & Related papers (2022-10-26T06:42:19Z)
- CMGAN: Conformer-based Metric GAN for Speech Enhancement [6.480967714783858]
We propose a conformer-based metric generative adversarial network (CMGAN) for speech enhancement in the time-frequency domain.
In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information.
The estimation of magnitude and complex spectrogram is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech.
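The decoupled-then-joint reconstruction described above can be sketched as follows: one decoder branch predicts a magnitude estimate (retaining the noisy phase), another predicts a complex term, and the two are combined before inverting to the waveform. This is a hedged sketch; the function names and the additive combination are assumptions for illustration, not CMGAN's exact formulation:

```python
import numpy as np

def reconstruct(noisy_stft, mag_mask, complex_refine):
    """Combine a masked-magnitude estimate (noisy phase reused) with a
    separately estimated complex term to form the enhanced spectrum."""
    # Magnitude branch: scale |Y| by the mask, keep the noisy phase.
    mag_est = mag_mask * np.abs(noisy_stft) * np.exp(1j * np.angle(noisy_stft))
    # Joint incorporation: the complex branch can correct the phase.
    return mag_est + complex_refine

# Toy example on a single STFT bin.
Y = np.array([2.0 * np.exp(1j * np.pi / 4)])
S_hat = reconstruct(Y, mag_mask=np.array([0.5]),
                    complex_refine=np.array([0.1 + 0.0j]))
```

Decoupling the two estimates lets the magnitude branch be supervised directly on spectral magnitude while the complex branch handles the residual, including phase error.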
arXiv Detail & Related papers (2022-03-28T23:53:34Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Improved Speech Emotion Recognition using Transfer Learning and Spectrogram Augmentation [56.264157127549446]
Speech emotion recognition (SER) is a challenging task that plays a crucial role in natural human-computer interaction.
One of the main challenges in SER is data scarcity.
We propose a transfer learning strategy combined with spectrogram augmentation.
arXiv Detail & Related papers (2021-08-05T10:39:39Z)
- Improved MVDR Beamforming Using LSTM Speech Models to Clean Spatial Clustering Masks [14.942060304734497]
Spatial clustering techniques can achieve significant multi-channel noise reduction across relatively arbitrary microphone configurations.
LSTM neural networks have successfully been trained to recognize speech from noise on single-channel inputs, but have difficulty taking full advantage of the information in multi-channel recordings.
This paper integrates these two approaches, training LSTM speech models to clean the masks generated by the Model-based EM Source Separation and Localization (MESSL) spatial clustering method.
arXiv Detail & Related papers (2020-12-02T22:35:00Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experiment results show a mean error azimuth of 13 degrees, which surpasses the accuracy of the other biologically plausible neuromorphic approach for sound source localization.
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
- Multi-Time-Scale Convolution for Emotion Recognition from Speech Audio Signals [7.219077740523682]
We introduce the multi-time-scale (MTS) method to create flexibility towards temporal variations when analyzing audio data.
We evaluate MTS and standard convolutional layers in different architectures for emotion recognition from speech audio, using 4 datasets of different sizes.
arXiv Detail & Related papers (2020-03-06T12:28:04Z)
- Co-VeGAN: Complex-Valued Generative Adversarial Network for Compressive Sensing MR Image Reconstruction [8.856953486775716]
We propose a novel framework based on a complex-valued generative adversarial network (Co-VeGAN) to process complex-valued input, which enables high-quality reconstruction of CS-MR images.
arXiv Detail & Related papers (2020-02-24T20:28:49Z) - Temporal-Spatial Neural Filter: Direction Informed End-to-End
Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.