Deep Neural Mel-Subband Beamformer for In-car Speech Separation
- URL: http://arxiv.org/abs/2211.12590v1
- Date: Tue, 22 Nov 2022 21:11:26 GMT
- Title: Deep Neural Mel-Subband Beamformer for In-car Speech Separation
- Authors: Vinay Kothapally, Yong Xu, Meng Yu, Shi-Xiong Zhang, Dong Yu
- Abstract summary: We propose a DL-based mel-subband beamformer to perform speech separation in a car environment.
As opposed to conventional subband approaches, our framework uses a mel-scale based subband selection strategy.
We find that our proposed framework achieves better separation performance than all SB and FB approaches.
- Score: 44.58289679847228
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While current deep learning (DL)-based beamforming techniques have
proven effective for speech separation, they are often designed to process
narrow-band (NB) frequencies independently, which results in higher
computational costs and inference times and makes them unsuitable for
real-world use. In this paper, we propose a DL-based mel-subband
spatio-temporal beamformer to perform speech separation in a car environment
with reduced computational cost and inference time. As opposed to conventional
subband (SB) approaches, our framework uses a mel-scale based subband selection
strategy, which ensures fine-grained processing for lower frequencies, where
most of the speech formant structure is present, and coarse-grained processing
for higher frequencies. Robust frame-level beamforming weights are determined
recursively for each speaker location/zone in the car from the estimated
subband speech and noise covariance matrices. Furthermore, the proposed
framework also estimates and suppresses any echoes from the loudspeaker(s) by
using the echo reference signals. We compare the performance of our proposed
framework to several NB, SB, and full-band (FB) processing techniques in terms
of speech quality and recognition metrics. Based on experimental evaluations on
simulated and real-world recordings, we find that our proposed framework
achieves better separation performance than all SB and FB approaches and
approaches the performance of NB processing techniques while requiring a lower
computational cost.
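The two mechanisms the abstract leans on, mel-spaced subband grouping and recursively updated covariance matrices that yield frame-level beamforming weights, are concrete enough to sketch. The snippet below is an illustrative NumPy sketch, not the authors' implementation: the band count, the forgetting factor `alpha`, the DNN mask inputs, and the MVDR-style weight formula are all assumptions standing in for the paper's learned spatio-temporal beamformer, and the echo-suppression path is omitted.

```python
import numpy as np


def mel_subband_edges(n_fft=512, sr=16000, n_subbands=24):
    """Group the one-sided STFT bins into bands whose edges are equally
    spaced on the mel scale: narrow bands at low frequencies (fine-grained),
    wide bands at high frequencies (coarse-grained)."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_subbands + 1)
    bin_edges = np.round(mel_to_hz(mel_edges) / (sr / 2.0) * (n_fft // 2)).astype(int)
    bin_edges[-1] = n_fft // 2 + 1      # let the last band include the Nyquist bin
    return np.unique(bin_edges)         # merge duplicate edges at the low end


def recursive_mvdr_weights(Y, speech_mask, noise_mask, edges, alpha=0.95, ref=0):
    """Frame-level beamforming weights per mel subband, from recursively
    averaged speech/noise spatial covariance matrices.

    Y           : mixture STFT, shape (M mics, T frames, F bins), complex
    speech_mask : DNN speech mask, shape (T, F), values in [0, 1]
    noise_mask  : DNN noise mask,  shape (T, F), values in [0, 1]
    edges       : subband bin edges from mel_subband_edges()
    returns     : weights, shape (T, n_subbands, M), complex
    """
    M, T, F = Y.shape
    n_sb = len(edges) - 1
    reg = 1e-6 * np.eye(M)
    phi_s = np.tile(reg, (n_sb, 1, 1)).astype(complex)   # speech covariances
    phi_n = np.tile(reg, (n_sb, 1, 1)).astype(complex)   # noise covariances
    W = np.zeros((T, n_sb, M), dtype=complex)
    for t in range(T):
        for b in range(n_sb):
            lo, hi = edges[b], edges[b + 1]
            y = Y[:, t, lo:hi]                           # (M, bins in band)
            # Mask-weighted instantaneous covariances, pooled over the band,
            # folded into the running estimates with forgetting factor alpha.
            phi_s[b] = alpha * phi_s[b] + (1 - alpha) * (speech_mask[t, lo:hi] * y) @ y.conj().T
            phi_n[b] = alpha * phi_n[b] + (1 - alpha) * (noise_mask[t, lo:hi] * y) @ y.conj().T
            # MVDR-style weights: w = (Phi_n^-1 Phi_s / trace(Phi_n^-1 Phi_s)) e_ref
            num = np.linalg.solve(phi_n[b] + reg, phi_s[b])
            W[t, b] = num[:, ref] / (np.trace(num) + 1e-9)
    return W
```

With these illustrative defaults (16 kHz audio, 512-point FFT, 24 bands), the mel spacing assigns only one or two bins to each low-frequency band and dozens of bins to the top bands, which is the fine-to-coarse trade-off the paper motivates; the per-band recursion then requires one matrix solve per subband per frame rather than one per frequency bin, which is where the cost saving over NB processing comes from.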
Related papers
- Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising [15.152748065111194]
This paper describes speech enhancement for real-time automatic speech recognition in real environments.
It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming.
The performance of such a supervised approach, however, is drastically degraded under mismatched conditions.
arXiv Detail & Related papers (2024-10-30T08:32:47Z)
- A Lightweight and Real-Time Binaural Speech Enhancement Model with Spatial Cues Preservation [19.384404014248762]
Binaural speech enhancement aims to improve the speech quality and intelligibility of noisy signals received by hearing devices.
Existing methods often suffer from a compromise between noise reduction (NR) capacity and spatial cues preservation (SCP) accuracy.
We present a learning-based lightweight complex convolutional network (LBCCN) which excels in NR by filtering low-frequency bands and keeping the rest.
arXiv Detail & Related papers (2024-09-19T03:52:50Z)
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- Inference and Denoise: Causal Inference-based Neural Speech Enhancement [83.4641575757706]
This study addresses the speech enhancement (SE) task within the causal inference paradigm by modeling the noise presence as an intervention.
The proposed causal inference-based speech enhancement (CISE) separates clean and noisy frames in an intervened noisy speech using a noise detector and assigns both sets of frames to two mask-based enhancement modules (EMs) to perform noise-conditional SE.
arXiv Detail & Related papers (2022-11-02T15:03:50Z)
- Speech Enhancement and Dereverberation with Diffusion-based Generative Models [14.734454356396157]
We present a detailed overview of the diffusion process that is based on a differential equation.
We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates.
In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models.
arXiv Detail & Related papers (2022-08-11T13:55:12Z)
- Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments [21.493664174262737]
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset.
It helps a user understand conversations made in real noisy echoic environments (e.g., a cocktail party).
The method is used with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a speaker.
arXiv Detail & Related papers (2022-07-15T05:14:27Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from a multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)