Direction-Aware Adaptive Online Neural Speech Enhancement with an
Augmented Reality Headset in Real Noisy Conversational Environments
- URL: http://arxiv.org/abs/2207.07296v1
- Date: Fri, 15 Jul 2022 05:14:27 GMT
- Title: Direction-Aware Adaptive Online Neural Speech Enhancement with an
Augmented Reality Headset in Real Noisy Conversational Environments
- Authors: Kouhei Sekiguchi, Aditya Arie Nugraha, Yicheng Du, Yoshiaki Bando,
Mathieu Fontaine, Kazuyoshi Yoshii
- Abstract summary: This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset.
It helps a user understand conversations held in real noisy echoic environments (e.g., a cocktail party).
The method is used with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a speaker.
- Score: 21.493664174262737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the practical response- and performance-aware
development of online speech enhancement for an augmented reality (AR) headset
that helps a user understand conversations held in real noisy echoic
environments (e.g., cocktail party). One may use a state-of-the-art blind
source separation method called fast multichannel nonnegative matrix
factorization (FastMNMF) that works well in various environments thanks to its
unsupervised nature. Its heavy computational cost, however, prevents its
application to real-time processing. In contrast, a supervised beamforming
method that uses a deep neural network (DNN) for estimating spatial information
of speech and noise readily fits real-time processing, but suffers from drastic
performance degradation in mismatched conditions. Given such complementary
characteristics, we propose a dual-process robust online speech enhancement
method based on DNN-based beamforming with FastMNMF-guided adaptation. FastMNMF
(back end) is performed in a mini-batch style and the noisy and enhanced speech
pairs are used together with the original parallel training data for updating
the direction-aware DNN (front end) with backpropagation at a
computationally-allowable interval. This method is combined with a blind
dereverberation method called weighted prediction error (WPE) to transcribe,
in a streaming manner, the noisy reverberant speech of a speaker, who can be
detected from video or selected by a user's hand gesture or eye gaze; the
transcriptions are then displayed spatially with an AR technique. Our experiment
showed that the word error rate was improved by more than 10 points with the
run-time adaptation using only twelve minutes of observation.
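As a rough illustration of the dual-process design, the sketch below pairs a tiny mask-estimation network (front end) with a slow back-end separator that periodically turns buffered observations into pseudo-targets for backpropagation. Every name and size here (`MaskEstimator`, `backend_separate`, the dummy separation rule) is an illustrative placeholder rather than the authors' implementation; in the real system the back end would be mini-batch FastMNMF.

```python
import torch
import torch.nn as nn

N_FREQ = 257  # STFT magnitude bins (e.g., a 512-point FFT)

class MaskEstimator(nn.Module):
    """Tiny stand-in for the direction-aware front-end DNN."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(N_FREQ, 256), nn.ReLU(),
            nn.Linear(256, N_FREQ), nn.Sigmoid())

    def forward(self, mag):              # mag: (frames, N_FREQ)
        return self.net(mag)             # time-frequency mask in [0, 1]

def backend_separate(noisy_mag):
    """Placeholder for the mini-batch back end (FastMNMF in the paper)."""
    return noisy_mag * 0.8               # dummy "enhanced" pseudo-target

model = MaskEstimator()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
parallel = [(torch.rand(16, N_FREQ), torch.rand(16, N_FREQ))
            for _ in range(4)]           # original (noisy, clean) pairs
buffer = []                              # noisy frames observed at run time

for step in range(1, 101):
    frame = torch.rand(1, N_FREQ)        # incoming noisy magnitude frame
    with torch.no_grad():
        enhanced = model(frame) * frame  # real-time front-end inference
    buffer.append(frame)

    if step % 50 == 0:                   # computationally allowable interval
        noisy = torch.cat(buffer)
        buffer.clear()
        pseudo = backend_separate(noisy)           # back-end pseudo-targets
        for x, t in parallel + [(noisy, pseudo)]:  # mix old and new pairs
            loss = nn.functional.mse_loss(model(x) * x, t)
            opt.zero_grad()
            loss.backward()
            opt.step()
```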
Related papers
- Run-Time Adaptation of Neural Beamforming for Robust Speech Dereverberation and Denoising [15.152748065111194]
This paper describes speech enhancement for real-time automatic speech recognition in real environments.
It estimates the masks of clean dry speech from a noisy echoic mixture spectrogram with a deep neural network (DNN) and then computes an enhancement filter used for beamforming.
The performance of such a supervised approach, however, is drastically degraded under mismatched conditions.
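As a concrete, if simplified, picture of this mask-then-beamform pipeline, the sketch below computes a Souden-style MVDR filter from speech and noise masks; the masks and the multichannel STFT are random placeholders standing in for the DNN outputs, and `mvdr_from_masks` is a hypothetical helper rather than code from the paper.

```python
import numpy as np

def mvdr_from_masks(X, speech_mask, noise_mask, ref_mic=0):
    """Souden-style MVDR from masks. X: (mics, frames, freqs) complex STFT;
    masks: (frames, freqs) in [0, 1]."""
    M, T, F = X.shape
    w = np.zeros((F, M), dtype=complex)
    for f in range(F):
        Xf = X[:, :, f]                                   # (M, T)
        Rs = (speech_mask[:, f] * Xf) @ Xf.conj().T / T   # speech covariance
        Rn = (noise_mask[:, f] * Xf) @ Xf.conj().T / T    # noise covariance
        Rn += 1e-6 * np.eye(M)                            # diagonal loading
        num = np.linalg.solve(Rn, Rs)                     # Rn^{-1} Rs
        w[f] = num[:, ref_mic] / np.trace(num)            # MVDR (Souden et al.)
    return w

M, T, F = 4, 100, 257
X = np.random.randn(M, T, F) + 1j * np.random.randn(M, T, F)
speech_mask = np.random.rand(T, F)        # placeholder for DNN mask output
noise_mask = 1.0 - speech_mask
w = mvdr_from_masks(X, speech_mask, noise_mask)
Y = np.einsum('fm,mtf->tf', w.conj(), X)  # enhanced single-channel STFT
```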
arXiv Detail & Related papers (2024-10-30T08:32:47Z)
- Improving Audio-Visual Speech Recognition by Lip-Subword Correlation Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network that devotes the main training parameters to multiple cross-modal attention layers.
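The fusion step can be pictured as audio-guided cross-attention, with audio frames acting as queries over the visual feature sequence. This minimal sketch, with arbitrary sizes and a residual connection of my own choosing, only gestures at the CMFE idea and is not the paper's architecture.

```python
import torch
import torch.nn as nn

D = 256
cross_attn = nn.MultiheadAttention(D, num_heads=4, batch_first=True)
audio = torch.randn(2, 100, D)   # (batch, audio frames, dim)
visual = torch.randn(2, 25, D)   # (batch, video frames, dim), 25 fps video
# Audio-guided fusion: each audio frame attends over the visual sequence.
fused, _ = cross_attn(query=audio, key=visual, value=visual)
fused = fused + audio            # residual keeps the audio stream primary
```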
arXiv Detail & Related papers (2023-08-14T08:19:24Z)
- Unsupervised Noise adaptation using Data Simulation [21.866522173387715]
We propose a generative adversarial network (GAN)-based method to efficiently learn the converse clean-to-noisy transformation.
Experimental results show that our method effectively mitigates the domain mismatch between training and test sets.
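A toy sketch of the adversarial clean-to-noisy mapping under stated assumptions: a generator pushes clean features toward the unpaired noisy domain while a discriminator tries to tell them apart, after which (clean, generated) pairs can serve as simulated parallel data. Architectures, losses, and shapes are placeholders, not the paper's design.

```python
import torch
import torch.nn as nn

D_FEAT = 64
G = nn.Sequential(nn.Linear(D_FEAT, 128), nn.ReLU(), nn.Linear(128, D_FEAT))
D = nn.Sequential(nn.Linear(D_FEAT, 128), nn.ReLU(), nn.Linear(128, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

for _ in range(100):
    clean = torch.randn(32, D_FEAT)        # source-domain clean features
    noisy = torch.randn(32, D_FEAT) + 1.0  # unpaired target-domain noisy
    fake = G(clean)                        # pseudo-noisy version of clean
    # Discriminator: real noisy -> 1, generated pseudo-noisy -> 0
    d_loss = bce(D(noisy), torch.ones(32, 1)) + \
             bce(D(fake.detach()), torch.zeros(32, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: fool the discriminator
    g_loss = bce(D(fake), torch.ones(32, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
# (clean, G(clean)) pairs now simulate parallel data in the noisy domain.
```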
arXiv Detail & Related papers (2023-02-23T12:57:20Z)
- Prompt Tuning of Deep Neural Networks for Speaker-adaptive Visual Speech Recognition [66.94463981654216]
We propose prompt tuning methods of Deep Neural Networks (DNNs) for speaker-adaptive Visual Speech Recognition (VSR).
We finetune prompts on adaptation data of target speakers instead of modifying the pre-trained model parameters.
The effectiveness of the proposed method is evaluated on both word- and sentence-level VSR databases.
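A minimal sketch of the prompt-tuning idea, assuming a frozen pre-trained encoder and classifier: only a few prompt vectors prepended to the input sequence are optimized on the target speaker's adaptation data. The tiny Transformer here is a stand-in, not the paper's VSR model.

```python
import torch
import torch.nn as nn

D_MODEL, N_PROMPT, N_CLASS = 256, 8, 500
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(D_MODEL, nhead=4, batch_first=True), 2)
head = nn.Linear(D_MODEL, N_CLASS)
for p in list(encoder.parameters()) + list(head.parameters()):
    p.requires_grad = False                     # freeze pre-trained weights

prompt = nn.Parameter(torch.randn(1, N_PROMPT, D_MODEL) * 0.02)
opt = torch.optim.Adam([prompt], lr=1e-3)       # train the prompts only

x = torch.randn(4, 20, D_MODEL)                 # target speaker's features
y = torch.randint(0, N_CLASS, (4,))             # word-level labels
for _ in range(10):
    inp = torch.cat([prompt.expand(4, -1, -1), x], dim=1)  # prepend prompts
    logits = head(encoder(inp).mean(dim=1))
    loss = nn.functional.cross_entropy(logits, y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```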
arXiv Detail & Related papers (2023-02-16T06:01:31Z)
- Direction-Aware Joint Adaptation of Neural Speech Enhancement and Recognition in Real Multiparty Conversational Environments [21.493664174262737]
This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions.
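The recipe can be caricatured as confidence-gated pseudo-labeling: run-time utterances whose estimated transcriptions are sufficiently confident join the ground-truth data for a joint update of both modules. The linear "enhancer" and "ASR" stand-ins and the softmax-based confidence below are toy assumptions for illustration only.

```python
import torch
import torch.nn as nn

enhancer = nn.Linear(40, 40)   # stands in for the mask estimator
asr = nn.Linear(40, 30)        # stands in for the ASR model
opt = torch.optim.Adam(list(enhancer.parameters()) + list(asr.parameters()),
                       lr=1e-4)
ce = nn.CrossEntropyLoss()
CONF_THRESH = 0.9

def pseudo_label(feats):
    """Decode and return (labels, confidence) for unlabeled noisy speech."""
    with torch.no_grad():
        probs = asr(enhancer(feats)).softmax(dim=-1)
        conf, labels = probs.max(dim=-1)
    return labels, conf.mean().item()

clean_x, clean_y = torch.randn(8, 40), torch.randint(0, 30, (8,))
noisy_x = torch.randn(8, 40)                 # observed at run time

labels, conf = pseudo_label(noisy_x)
batches = [(clean_x, clean_y)]               # ground-truth transcriptions
if conf >= CONF_THRESH:                      # keep only confident estimates
    batches.append((noisy_x, labels))
for x, y in batches:                         # joint update of both modules
    loss = ce(asr(enhancer(x)), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
```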
arXiv Detail & Related papers (2022-07-15T03:43:35Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained increasing attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting audio data from that domain.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves an average word-timing difference of less than 50 ms.
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Multi-Tones' Phase Coding (MTPC) of Interaural Time Difference by Spiking Neural Network [68.43026108936029]
We propose a pure spiking neural network (SNN) based computational model for precise sound localization in the noisy real-world environment.
We implement this algorithm in a real-time robotic system with a microphone array.
The experimental results show a mean azimuth error of 13 degrees, which surpasses the accuracy of other biologically plausible neuromorphic approaches to sound source localization.
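For orientation only, the sketch below is a conventional cross-correlation baseline showing how an interaural time difference maps to an azimuth for a two-microphone pair; it is not the paper's spiking network or its multi-tone phase coding.

```python
import numpy as np

FS, MIC_DIST, C = 16000, 0.2, 343.0  # sample rate, mic spacing (m), speed of sound (m/s)

def itd_to_azimuth(left, right):
    """Azimuth (degrees, positive toward the left mic) from the ITD at the
    cross-correlation peak."""
    corr = np.correlate(right, left, mode="full")
    lag = np.argmax(corr) - (len(left) - 1)  # samples the right channel lags
    sin_theta = np.clip(lag / FS * C / MIC_DIST, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_theta))

# Simulate a wideband source 30 degrees toward the left microphone.
true_az = 30.0
delay = int(round(MIC_DIST * np.sin(np.radians(true_az)) / C * FS))
sig = np.random.randn(4000)
left, right = sig, np.roll(sig, delay)   # the right channel lags
print(itd_to_azimuth(left, right))       # ~30 (quantized to the sample grid)
```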
arXiv Detail & Related papers (2020-07-07T08:22:56Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from multi-speaker mixture.
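Direction information is often injected through an "angle feature" comparing observed inter-channel phase differences (IPD) with the steering phase a source at the target direction would produce; the sketch below is one illustrative construction of such a feature for a two-microphone pair, not necessarily the paper's exact design.

```python
import numpy as np

FS, N_FFT, MIC_DIST, C = 16000, 512, 0.05, 343.0
freqs = np.fft.rfftfreq(N_FFT, d=1.0 / FS)              # (F,) bin frequencies

def angle_feature(stft_pair, target_az_deg):
    """stft_pair: (2, frames, freqs) complex STFT. Returns a (frames, freqs)
    feature that is ~1 where the target direction dominates."""
    ipd = np.angle(stft_pair[0] * np.conj(stft_pair[1]))    # observed IPD
    tau = MIC_DIST * np.sin(np.radians(target_az_deg)) / C  # expected TDOA
    target_phase = 2.0 * np.pi * freqs * tau                # steering phase
    return np.cos(ipd - target_phase)

T = 100
X = np.random.randn(2, T, len(freqs)) + 1j * np.random.randn(2, T, len(freqs))
feat = angle_feature(X, target_az_deg=30.0)  # concatenated with spectral inputs
```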
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.