Direction-Aware Joint Adaptation of Neural Speech Enhancement and
Recognition in Real Multiparty Conversational Environments
- URL: http://arxiv.org/abs/2207.07273v1
- Date: Fri, 15 Jul 2022 03:43:35 GMT
- Title: Direction-Aware Joint Adaptation of Neural Speech Enhancement and
Recognition in Real Multiparty Conversational Environments
- Authors: Yicheng Du, Aditya Arie Nugraha, Kouhei Sekiguchi, Yoshiaki Bando,
Mathieu Fontaine, Kazuyoshi Yoshii
- Abstract summary: This paper describes noisy speech recognition for an augmented reality headset that helps verbal communication within real multiparty conversational environments.
We propose a semi-supervised adaptation method that jointly updates the mask estimator and the ASR model at run-time using clean speech signals with ground-truth transcriptions and noisy speech signals with highly-confident estimated transcriptions.
- Score: 21.493664174262737
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes noisy speech recognition for an augmented reality
headset that helps verbal communication within real multiparty conversational
environments. A major approach that has actively been studied in simulated
environments is to sequentially perform speech enhancement and automatic speech
recognition (ASR) based on deep neural networks (DNNs) trained in a supervised
manner. In our task, however, such a pretrained system fails to work due to the
mismatch between the training and test conditions and the head movements of the
user. To enhance only the utterances of a target speaker, we use beamforming
based on a DNN-based speech mask estimator that can adaptively extract the
speech components corresponding to a particular head-relative direction. We
propose a semi-supervised adaptation method that jointly updates the mask
estimator and the ASR model at run-time using clean speech signals with
ground-truth transcriptions and noisy speech signals with highly-confident
estimated transcriptions. Comparative experiments using the state-of-the-art
distant speech recognition system show that the proposed method significantly
improves the ASR performance.
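For illustration, the run-time joint adaptation idea can be sketched as follows: a supervised loss on clean speech with ground-truth transcriptions plus a pseudo-supervised loss on mask-enhanced noisy speech whose estimated transcription is sufficiently confident, with gradients flowing into both the mask estimator and the ASR model. This is a minimal PyTorch sketch under simplifying assumptions (toy modules, single-channel masking instead of beamforming, a CTC objective), not the authors' implementation.

```python
import torch
import torch.nn as nn

class MaskEstimator(nn.Module):
    """Toy stand-in for the DNN-based speech mask estimator used for enhancement."""
    def __init__(self, n_freq=257):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_freq, 256), nn.ReLU(),
                                 nn.Linear(256, n_freq), nn.Sigmoid())

    def forward(self, noisy_spec):             # (batch, frames, freq) magnitudes
        return self.net(noisy_spec)            # time-frequency speech mask in [0, 1]

class ASRModel(nn.Module):
    """Toy stand-in for the ASR acoustic model with a CTC-style output layer."""
    def __init__(self, n_freq=257, vocab=32):
        super().__init__()
        self.rnn = nn.GRU(n_freq, 256, batch_first=True)
        self.out = nn.Linear(256, vocab)

    def forward(self, feats):
        h, _ = self.rnn(feats)
        return self.out(h).log_softmax(dim=-1)  # (batch, frames, vocab)

def adaptation_step(mask_est, asr, optimizer, ctc,
                    clean_spec, clean_text, clean_len,
                    noisy_spec, pseudo_text, pseudo_len,
                    confidence, conf_threshold=0.9):
    """One run-time update: supervised CTC loss on clean speech with ground-truth text,
    plus a pseudo-supervised loss on enhanced noisy speech whose estimated
    transcription exceeds a confidence threshold. Gradients reach both modules."""
    optimizer.zero_grad()
    logp_clean = asr(clean_spec)
    in_len = torch.full((clean_spec.size(0),), logp_clean.size(1), dtype=torch.long)
    loss = ctc(logp_clean.transpose(0, 1), clean_text, in_len, clean_len)
    if confidence >= conf_threshold:
        enhanced = mask_est(noisy_spec) * noisy_spec            # mask-based enhancement
        logp_noisy = asr(enhanced)
        in_len = torch.full((noisy_spec.size(0),), logp_noisy.size(1), dtype=torch.long)
        loss = loss + ctc(logp_noisy.transpose(0, 1), pseudo_text, in_len, pseudo_len)
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    mask_est, asr = MaskEstimator(), ASRModel()
    opt = torch.optim.Adam(list(mask_est.parameters()) + list(asr.parameters()), lr=1e-4)
    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    clean, noisy = torch.rand(2, 100, 257), torch.rand(2, 120, 257)   # dummy spectrograms
    clean_text, clean_len = torch.randint(1, 32, (2, 10)), torch.full((2,), 10)
    pseudo_text, pseudo_len = torch.randint(1, 32, (2, 8)), torch.full((2,), 8)
    print(adaptation_step(mask_est, asr, opt, ctc, clean, clean_text, clean_len,
                          noisy, pseudo_text, pseudo_len, confidence=0.95))
```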
Related papers
- Predictive Speech Recognition and End-of-Utterance Detection Towards Spoken Dialog Systems [55.99999020778169]
We study a function that can predict the forthcoming words and estimate the time remaining until the end of an utterance.
We develop a cross-attention-based algorithm that incorporates both acoustic and linguistic information.
Results demonstrate the proposed model's ability to predict upcoming words and estimate future EOU events up to 300 ms prior to the actual EOU.
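As a rough illustration of the cross-attention idea (the dimensions, attention direction, and prediction heads below are assumptions, not the paper's architecture), linguistic token states can query acoustic frame states to drive word and end-of-utterance (EOU) predictions:

```python
import torch
import torch.nn as nn

# Hypothetical sketch: partial-hypothesis token embeddings (queries) attend over
# acoustic encoder frames (keys/values); the fused state feeds the prediction heads.
cross_attn = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)
word_head, eou_head = nn.Linear(256, 1000), nn.Linear(256, 1)

acoustic = torch.randn(1, 200, 256)     # (batch, frames, dim) acoustic encoder output
linguistic = torch.randn(1, 12, 256)    # (batch, tokens, dim) decoded-so-far embeddings
fused, _ = cross_attn(query=linguistic, key=acoustic, value=acoustic)
next_word_logits = word_head(fused[:, -1])   # forthcoming-word distribution
time_to_eou = eou_head(fused[:, -1])         # regressed time remaining until EOU
```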
arXiv Detail & Related papers (2024-09-30T06:29:58Z)
- Enhancing Dialogue Speech Recognition with Robust Contextual Awareness via Noise Representation Learning [6.363223418619587]
We introduce Context Noise Representation Learning (CNRL) to enhance robustness against noisy context, ultimately improving dialogue speech recognition accuracy.
In evaluations on spoken dialogue data, our method outperforms the baselines.
arXiv Detail & Related papers (2024-08-12T10:21:09Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
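The joint phoneme-speech space can be illustrated with a CLIP-style symmetric contrastive loss over paired utterance-level embeddings; this is a generic sketch under that assumption, not the CTAP training recipe itself.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    """speech_emb, phoneme_emb: (batch, dim); row i of each tensor forms a positive pair."""
    s = F.normalize(speech_emb, dim=-1)
    p = F.normalize(phoneme_emb, dim=-1)
    logits = s @ p.t() / temperature                        # (batch, batch) similarities
    targets = torch.arange(s.size(0), device=s.device)
    # Symmetric InfoNCE: speech-to-phoneme and phoneme-to-speech retrieval.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 128), torch.randn(8, 128))
```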
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Inner speech recognition through electroencephalographic signals [2.578242050187029]
This work focuses on inner speech recognition starting from EEG signals.
The decoding of the EEG signals into text is framed as the classification of a limited number of words (commands).
Speech-related BCIs provide effective vocal communication strategies for controlling devices through speech commands interpreted from brain signals.
arXiv Detail & Related papers (2022-10-11T08:29:12Z)
- Direction-Aware Adaptive Online Neural Speech Enhancement with an Augmented Reality Headset in Real Noisy Conversational Environments [21.493664174262737]
This paper describes the practical response- and performance-aware development of online speech enhancement for an augmented reality (AR) headset.
It helps a user follow conversations held in real noisy, echoic environments (e.g., a cocktail party).
The enhancement is combined with a blind dereverberation method called weighted prediction error (WPE) for transcribing the noisy reverberant speech of a target speaker.
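The mask-based beamforming front-end referred to in the abstract above can be sketched as a mask-driven MVDR beamformer in NumPy; the array layout, reference-microphone choice, and diagonal loading are assumptions, and the WPE dereverberation step is left to an existing implementation.

```python
import numpy as np

def mvdr_from_mask(stft, mask, ref_mic=0, eps=1e-8):
    """stft: (mics, frames, freqs) complex STFT; mask: (frames, freqs) speech mask in [0, 1]."""
    mics, frames, freqs = stft.shape
    out = np.zeros((frames, freqs), dtype=complex)
    for f in range(freqs):
        Y = stft[:, :, f]                                           # (mics, frames)
        m = mask[:, f]                                              # (frames,)
        phi_s = (Y * m) @ Y.conj().T / (m.sum() + eps)              # speech spatial covariance
        phi_n = (Y * (1 - m)) @ Y.conj().T / ((1 - m).sum() + eps)  # noise spatial covariance
        phi_n += eps * np.eye(mics)                                 # diagonal loading for stability
        num = np.linalg.solve(phi_n, phi_s)                         # Phi_n^{-1} Phi_s
        w = num[:, ref_mic] / (np.trace(num) + eps)                 # Souden-style MVDR weights
        out[:, f] = w.conj() @ Y                                    # beamformed STFT for this bin
    return out
```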
arXiv Detail & Related papers (2022-07-15T05:14:27Z)
- Personalized Speech Enhancement: New Models and Comprehensive Evaluation [27.572537325449158]
We propose two neural network models for personalized speech enhancement (PSE) that achieve superior performance to the previously proposed VoiceFilter.
We also create test sets that capture a variety of scenarios that users can encounter during video conferencing.
Our results show that the proposed models can yield better speech recognition accuracy, speech intelligibility, and perceptual quality than the baseline models.
arXiv Detail & Related papers (2021-10-18T21:21:23Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
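The content-encoding idea can be illustrated with a generic straight-through vector-quantization bottleneck; the codebook size, commitment weight, and the omission of the mutual-information term are simplifications, not the VQMIVC design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-codeword quantization with a straight-through gradient estimator."""
    def __init__(self, codebook_size=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.beta = beta

    def forward(self, z):                                   # z: (batch, frames, dim)
        flat = z.reshape(-1, z.size(-1))
        idx = torch.cdist(flat, self.codebook.weight).argmin(dim=-1)
        q = self.codebook(idx).view_as(z)                   # quantized content codes
        # Codebook loss pulls codewords toward encoder outputs; commitment loss does the reverse.
        vq_loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                            # straight-through estimator
        return q, vq_loss, idx

codes, vq_loss, _ = VectorQuantizer()(torch.randn(4, 50, 64))
```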
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- CASA-Based Speaker Identification Using Cascaded GMM-CNN Classifier in Noisy and Emotional Talking Conditions [1.6449390849183358]
This work aims at improving text-independent speaker identification performance in real application situations such as noisy and emotional talking conditions.
It proposes and evaluates a novel algorithm to improve the accuracy of speaker identification in emotional and highly noisy conditions.
arXiv Detail & Related papers (2021-02-11T08:56:12Z)
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z)
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
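A toy illustration of the self-adaptation idea: condition an enhancement network on a speaker embedding pooled from the test utterance itself. The pooling, layer sizes, and mask-based output are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class SpeakerAwareEnhancer(nn.Module):
    def __init__(self, n_freq=257, spk_dim=128):
        super().__init__()
        self.spk_proj = nn.Linear(n_freq, spk_dim)           # toy speaker-embedding extractor
        self.net = nn.Sequential(nn.Linear(n_freq + spk_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_freq), nn.Sigmoid())

    def forward(self, spec):                                 # (batch, frames, freq)
        spk = self.spk_proj(spec.mean(dim=1))                # embedding from the test utterance
        spk = spk.unsqueeze(1).expand(-1, spec.size(1), -1)  # broadcast over frames
        mask = self.net(torch.cat([spec, spk], dim=-1))
        return mask * spec                                   # enhanced spectrogram

enhanced = SpeakerAwareEnhancer()(torch.rand(2, 80, 257))
```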
arXiv Detail & Related papers (2020-02-14T05:05:36Z)