Task-aware Warping Factors in Mask-based Speech Enhancement
- URL: http://arxiv.org/abs/2108.12128v1
- Date: Fri, 27 Aug 2021 05:57:37 GMT
- Title: Task-aware Warping Factors in Mask-based Speech Enhancement
- Authors: Qiongqiong Wang, Kong Aik Lee, Takafumi Koshinaka, Koji Okabe, Hitoshi
Yamamoto
- Abstract summary: We propose the use of two task-aware warping factors in mask-based speech enhancement (SE).
One controls the balance between speech maintenance and noise removal in the training phase, while the other controls the SE power applied to specific downstream tasks in the testing phase.
The proposed dual-warping-factor approach is easy to apply to any mask-based SE method.
- Score: 31.913984833849753
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes the use of two task-aware warping factors in mask-based
speech enhancement (SE). One controls the balance between speech-maintenance
and noise-removal in training phases, while the other controls SE power applied
to specific downstream tasks in testing phases. Our intention is to alleviate
the problem that SE systems trained to improve speech quality often fail to
improve other downstream tasks, such as automatic speaker verification (ASV)
and automatic speech recognition (ASR), because they do not share the same
objectives. It is easy to apply the proposed dual-warping-factor approach to any
mask-based SE method, and it allows a single SE system to handle multiple tasks
without task-dependent training. The effectiveness of our proposed approach has
been confirmed on the SITW dataset for ASV evaluation and the LibriSpeech
dataset for ASR and speech quality evaluations at SNRs of 0-20 dB. We show that
different warping values are necessary for a single SE to achieve optimal
performance w.r.t. the three tasks. With the use of task-dependent warping
factors, speech quality was improved by an 84.7% PESQ increase, ASV had a 22.4%
EER reduction, and ASR had a 52.2% WER reduction on 0 dB speech. The
effectiveness of the task-dependent warping factors was also cross-validated
on the VoxCeleb-1 test set for ASV and the LibriSpeech dev-clean set for ASR and
quality evaluations. The proposed method is highly effective and easy to apply
in practice.
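As a rough, non-authoritative illustration of the dual-warping idea (the exponent form, the weighted loss, and all function names below are assumptions, not the paper's exact formulation), the sketch assumes a ratio mask in [0, 1] applied to a noisy magnitude spectrogram: a test-time exponent beta scales enhancement strength per downstream task, and a training-time weight alpha trades speech-maintenance errors against noise-removal errors.

```python
import numpy as np

def apply_warped_mask(noisy_mag, mask, beta=1.0):
    # Test-time warping: beta < 1 enhances more gently (e.g. if ASR prefers
    # mild processing); beta > 1 suppresses noise harder (e.g. for perceptual
    # quality). The exponent form is an illustrative assumption.
    return noisy_mag * np.clip(mask, 1e-8, 1.0) ** beta

def dual_warped_loss(mask_pred, mask_ref, speech_dominant, alpha=0.5):
    # Training-time warping: alpha balances two error types over
    # time-frequency bins. speech_dominant is a boolean array marking
    # speech-dominant bins (assumed available from the training targets).
    err = (mask_pred - mask_ref) ** 2
    speech_term = err[speech_dominant].mean()   # penalizes removing speech
    noise_term = err[~speech_dominant].mean()   # penalizes retaining noise
    return alpha * speech_term + (1.0 - alpha) * noise_term
```

Under this reading, a single trained mask estimator can serve quality, ASV, and ASR by choosing a task-dependent beta at inference, consistent with the abstract's claim that no task-dependent training is required.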
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on the regularized SBE features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- CPM: Class-conditional Prompting Machine for Audio-visual Segmentation [17.477225065057993]
Class-conditional Prompting Machine (CPM) improves bipartite matching with a learning strategy combining class-agnostic queries with class-conditional queries.
We conduct experiments on AVS benchmarks, demonstrating that our method achieves state-of-the-art (SOTA) segmentation accuracy.
arXiv Detail & Related papers (2024-07-07T13:20:21Z)
- Active Learning with Task Adaptation Pre-training for Speech Emotion Recognition [17.59356583727259]
Speech emotion recognition (SER) has garnered increasing attention due to its wide range of applications.
We propose an active learning (AL)-based fine-tuning framework for SER, called After.
Our proposed method improves accuracy by 8.45% and reduces time consumption by 79%.
arXiv Detail & Related papers (2024-05-01T04:05:29Z)
- Use of Speech Impairment Severity for Dysarthric Speech Recognition [37.93801885333925]
This paper proposes a novel set of techniques to use both severity and speaker-identity in dysarthric speech recognition.
Experiments conducted on UASpeech suggest that incorporating speech impairment severity into state-of-the-art hybrid DNN, E2E Conformer, and pre-trained Wav2vec 2.0 ASR systems improves recognition performance.
arXiv Detail & Related papers (2023-05-18T02:42:59Z)
- Sequence-level self-learning with multiple hypotheses [53.04725240411895]
We develop new self-learning techniques with an attention-based sequence-to-sequence (seq2seq) model for automatic speech recognition (ASR).
In contrast to conventional unsupervised learning approaches, we adopt the multi-task learning (MTL) framework.
Our experiment results show that our method can reduce the WER on the British speech data from 14.55% to 10.36% compared to the baseline model trained with the US English data only.
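The summary does not spell out the MTL objective; one plausible reading, sketched below with assumed names and shapes, is a hypothesis-weighted cross-entropy in which several N-best pseudo-labels contribute to the self-learning loss rather than only the 1-best transcript.

```python
import torch
import torch.nn.functional as F

def multi_hypothesis_loss(logits, hypotheses, weights):
    # logits: (T, V) frame-level model outputs; hypotheses: list of (T,)
    # label tensors from an N-best list (assumed frame-aligned here for
    # simplicity); weights: normalized hypothesis posteriors. Illustrative
    # only; not the paper's exact sequence-level objective.
    losses = [F.cross_entropy(logits, h) for h in hypotheses]
    return sum(w * l for w, l in zip(weights, losses))
```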
arXiv Detail & Related papers (2021-12-10T20:47:58Z)
- Improving Noise Robustness of Contrastive Speech Representation Learning with Speech Reconstruction [109.44933866397123]
Noise robustness is essential for deploying automatic speech recognition systems in real-world environments.
We employ a noise-robust representation learned by a refined self-supervised framework for noisy speech recognition.
We achieve comparable performance to the best supervised approach reported with only 16% of labeled data.
arXiv Detail & Related papers (2021-10-28T20:39:02Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
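For context, BPE-dropout randomizes subword merges so the same transcript yields a different unit sequence on each pass, which is the mechanism behind the dynamic acoustic-unit augmentation described above. A minimal sketch using SentencePiece's sampling options follows; the model path and text are hypothetical stand-ins.

```python
import sentencepiece as spm

# "bpe_units.model" is a hypothetical pre-trained BPE unit model.
sp = spm.SentencePieceProcessor(model_file="bpe_units.model")

for epoch in range(3):
    # With enable_sampling=True on a BPE model, SentencePiece drops merges
    # with probability alpha (BPE-dropout), so each epoch can produce a
    # different segmentation of the same training transcript.
    units = sp.encode("merhaba dünya", out_type=str,
                      enable_sampling=True, alpha=0.1)
    print(epoch, units)
```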
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- NUVA: A Naming Utterance Verifier for Aphasia Treatment [49.114436579008476]
Assessment of speech performance using picture naming tasks is a key method for both diagnosis and monitoring of responses to treatment interventions by people with aphasia (PWA).
Here we present NUVA, an utterance verification system incorporating a deep learning element that classifies 'correct' versus 'incorrect' naming attempts from aphasic stroke patients.
When tested on eight native British-English-speaking PWA, the system's accuracy ranged from 83.6% to 93.6%, with a 10-fold cross-validation mean of 89.5%.
arXiv Detail & Related papers (2021-02-10T13:00:29Z)
- Improving Readability for Automatic Speech Recognition Transcription [50.86019112545596]
We propose a novel NLP task called ASR post-processing for readability (APR).
APR aims to transform the noisy ASR output into a readable text for humans and downstream tasks while maintaining the semantic meaning of the speaker.
We compare fine-tuned models based on several open-sourced and adapted pre-trained models with the traditional pipeline method.
arXiv Detail & Related papers (2020-04-09T09:26:42Z)