End-to-end Whispered Speech Recognition with Frequency-weighted
Approaches and Pseudo Whisper Pre-training
- URL: http://arxiv.org/abs/2005.01972v2
- Date: Sun, 8 Nov 2020 06:22:36 GMT
- Title: End-to-end Whispered Speech Recognition with Frequency-weighted
Approaches and Pseudo Whisper Pre-training
- Authors: Heng-Jui Chang, Alexander H. Liu, Hung-yi Lee, Lin-shan Lee
- Abstract summary: We present several approaches for end-to-end (E2E) recognition of whispered speech.
We achieve an overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus.
As long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
- Score: 130.56878980058966
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Whispering is an important mode of human speech, but no end-to-end
recognition results for it have been reported yet, probably due to the scarcity of
available whispered speech data. In this paper, we present several approaches
for end-to-end (E2E) recognition of whispered speech considering the special
characteristics of whispered speech and the scarcity of data. This includes a
frequency-weighted SpecAugment policy and a frequency-divided CNN feature
extractor for better capturing the high-frequency structures of whispered
speech, and a layer-wise transfer learning approach to pre-train a model with
normal or normal-to-whispered converted speech then fine-tune it with whispered
speech to bridge the gap between whispered and normal speech. We achieve an
overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively
small whispered TIMIT corpus. The results indicate that, as long as we have a good
E2E model pre-trained on normal or pseudo-whispered speech, a relatively small
set of whispered speech may suffice to obtain a reasonably good E2E whispered
speech recognizer.
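As a concrete illustration of the frequency-weighted SpecAugment idea, here is a minimal NumPy sketch that samples frequency-mask positions non-uniformly over the mel bins. The linear weighting toward high-frequency bins and all parameter values are illustrative assumptions, not the paper's exact policy.

```python
import numpy as np

def frequency_weighted_specaugment(spec, num_masks=2, max_width=8, rng=None):
    """SpecAugment-style frequency masking whose mask positions are sampled
    non-uniformly over the mel bins, instead of uniformly as in vanilla
    SpecAugment.

    spec: (num_mel_bins, num_frames) log-mel spectrogram.
    NOTE: the linear ramp below (masking higher bins more often) is an
    illustrative assumption, not the exact weighting from the paper.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_bins = spec.shape[0]
    # Illustrative weighting: probability grows linearly with bin index,
    # so high-frequency regions are masked more often during training.
    weights = np.arange(1, n_bins + 1, dtype=np.float64)
    weights /= weights.sum()
    for _ in range(num_masks):
        width = rng.integers(1, max_width + 1)
        start = rng.choice(n_bins, p=weights)
        spec[start:start + width, :] = spec.mean()  # fill mask with mean
    return spec
```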
Related papers
- Quartered Spectral Envelope and 1D-CNN-based Classification of Normally Phonated and Whispered Speech [0.0]
The presence of pitch and pitch harmonics in normal speech, and their absence in whispered speech, are evident in the spectral envelope of the Fourier transform.
We propose the use of one dimensional convolutional neural networks (1D-CNN) to capture these features.
The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset.
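A minimal PyTorch sketch of a 1D-CNN classifier over a spectral envelope, in the spirit of the approach above; the layer sizes, kernel widths, and envelope length are illustrative guesses, not the paper's architecture.

```python
import torch
import torch.nn as nn

class EnvelopeCNN(nn.Module):
    """Binary whispered-vs-normal classifier over a 1-D spectral envelope.
    Layer sizes here are illustrative, not taken from the paper."""
    def __init__(self, envelope_len=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, padding=4), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=9, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the frequency axis
        )
        self.classifier = nn.Linear(32, 2)  # normal vs whispered

    def forward(self, x):                    # x: (batch, envelope_len)
        h = self.features(x.unsqueeze(1))    # -> (batch, 32, 1)
        return self.classifier(h.squeeze(-1))
```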
arXiv Detail & Related papers (2024-08-25T07:17:11Z)
- Towards Improving NAM-to-Speech Synthesis Intelligibility using Self-Supervised Speech Models [24.943609458024596]
We propose a novel approach to significantly improve the intelligibility in the Non-Audible Murmur (NAM)-to-speech conversion task.
Unlike conventional methods that explicitly record ground-truth speech, our methodology relies on self-supervision and speech-to-speech synthesis.
Our method surpasses the current state of the art (SOTA) with a 29.08% improvement in the Mel-Cepstral Distortion (MCD) metric.
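For reference, the standard Mel-Cepstral Distortion computation can be sketched in a few lines of NumPy. Frame alignment (e.g., by DTW) and exclusion of the 0th coefficient are assumed here and may differ from the paper's exact evaluation setup.

```python
import numpy as np

def mel_cepstral_distortion(mc_ref, mc_syn):
    """Mean Mel-Cepstral Distortion in dB between two time-aligned
    sequences of mel-cepstral frames of shape (num_frames, num_coeffs).
    The 0th (energy) coefficient is conventionally excluded."""
    diff = mc_ref[:, 1:] - mc_syn[:, 1:]
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    return float(const * np.mean(np.sqrt(np.sum(diff ** 2, axis=1))))
```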
arXiv Detail & Related papers (2024-07-26T06:44:01Z)
- Pre-Finetuning for Few-Shot Emotional Speech Recognition [20.894029832911617]
We view speaker adaptation as a few-shot learning problem.
We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives.
arXiv Detail & Related papers (2023-02-24T22:38:54Z)
- ESSumm: Extractive Speech Summarization from Untranscribed Meeting [7.309214379395552]
We propose a novel architecture for direct extractive speech-to-speech summarization, ESSumm.
We leverage an off-the-shelf self-supervised convolutional neural network to extract deep speech features from raw audio.
Our approach automatically predicts the optimal sequence of speech segments that captures the key information within a target summary length.
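A minimal sketch of extractive selection under a length budget; the greedy centroid-similarity rule below is only an illustrative stand-in for ESSumm's learned prediction of the optimal segment sequence.

```python
import numpy as np

def select_segments(seg_embeddings, seg_durations, budget_sec):
    """Greedy stand-in for learned segment selection: rank speech segments
    by similarity to the overall recording embedding and keep them, in
    original order, until the target summary length is reached."""
    emb = np.asarray(seg_embeddings, dtype=np.float64)
    emb /= np.linalg.norm(emb, axis=1, keepdims=True) + 1e-9
    centroid = emb.mean(axis=0)
    scores = emb @ centroid                   # cosine-like relevance
    chosen, used = [], 0.0
    for i in np.argsort(-scores):             # most relevant first
        if used + seg_durations[i] <= budget_sec:
            chosen.append(int(i))
            used += seg_durations[i]
    return sorted(chosen)                     # restore temporal order
```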
arXiv Detail & Related papers (2022-09-14T20:13:15Z)
- Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks.
In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
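The mask-and-predict decoding loop over discrete speech units can be sketched as follows; `model`, the linear re-masking schedule, and all names are assumptions for illustration, not TranSpeech's exact procedure.

```python
import torch

@torch.no_grad()
def mask_predict(model, length, mask_id, iters=10):
    """Minimal mask-predict loop over discrete speech units: start fully
    masked, then repeatedly re-mask the least confident positions and
    re-predict them. `model(units)` is assumed to return per-position
    logits of shape (length, vocab_size)."""
    units = torch.full((length,), mask_id, dtype=torch.long)
    for t in range(iters):
        logits = model(units)                 # (length, vocab_size)
        conf, pred = logits.softmax(-1).max(-1)
        units = pred
        n_mask = int(length * (iters - t - 1) / iters)  # linear schedule
        if n_mask > 0:
            worst = conf.topk(n_mask, largest=False).indices
            units[worst] = mask_id            # re-mask least confident
    return units
```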
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
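The target-switching idea can be written down as a loss sketch: each view's contextual representations must also predict the other view's quantized targets, pushing the context to be invariant to the added noise. The in-batch negatives and tensor shapes here are simplifying assumptions, not wav2vec-Switch's exact distractor sampling.

```python
import torch
import torch.nn.functional as F

def switched_contrastive_loss(ctx_clean, ctx_noisy, q_clean, q_noisy,
                              temp=0.1):
    """All tensors: (frames, dim). In-batch negatives only, which is a
    simplification of the real distractor sampling."""
    def contrastive(ctx, targets):
        # cosine-similarity logits of every frame against every target
        logits = F.cosine_similarity(
            ctx.unsqueeze(1), targets.unsqueeze(0), dim=-1) / temp
        labels = torch.arange(ctx.size(0))    # positive = same frame
        return F.cross_entropy(logits, labels)
    # original pairings plus the switched pairings described above
    return (contrastive(ctx_clean, q_clean) + contrastive(ctx_noisy, q_noisy)
            + contrastive(ctx_clean, q_noisy) + contrastive(ctx_noisy, q_clean))
```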
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - Injecting Text in Self-Supervised Speech Pretraining [33.676479965610774]
We propose to jointly learn representations during pretraining from two different modalities: speech and text.
tts4pretrain complements the power of contrastive learning in self-supervision.
We demonstrate Word Error Rate (WER) reductions of 10% relative on the well-benchmarked LibriSpeech task.
arXiv Detail & Related papers (2021-08-27T11:36:40Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
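A minimal sketch of the prosody-corrector component: a regressor from phoneme embeddings to typical duration and pitch values. The BiLSTM backbone and dimensions are illustrative assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn

class ProsodyCorrector(nn.Module):
    """Maps phoneme embeddings to typical per-phoneme duration and pitch.
    A single BiLSTM regressor stands in for the paper's architecture."""
    def __init__(self, emb_dim=256, hidden=128):
        super().__init__()
        self.rnn = nn.LSTM(emb_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.duration_head = nn.Linear(2 * hidden, 1)  # frames per phoneme
        self.pitch_head = nn.Linear(2 * hidden, 1)     # log-F0 per phoneme

    def forward(self, phoneme_emb):      # (batch, num_phonemes, emb_dim)
        h, _ = self.rnn(phoneme_emb)
        return self.duration_head(h), self.pitch_head(h)
```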
arXiv Detail & Related papers (2020-11-03T13:08:53Z)