Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer
- URL: http://arxiv.org/abs/2210.11885v1
- Date: Fri, 21 Oct 2022 11:26:59 GMT
- Title: Deep LSTM Spoken Term Detection using Wav2Vec 2.0 Recognizer
- Authors: Jan Švec, Jan Lehečka, Luboš Šmídl
- Abstract summary: The paper describes a bootstrapping approach that transfers the knowledge contained in the traditional pronunciation vocabulary of a DNN-HMM hybrid ASR into the context of grapheme-based Wav2Vec.
The proposed method outperforms the previously published system, which combined a DNN-HMM hybrid ASR with a phoneme recognizer, by a large margin on the MALACH data in both English and Czech.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In recent years, standard hybrid DNN-HMM speech recognizers have been
outperformed by end-to-end speech recognition systems. One of the most
promising approaches is the grapheme Wav2Vec 2.0 model, which combines
self-supervised pretraining with transfer learning of the fine-tuned speech
recognizer. Because it requires neither a pronunciation vocabulary nor a
language model, the approach is suitable for tasks where obtaining such models
is difficult or almost impossible.
In this paper, we use the Wav2Vec speech recognizer for spoken term detection
over a large set of spoken documents. The method employs a deep LSTM network
that maps the recognized hypothesis and the searched term into a shared
pronunciation embedding space in which the term occurrences and the assigned
scores are easily computed.
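To make the shared-embedding search concrete, here is a minimal sketch (in PyTorch, which the abstract does not specify): an LSTM encoder maps grapheme sequences to normalized vectors, and a query is detected wherever a window of the recognized hypothesis embeds close to it. The encoder architecture, dimensions, fixed-window segmentation, and threshold are illustrative assumptions, not the authors' configuration.

```python
# A minimal sketch (not the authors' code) of spoken term detection in a
# shared embedding space: an LSTM encoder maps character sequences to
# fixed-size vectors, and a query "hits" wherever a window of the
# recognized hypothesis embeds close to it.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SeqEmbedder(nn.Module):
    """Maps a grapheme sequence to one pronunciation-embedding vector."""
    def __init__(self, vocab_size, emb_dim=64, hidden=256, out_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=2,
                            batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, out_dim)

    def forward(self, ids):                       # ids: (batch, seq_len)
        out, _ = self.lstm(self.emb(ids))
        # use the last time step as the sequence summary
        return F.normalize(self.proj(out[:, -1]), dim=-1)

def detect(term_ids, hyp_ids, encoder, win=10, threshold=0.7):
    """Score every window of the hypothesis against the query embedding."""
    q = encoder(term_ids.unsqueeze(0))            # (1, out_dim)
    hits = []
    for start in range(0, hyp_ids.size(0) - win + 1):
        w = encoder(hyp_ids[start:start + win].unsqueeze(0))
        score = (q * w).sum().item()              # cosine (both normalized)
        if score >= threshold:
            hits.append((start, score))
    return hits
```

The sketch only shows why term occurrences and scores fall out of a shared embedding space almost for free; the actual system operates on the recognizer output with learned scoring rather than a fixed window and threshold.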
The paper describes a bootstrapping approach that transfers the knowledge
contained in the traditional pronunciation vocabulary of a DNN-HMM hybrid ASR
into the context of grapheme-based Wav2Vec. The proposed method outperforms the
previously published system, which combined a DNN-HMM hybrid ASR with a phoneme
recognizer, by a large margin on the MALACH data in both English and Czech.
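The bootstrapping idea can be pictured as distilling the hybrid recognizer's lexicon into a grapheme encoder. The sketch below is a hedged illustration, not the paper's recipe: the "word ph1 ph2 ..." lexicon format, the frozen phoneme-encoder teacher, and the cosine distillation loss are all assumptions.

```python
# A hedged sketch of the bootstrapping idea: the hybrid ASR's pronunciation
# lexicon supplies (graphemes, phonemes) pairs, and a grapheme encoder is
# trained so its embeddings match a phoneme encoder's embeddings, carrying
# lexicon knowledge into the grapheme-only Wav2Vec setting.
import torch
import torch.nn.functional as F

def lexicon_pairs(path):
    """Yield (word, phoneme list) from a 'word ph1 ph2 ...' lexicon file."""
    with open(path) as f:
        for line in f:
            word, *phones = line.split()
            yield word, phones

def bootstrap_step(grapheme_enc, phoneme_enc, g_ids, p_ids, optimizer):
    """Pull grapheme embeddings toward frozen phoneme embeddings."""
    with torch.no_grad():
        target = phoneme_enc(p_ids)          # (batch, dim), frozen teacher
    pred = grapheme_enc(g_ids)               # (batch, dim)
    loss = 1.0 - F.cosine_similarity(pred, target).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```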
Related papers
- Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution uses Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- One model to rule them all? Towards End-to-End Joint Speaker Diarization and Speech Recognition [50.055765860343286]
This paper presents a novel framework for joint speaker diarization and automatic speech recognition.
The framework, named SLIDAR, can process arbitrary length inputs and can handle any number of speakers.
Experiments performed on monaural recordings from the AMI corpus confirm the effectiveness of the method in both close-talk and far-field speech scenarios.
arXiv Detail & Related papers (2023-10-02T23:03:30Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
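The joint multimodal space mentioned in the CTAP summary above is typically trained with a symmetric contrastive objective over paired embeddings. The sketch below shows that generic InfoNCE-style pattern, not CTAP's exact loss:

```python
# A generic sketch of contrastive pairing: two encoders produce pooled
# speech and phoneme embeddings, trained with a symmetric cross-entropy
# so matching pairs land close in the joint space.
import torch
import torch.nn.functional as F

def contrastive_loss(speech_emb, phoneme_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings."""
    s = F.normalize(speech_emb, dim=-1)       # (batch, dim)
    p = F.normalize(phoneme_emb, dim=-1)      # (batch, dim)
    logits = s @ p.t() / temperature          # (batch, batch) similarities
    labels = torch.arange(s.size(0))          # matching pairs on diagonal
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))
```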
- AV-data2vec: Self-supervised Learning of Audio-Visual Speech Representations with Contextualized Target Representations [88.30635799280923]
We introduce AV-data2vec which builds audio-visual representations based on predicting contextualized representations.
Results on LRS3 show that AV-data2vec consistently outperforms existing methods with the same amount of data and model size.
arXiv Detail & Related papers (2023-02-10T02:55:52Z)
- L2 proficiency assessment using self-supervised speech representations [35.70742768910494]
This work extends an initial analysis of a self-supervised speech-representation-based scheme, which requires no speech recognition, to a large-scale proficiency test.
The performance of the self-supervised, wav2vec 2.0, system is compared to a high performance hand-crafted assessment system and a BERT-based text system.
Though the wav2vec 2.0 based system is found to be sensitive to the nature of the response, it can be configured to yield comparable performance to systems requiring a speech transcription.
arXiv Detail & Related papers (2022-11-16T11:47:20Z)
- Contextual-Utterance Training for Automatic Speech Recognition [65.4571135368178]
We propose a contextual-utterance training technique which makes use of the previous and future contextual utterances.
Also, we propose a dual-mode contextual-utterance training technique for streaming automatic speech recognition (ASR) systems.
The proposed technique reduces the WER by more than 6% relative and the average last-token emission latency by more than 40 ms.
arXiv Detail & Related papers (2022-10-27T08:10:44Z)
- Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages [58.43299730989809]
We introduce Wav2Seq, the first self-supervised approach to pre-train both parts of encoder-decoder models for speech data.
We induce a pseudo language as a compact discrete representation, and formulate a self-supervised pseudo speech recognition task.
This process stands on its own, or can be applied as low-cost second-stage pre-training.
arXiv Detail & Related papers (2022-05-02T17:59:02Z)
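The pseudo-language induction summarized above can be approximated by clustering acoustic features with k-means and collapsing repeated cluster ids into a compact discrete sequence; the cluster count and feature extraction below are placeholder assumptions, not Wav2Seq's actual settings:

```python
# A sketch of pseudo-language induction: cluster acoustic features with
# k-means, then collapse consecutive repeats to get a compact discrete
# "pseudo transcript" for a self-supervised recognition task.
import numpy as np
from sklearn.cluster import KMeans

def induce_pseudo_tokens(features, n_clusters=25):
    """features: (n_frames, dim) array -> deduplicated cluster-id sequence."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
    ids = km.predict(features)
    # merge runs of identical ids, e.g. 3 3 3 7 7 1 -> 3 7 1
    keep = np.concatenate(([True], ids[1:] != ids[:-1]))
    return ids[keep].tolist()
```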
- Self-supervised Learning with Random-projection Quantizer for Speech Recognition [51.24368930992091]
We present a simple and effective self-supervised learning approach for speech recognition.
The approach learns a model to predict masked speech signals, in the form of discrete labels.
It achieves word error rates similar to those of previous self-supervised learning work with non-streaming models.
arXiv Detail & Related papers (2022-02-03T21:29:04Z)
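The random-projection quantizer summarized above can be sketched as a frozen random projection of each speech frame followed by a nearest-neighbor lookup in a frozen random codebook; the resulting indices serve as the discrete labels predicted at masked positions. All dimensions below are illustrative assumptions:

```python
# A sketch of a random-projection quantizer: a frozen random matrix
# projects each speech frame, and the nearest entry in a frozen random
# codebook becomes the discrete label the model must predict.
import torch

class RandomProjectionQuantizer:
    def __init__(self, in_dim=80, code_dim=16, codebook_size=8192, seed=0):
        g = torch.Generator().manual_seed(seed)
        self.proj = torch.randn(in_dim, code_dim, generator=g)       # frozen
        self.codebook = torch.randn(codebook_size, code_dim, generator=g)

    def labels(self, frames):                 # frames: (n_frames, in_dim)
        z = frames @ self.proj                # (n_frames, code_dim)
        d = torch.cdist(z, self.codebook)     # distance to every code
        return d.argmin(dim=-1)               # (n_frames,) discrete labels
```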
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.