Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers
- URL: http://arxiv.org/abs/2107.03007v2
- Date: Thu, 8 Jul 2021 14:04:04 GMT
- Title: Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers
- Authors: Huahuan Zheng, Wenjie Peng, Zhijian Ou and Jinsong Zhang
- Abstract summary: The proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach.
We investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be successfully applied in CTC-CRFs.
- Score: 33.725831884078744
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatic speech recognition systems have been largely improved in the past
few decades and current systems are mainly hybrid-based and end-to-end-based.
The recently proposed CTC-CRF framework inherits the data-efficiency of the
hybrid approach and the simplicity of the end-to-end approach. In this paper,
we further advance the CTC-CRF based ASR technique with explorations of modeling
units and neural architectures. Specifically, we investigate techniques to
enable the recently developed wordpiece modeling units and Conformer neural
networks to be successfully applied in CTC-CRFs. Experiments are conducted on
two English datasets (Switchboard, Librispeech) and a German dataset from
CommonVoice. Experimental results suggest that (i) Conformer can improve the
recognition performance significantly; (ii) wordpiece-based systems perform
slightly worse than phone-based systems when the target language has a low
degree of grapheme-phoneme correspondence (e.g. English), while the two
systems perform equally strongly when that correspondence is high
(e.g. German).
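To make the modeling-unit discussion concrete, here is a minimal sketch of the numerator side of such a system: wordpiece targets scored with a plain CTC loss in PyTorch. This is not the full CTC-CRF objective (its CRF normalization over a denominator graph is not part of stock PyTorch), and the vocabulary size, dimensions, and tensors below are hypothetical stand-ins.

```python
# Minimal sketch: wordpiece targets under a plain CTC loss in PyTorch.
# Only the numerator/CTC part is shown; the CRF normalization of CTC-CRF
# (a denominator graph built from an n-gram LM of units) is omitted.
import torch
import torch.nn as nn

vocab_size = 151          # hypothetical: 150 wordpieces + 1 CTC blank
blank_id = 0
encoder_dim = 256

classifier = nn.Linear(encoder_dim, vocab_size)
ctc = nn.CTCLoss(blank=blank_id, zero_infinity=True)

T, N, U = 100, 4, 20                            # frames, batch, target length
encoder_out = torch.randn(T, N, encoder_dim)    # stand-in for Conformer output
log_probs = classifier(encoder_out).log_softmax(dim=-1)

targets = torch.randint(1, vocab_size, (N, U))  # wordpiece ids (blank excluded)
loss = ctc(log_probs, targets,
           torch.full((N,), T, dtype=torch.long),
           torch.full((N,), U, dtype=torch.long))
```

In practice the wordpiece inventory would come from a subword tokenizer trained on the transcripts (e.g., SentencePiece), and the encoder would be the Conformer discussed in the paper.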
Related papers
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
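As a rough illustration of the word-spotting idea in the entry above (not the paper's CTC-WS algorithm, which matches against a compact context graph rather than scoring candidates one by one), this sketch scores a single candidate token sequence against a window of CTC log-probabilities using PyTorch's CTC loss; the vocabulary size, token ids, and acceptance threshold are hypothetical.

```python
# Sketch: score one context-biasing candidate against CTC log-probabilities.
import torch
import torch.nn.functional as F

def candidate_nll(log_probs, token_ids, blank=0):
    """CTC negative log-likelihood of one candidate for a (T, V) window."""
    T, _ = log_probs.shape
    targets = torch.tensor([token_ids])
    return F.ctc_loss(log_probs.unsqueeze(1),   # (T, 1, V)
                      targets, torch.tensor([T]),
                      torch.tensor([len(token_ids)]),
                      blank=blank, reduction="sum")

log_probs = torch.randn(50, 32).log_softmax(-1)  # toy frame posteriors
score = candidate_nll(log_probs, token_ids=[5, 17])
detected = score.item() < 30.0                   # hypothetical threshold
```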
- Low-resource speech recognition and dialect identification of Irish in a multi-task framework [7.981589711420179]
This paper explores the use of Hybrid CTC/Attention encoder-decoder models trained with Intermediate CTC (Inter CTC) for Irish (Gaelic) low-resource speech recognition (ASR) and dialect identification (DID).
Results are compared to the current best-performing models trained for ASR (TDNN-HMM) and DID (ECAPA-TDNN).
arXiv Detail & Related papers (2024-05-02T13:54:39Z)
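The Hybrid CTC/Attention plus Inter CTC recipe in the entry above trains one encoder with several losses; a hedged sketch of the weighted combination is below, where the interpolation weights and the choice of intermediate layer are hypothetical.

```python
# Sketch: hybrid CTC/attention training objective with an auxiliary
# intermediate CTC term (Inter CTC); weights here are hypothetical.
import torch

def joint_loss(ctc_final, att_ce, ctc_inter, lam=0.3, inter_w=0.5):
    # lam balances CTC against attention cross-entropy; inter_w weights
    # the CTC loss computed from an intermediate encoder layer.
    ctc_part = (1 - inter_w) * ctc_final + inter_w * ctc_inter
    return lam * ctc_part + (1 - lam) * att_ce

loss = joint_loss(torch.tensor(42.0), torch.tensor(35.0), torch.tensor(47.0))
```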
- The evaluation of a code-switched Sepedi-English automatic speech recognition system [0.0]
We present the evaluation of the Sepedi-English code-switched automatic speech recognition system.
This end-to-end system was developed using the Sepedi Prompted Code Switching corpus and the CTC approach.
The model produced the lowest WER of 41.9%; however, it faced challenges in recognizing Sepedi-only text.
arXiv Detail & Related papers (2024-03-11T15:11:28Z)
- Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
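The entry above re-ranks recognition hypotheses with a semantic model; as a generic stand-in (n-best rescoring rather than true lattice rescoring, with a made-up rescorer and interpolation weight), the pattern looks like this:

```python
# Generic n-best rescoring sketch: re-rank ASR hypotheses by interpolating
# the acoustic score with a rescoring-model score. The rescorer and the
# weight alpha are hypothetical placeholders.
def rescore(nbest, rescorer, alpha=0.7):
    # nbest: list of (hypothesis_text, acoustic_log_score) pairs
    scored = [(hyp, alpha * am + (1 - alpha) * rescorer(hyp))
              for hyp, am in nbest]
    return max(scored, key=lambda pair: pair[1])[0]

best = rescore([("recognize speech", -12.3), ("wreck a nice beach", -11.9)],
               rescorer=lambda hyp: -2.0 * len(hyp.split()))
```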
- LiteG2P: A fast, light and high accuracy model for grapheme-to-phoneme conversion [18.83348872103488]
Grapheme-to-phoneme (G2P) plays the role of converting letters to their corresponding pronunciations.
Existing methods are either slow or poor in performance, and are limited in application scenarios.
We propose a novel method named LiteG2P which is fast, light and theoretically parallel.
arXiv Detail & Related papers (2023-03-02T09:16:21Z)
- Audio-Visual Efficient Conformer for Robust Speech Recognition [91.3755431537592]
We propose to improve the noise robustness of the recently proposed Efficient Conformer Connectionist Temporal Classification architecture by processing both audio and visual modalities.
Our experiments show that using both audio and visual modalities allows the model to better recognize speech in the presence of environmental noise and significantly accelerates training, reaching a lower WER with 4 times fewer training steps.
arXiv Detail & Related papers (2023-01-04T05:36:56Z)
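One way to read the audio-visual entry above is early fusion: per-frame audio and visual features are combined before a shared encoder. The sketch below uses simple concatenation, which is one common choice and not necessarily the paper's exact fusion; all dimensions are hypothetical.

```python
# Sketch: early fusion of audio and visual frame features before a shared
# (e.g., Conformer) encoder. Concatenation is one common fusion scheme.
import torch
import torch.nn as nn

audio_dim, visual_dim, model_dim = 80, 512, 256
fuse = nn.Linear(audio_dim + visual_dim, model_dim)

T, N = 100, 4
audio = torch.randn(T, N, audio_dim)     # e.g., filterbank frames
visual = torch.randn(T, N, visual_dim)   # e.g., lip-region embeddings
fused = torch.relu(fuse(torch.cat([audio, visual], dim=-1)))  # (T, N, D)
```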
- Speech recognition for air traffic control via feature learning and end-to-end training [8.755785876395363]
We propose a new automatic speech recognition (ASR) system based on feature learning and an end-to-end training procedure for air traffic control (ATC) systems.
The proposed model integrates the feature learning block, recurrent neural network (RNN), and connectionist temporal classification loss.
Thanks to the ability to learn representations from raw waveforms, the proposed model can be optimized in a complete end-to-end manner.
arXiv Detail & Related papers (2021-11-04T06:38:21Z)
- Neural Model Reprogramming with Similarity Based Mapping for Low-Resource Spoken Command Recognition [71.96870151495536]
We propose a novel adversarial reprogramming (AR) approach for low-resource spoken command recognition (SCR).
The AR procedure aims to modify the acoustic signals (from the target domain) to repurpose a pretrained SCR model.
We evaluate the proposed AR-SCR system on three low-resource SCR datasets, including Arabic, Lithuanian, and dysarthric Mandarin speech.
arXiv Detail & Related papers (2021-10-08T05:07:35Z)
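The reprogramming entry above adapts the input rather than the model: a trainable perturbation is added to the waveform while the pretrained recognizer stays frozen, and a fixed mapping sends source-class outputs to target labels. The sketch below is a toy version; the frozen model, perturbation shape, and label mapping are all hypothetical.

```python
# Sketch of adversarial reprogramming: learn an additive input perturbation
# for a frozen pretrained model; a fixed mapping reinterprets its outputs
# as target-domain labels.
import torch
import torch.nn as nn

wave_len, num_src_classes = 16000, 35
pretrained = nn.Sequential(nn.Flatten(), nn.Linear(wave_len, num_src_classes))
for p in pretrained.parameters():
    p.requires_grad_(False)                      # frozen SCR model

delta = nn.Parameter(torch.zeros(1, wave_len))   # trainable perturbation
label_map = torch.tensor([3, 7, 11])             # hypothetical source->target map

x = torch.randn(4, wave_len)                     # target-domain audio
logits = pretrained(x + delta)                   # reprogrammed input
target_logits = logits[:, label_map]             # 3 target-domain classes
loss = nn.functional.cross_entropy(target_logits, torch.tensor([0, 1, 2, 0]))
loss.backward()                                  # gradients reach only delta
```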
- Factorized Neural Transducer for Efficient Language Model Adaptation [51.81097243306204]
We propose a novel model, the factorized neural Transducer, obtained by factorizing the blank and vocabulary prediction.
It is expected that this factorization can transfer the improvement of the standalone language model to the Transducer for speech recognition.
We demonstrate that the proposed factorized neural Transducer yields 15% to 20% WER improvements when out-of-domain text data is used for language model adaptation.
arXiv Detail & Related papers (2021-09-27T15:04:00Z)
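A hedged sketch of the factorization idea in the entry above: the blank decision and the vocabulary prediction come from separate branches, so the vocabulary branch behaves like a standalone language model that could be adapted on text alone. This is a simplification of the actual transducer architecture; dimensions are hypothetical.

```python
# Sketch: factorized prediction in a transducer-style model. Index 0 of the
# combined scores is blank; the remaining scores come from an LM-like branch.
import torch
import torch.nn as nn

hidden, vocab = 256, 1000
blank_head = nn.Linear(hidden, 1)     # predicts the blank score only
vocab_lm = nn.Sequential(             # standalone LM-like vocabulary branch
    nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, vocab))

joint = torch.randn(4, hidden)        # stand-in for joint-network output
scores = torch.cat([blank_head(joint), vocab_lm(joint)], dim=-1)
log_probs = scores.log_softmax(dim=-1)
```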
- Speech Command Recognition in Computationally Constrained Environments with a Quadratic Self-organized Operational Layer [92.37382674655942]
We propose a network layer to enhance the speech command recognition capability of a lightweight network.
The employed method borrows the ideas of Taylor expansion and quadratic forms to construct a better representation of features in both input and hidden layers.
This richer representation results in recognition accuracy improvement as shown by extensive experiments on Google speech commands (GSC) and synthetic speech commands (SSC) datasets.
arXiv Detail & Related papers (2020-11-23T14:40:18Z)
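One hedged reading of the quadratic-layer entry above: augment a linear map with an elementwise second-order term, in the spirit of a truncated Taylor expansion. This is not the paper's exact self-organized operational layer, just an illustration of the quadratic-form idea; dimensions are hypothetical.

```python
# Sketch: a layer combining first-order (linear) and second-order
# (elementwise quadratic) terms, loosely following a Taylor expansion.
import torch
import torch.nn as nn

class QuadraticLayer(nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.lin1 = nn.Linear(d_in, d_out)              # first-order term
        self.lin2 = nn.Linear(d_in, d_out, bias=False)  # second-order term

    def forward(self, x):
        return self.lin1(x) + self.lin2(x * x)

layer = QuadraticLayer(40, 64)
out = layer(torch.randn(8, 40))   # (8, 64)
```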
- Deep Speaker Embeddings for Far-Field Speaker Recognition on Short Utterances [53.063441357826484]
Speaker recognition systems based on deep speaker embeddings have achieved significant performance in controlled conditions.
Speaker verification on short utterances in uncontrolled noisy environment conditions is one of the most challenging and highly demanded tasks.
This paper presents approaches aimed at two goals: a) improving the quality of far-field speaker verification systems in the presence of environmental noise and reverberation, and b) reducing system quality degradation for short utterances.
arXiv Detail & Related papers (2020-02-14T13:34:33Z)