BERT Meets CTC: New Formulation of End-to-End Speech Recognition with
Pre-trained Masked Language Model
- URL: http://arxiv.org/abs/2210.16663v2
- Date: Thu, 20 Apr 2023 01:23:54 GMT
- Title: BERT Meets CTC: New Formulation of End-to-End Speech Recognition with
Pre-trained Masked Language Model
- Authors: Yosuke Higuchi, Brian Yan, Siddhant Arora, Tetsuji Ogawa, Tetsunori
Kobayashi, Shinji Watanabe
- Abstract summary: BERT-CTC is a novel formulation of end-to-end speech recognition.
It incorporates linguistic knowledge through the explicit output dependency obtained by BERT contextual embedding.
BERT-CTC improves over conventional approaches across variations in speaking styles and languages.
- Score: 40.16332045057132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents BERT-CTC, a novel formulation of end-to-end speech
recognition that adapts BERT for connectionist temporal classification (CTC).
Our formulation relaxes the conditional independence assumptions used in
conventional CTC and incorporates linguistic knowledge through the explicit
output dependency obtained by BERT contextual embedding. BERT-CTC attends to
the full contexts of the input and hypothesized output sequences via the
self-attention mechanism. This mechanism encourages a model to learn
inner/inter-dependencies between the audio and token representations while
maintaining CTC's training efficiency. During inference, BERT-CTC combines a
mask-predict algorithm with CTC decoding, which iteratively refines an output
sequence. The experimental results reveal that BERT-CTC improves over
conventional approaches across variations in speaking styles and languages.
Finally, we show that the semantic representations in BERT-CTC are beneficial
towards downstream spoken language understanding tasks.
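To make the described inference procedure concrete, below is a minimal sketch of the mask-predict plus CTC decoding loop from the abstract. The model interface (`ctc_posteriors`), the MASK/blank ids, the confidence heuristic, and the re-masking schedule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of BERT-CTC-style inference: mask-predict iterations wrapped
# around greedy CTC decoding. `model.ctc_posteriors`, MASK_ID/BLANK_ID, and the
# re-masking schedule are hypothetical placeholders, not the authors' code.
MASK_ID, BLANK_ID = 103, 0  # assumed special-token ids

def greedy_ctc_collapse(log_probs):
    """log_probs: (T, vocab) tensor of frame-level CTC log-posteriors.
    Greedy decoding: best label per frame, merge repeats, drop blanks.
    Returns token ids and a per-token confidence (log-prob of the kept frame)."""
    best = log_probs.argmax(dim=-1)  # (T,)
    tokens, confs, prev = [], [], None
    for t, k in enumerate(best.tolist()):
        if k != BLANK_ID and k != prev:
            tokens.append(k)
            confs.append(log_probs[t, k].item())
        prev = k
    return tokens, confs

def bert_ctc_decode(model, speech, num_iters=4):
    """Iteratively refine the hypothesis: condition on the audio plus the partially
    masked token sequence, re-decode with CTC, then re-mask low-confidence tokens."""
    hyp = []  # simplified initialization (the paper starts from masked tokens)
    for i in range(num_iters):
        # Assumed interface: CTC log-posteriors conditioned on speech and hypothesis.
        log_probs = model.ctc_posteriors(speech, hyp)  # (T, vocab)
        hyp, confs = greedy_ctc_collapse(log_probs)
        if i == num_iters - 1 or not hyp:
            break
        # Re-mask a shrinking fraction of the least confident tokens.
        n_mask = int(len(hyp) * (num_iters - 1 - i) / num_iters)
        for idx in sorted(range(len(hyp)), key=lambda j: confs[j])[:n_mask]:
            hyp[idx] = MASK_ID
    return hyp
```

Each pass attends to the whole partially masked hypothesis, so later iterations can correct tokens that earlier ones were unsure about.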
Related papers
- CR-CTC: Consistency regularization on CTC for improved speech recognition [18.996929774821822]
Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR).
However, it often falls short in recognition performance compared to transducer models or systems combining CTC with an attention-based encoder-decoder (CTC/AED).
We propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram.
arXiv Detail & Related papers (2024-10-07T14:56:07Z)
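A minimal sketch of the consistency-regularization idea above, assuming a PyTorch-style CTC model whose output length matches the given frame lengths and a length-preserving augmentation such as SpecAugment; `model`, `augment`, and `alpha` are placeholders rather than the exact CR-CTC recipe.

```python
# Minimal sketch of CTC consistency regularization: the same utterance is
# augmented twice, both views pass through one CTC model, and a symmetric KL
# term pulls the two frame-level CTC distributions together.
import torch.nn.functional as F

def cr_ctc_loss(model, augment, feats, feat_lens, targets, target_lens, alpha=0.2):
    log_p1 = model(augment(feats))  # (T, B, V) log-posteriors, augmented view 1
    log_p2 = model(augment(feats))  # (T, B, V) log-posteriors, augmented view 2
    ctc = (F.ctc_loss(log_p1, targets, feat_lens, target_lens) +
           F.ctc_loss(log_p2, targets, feat_lens, target_lens))
    # Symmetric KL between the two frame-level CTC distributions.
    kl = (F.kl_div(log_p1, log_p2, log_target=True, reduction="batchmean") +
          F.kl_div(log_p2, log_p1, log_target=True, reduction="batchmean"))
    return ctc + alpha * kl
```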
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
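A toy illustration of the word-spotting mechanism, under strong simplifying assumptions: biasing phrases are matched against the greedily collapsed CTC output rather than the compact context graph over CTC log-probabilities used in the paper.

```python
# Toy word spotter: scan the collapsed CTC output for biasing phrases and score
# each hit by its accumulated log-probability. Only the core idea is conveyed;
# the real method matches against a context graph on the full CTC lattice.
def spot_phrases(ctc_tokens, ctc_logprobs, phrases):
    """ctc_tokens: collapsed token ids; ctc_logprobs: per-token log-probs;
    phrases: dict mapping a phrase name to a tuple of token ids."""
    candidates = []
    for name, phrase in phrases.items():
        n = len(phrase)
        for start in range(len(ctc_tokens) - n + 1):
            if tuple(ctc_tokens[start:start + n]) == phrase:
                score = sum(ctc_logprobs[start:start + n])
                candidates.append((name, start, score))
    return candidates
```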
- Bridging the Gaps of Both Modality and Language: Synchronous Bilingual CTC for Speech Translation and Speech Recognition [46.41096278421193]
BiL-CTC+ bridges the gap between audio and text as well as between source and target languages.
Our method also yields significant improvements in speech recognition performance.
arXiv Detail & Related papers (2023-09-21T16:28:42Z)
- Unimodal Aggregation for CTC-based Speech Recognition [7.6112706449833505]
A unimodal aggregation (UMA) is proposed to segment and integrate the feature frames that belong to the same text token.
UMA learns better feature representations and shortens the sequence length, resulting in lower recognition error and computational complexity.
arXiv Detail & Related papers (2023-09-15T04:34:40Z)
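A small sketch of one plausible reading of the unimodal aggregation above: a positive scalar weight is predicted per frame, segment boundaries are placed at local minima of that weight, and frames inside each segment are merged by weighted averaging. The segmentation rule and the normalization are assumptions.

```python
# Sketch of unimodal-aggregation-style frame merging under assumed details.
import torch

def unimodal_aggregate(frames, weights):
    """frames: (T, D) encoder outputs; weights: (T,) positive per-frame scalars."""
    T = frames.size(0)
    bounds = [0]
    for t in range(1, T - 1):
        if weights[t] <= weights[t - 1] and weights[t] < weights[t + 1]:
            bounds.append(t)  # valley of the weight curve starts a new segment
    bounds.append(T)
    segments = []
    for s, e in zip(bounds[:-1], bounds[1:]):
        w = weights[s:e] / weights[s:e].sum()        # normalize within the segment
        segments.append((w.unsqueeze(1) * frames[s:e]).sum(dim=0))
    return torch.stack(segments)                     # (num_segments, D), shorter than T
```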
- CTC-based Non-autoregressive Speech Translation [51.37920141751813]
We investigate the potential of connectionist temporal classification for non-autoregressive speech translation.
We develop a model consisting of two encoders that are guided by CTC to predict the source and target texts.
Experiments on the MuST-C benchmarks show that our NAST model achieves an average BLEU score of 29.5 with a speed-up of 5.67×.
arXiv Detail & Related papers (2023-05-27T03:54:09Z)
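A rough sketch of the two-encoder training objective mentioned above, assuming PyTorch modules: the acoustic encoder is supervised with a CTC loss on the source text and a stacked encoder with a CTC loss on the target text. Module names and the equal weighting of the two terms are placeholders, not the paper's exact architecture.

```python
# Dual-CTC sketch for non-autoregressive speech translation with two encoders.
import torch.nn.functional as F

def nast_dual_ctc_loss(src_encoder, tgt_encoder, src_head, tgt_head,
                       speech, frame_lens, src_tokens, src_lens, tgt_tokens, tgt_lens):
    h_src = src_encoder(speech)                      # (T, B, D) acoustic states
    h_tgt = tgt_encoder(h_src)                       # (T, B, D) "textual" states
    loss_src = F.ctc_loss(src_head(h_src).log_softmax(-1),
                          src_tokens, frame_lens, src_lens)
    loss_tgt = F.ctc_loss(tgt_head(h_tgt).log_softmax(-1),
                          tgt_tokens, frame_lens, tgt_lens)
    return loss_src + loss_tgt
```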
- CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
arXiv Detail & Related papers (2022-10-11T07:13:50Z)
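The joint CTC/attention framework referenced above trains with an interpolation of the CTC loss and the attention decoder's cross-entropy. The sketch below shows this standard objective; the 0.3 weight is a common illustrative value and padding/end-of-sentence handling is omitted.

```python
# Standard joint CTC/attention training objective.
import torch.nn.functional as F

def joint_ctc_attention_loss(ctc_log_probs, frame_lens, decoder_logits,
                             targets, target_lens, ctc_weight=0.3):
    loss_ctc = F.ctc_loss(ctc_log_probs, targets, frame_lens, target_lens)
    # decoder_logits: (B, L, V); cross_entropy expects (B, V, L).
    loss_att = F.cross_entropy(decoder_logits.transpose(1, 2), targets)
    return ctc_weight * loss_ctc + (1.0 - ctc_weight) * loss_att
```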
- Distilling the Knowledge of BERT for CTC-based ASR [38.345330002791606]
We propose to distill the knowledge of BERT for CTC-based ASR.
CTC-based ASR learns the knowledge of BERT during training and does not use BERT during testing.
We show that our method improves the performance of CTC-based ASR without sacrificing inference speed.
arXiv Detail & Related papers (2022-09-05T16:08:35Z)
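A generic sketch of the training-time-only distillation idea above: the usual CTC loss is combined with a KL term that pulls token-level student predictions toward BERT's soft distributions over the transcript, so BERT is not needed at test time. The auxiliary token-level student head and the `beta` weight are assumptions; the paper's exact formulation may differ.

```python
# Generic sketch of BERT-to-CTC knowledge distillation at training time only.
import torch.nn.functional as F

def bert_distilled_ctc_loss(ctc_log_probs, frame_lens, targets, target_lens,
                            student_token_logits, bert_soft_targets, beta=1.0):
    """student_token_logits: (B, L, V) token-level predictions from the ASR model;
    bert_soft_targets: (B, L, V) teacher probabilities from BERT (detached)."""
    loss_ctc = F.ctc_loss(ctc_log_probs, targets, frame_lens, target_lens)
    loss_kd = F.kl_div(student_token_logits.log_softmax(-1),
                       bert_soft_targets, reduction="batchmean")
    return loss_ctc + beta * loss_kd
```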
- A Study on Effects of Implicit and Explicit Language Model Information for DBLSTM-CTC Based Handwriting Recognition [51.36957172200015]
We study the effects of implicit and explicit language model information for DBLSTM-CTC based handwriting recognition.
Even when one million lines of training sentences are used to train the DBLSTM, an explicit language model is still helpful.
arXiv Detail & Related papers (2020-07-31T08:23:37Z)
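One common way to combine a CTC recognizer with an explicit language model is n-best rescoring; the toy sketch below interpolates the CTC score with a weighted LM score and a length bonus. The weights and the `lm_logprob` function are illustrative and do not reproduce the paper's DBLSTM or decoding setup.

```python
# Toy n-best rescoring with an explicit language model.
def rescore_with_lm(nbest, lm_logprob, lm_weight=0.5, length_bonus=0.1):
    """nbest: list of (tokens, ctc_logprob) pairs; lm_logprob: callable on tokens."""
    scored = [(toks, ctc + lm_weight * lm_logprob(toks) + length_bonus * len(toks))
              for toks, ctc in nbest]
    return max(scored, key=lambda item: item[1])[0]
```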
This list is automatically generated from the titles and abstracts of the papers on this site.