GTC: Guided Training of CTC Towards Efficient and Accurate Scene Text
Recognition
- URL: http://arxiv.org/abs/2002.01276v1
- Date: Tue, 4 Feb 2020 13:26:14 GMT
- Title: GTC: Guided Training of CTC Towards Efficient and Accurate Scene Text
Recognition
- Authors: Wenyang Hu, Xiaocong Cai, Jun Hou, Shuai Yi, Zhiping Lin
- Abstract summary: We propose the guided training of CTC model, where CTC model learns a better alignment and feature representations from a more powerful attentional guidance.
With the benefit of guided training, CTC model achieves robust and accurate prediction for both regular and irregular scene text.
To further leverage the potential of CTC decoder, a graph convolutional network (GCN) is proposed to learn the local correlations of extracted features.
- Score: 27.38969404322089
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Connectionist Temporal Classification (CTC) and attention mechanism are two
main approaches used in recent scene text recognition works. Compared with
attention-based methods, CTC decoder has a much shorter inference time, yet a
lower accuracy. To design an efficient and effective model, we propose the
guided training of CTC (GTC), where CTC model learns a better alignment and
feature representations from a more powerful attentional guidance. With the
benefit of guided training, CTC model achieves robust and accurate prediction
for both regular and irregular scene text while maintaining a fast inference
speed. Moreover, to further leverage the potential of CTC decoder, a graph
convolutional network (GCN) is proposed to learn the local correlations of
extracted features. Extensive experiments on standard benchmarks demonstrate
that our end-to-end model achieves a new state-of-the-art for regular and
irregular scene text recognition while requiring an inference time roughly 6 times
shorter than that of attention-based methods.
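To make the training recipe described in the abstract concrete, below is a minimal, hedged sketch of the guided-training idea: a CTC head and an attention-style guidance decoder share one encoder and are trained jointly, so gradients from the guidance branch shape the shared features, while only the CTC branch is kept at inference time. The function name, tensor layouts, and the weight `lam` are illustrative assumptions, not the paper's exact formulation, and the GCN module is omitted.

```python
import torch
import torch.nn.functional as F

def guided_ctc_loss(ctc_log_probs, attn_logits, targets, input_lengths,
                    target_lengths, blank=0, lam=0.2):
    """Joint objective for a shared encoder with two heads (illustrative sketch).

    ctc_log_probs: (T, B, C) log-probabilities from the CTC head.
    attn_logits:   (B, L, C) per-token logits from the guidance decoder,
                   produced with teacher forcing so step i predicts target i.
    targets:       (B, L) target indices, padded with -100.
    `lam` is an assumed weighting, not the paper's setting.
    """
    # CTC expects a 1-D concatenation of the unpadded target sequences.
    flat_targets = torch.cat([t[:l] for t, l in zip(targets, target_lengths)])
    ctc = F.ctc_loss(ctc_log_probs, flat_targets, input_lengths, target_lengths,
                     blank=blank, zero_infinity=True)
    # Cross-entropy on the guidance branch; padded positions are ignored.
    ce = F.cross_entropy(attn_logits.transpose(1, 2), targets, ignore_index=-100)
    # Both terms back-propagate into the shared encoder, which is how the
    # stronger guidance branch improves the features seen by the CTC head.
    return lam * ctc + (1.0 - lam) * ce
```

At test time only the CTC head would be decoded, which is where the reported speed advantage over attention decoding comes from.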
Related papers
- CR-CTC: Consistency regularization on CTC for improved speech recognition [18.996929774821822]
Connectionist Temporal Classification (CTC) is a widely used method for automatic speech recognition (ASR).
However, it often falls short in recognition performance compared to transducer models or systems that combine CTC with an attention-based encoder-decoder (CTC/AED).
We propose the Consistency-Regularized CTC (CR-CTC), which enforces consistency between two CTC distributions obtained from different augmented views of the input speech mel-spectrogram (a minimal sketch of this loss follows the entry).
arXiv Detail & Related papers (2024-10-07T14:56:07Z)
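A minimal sketch of the consistency-regularization idea described above, assuming the model produces frame-level CTC log-probabilities for two augmented views of the same batch; the symmetric-KL form and the weight `alpha` are assumptions, not CR-CTC's exact recipe.

```python
import torch
import torch.nn.functional as F

def cr_ctc_loss(log_probs_a, log_probs_b, targets, input_lengths, target_lengths,
                blank=0, alpha=0.2):
    """Consistency-regularized CTC (sketch).

    log_probs_a / log_probs_b: (T, B, C) CTC log-probabilities from the same model
    applied to two differently augmented views of the same utterances.
    targets: 1-D concatenation of the unpadded target sequences.
    """
    ctc_a = F.ctc_loss(log_probs_a, targets, input_lengths, target_lengths,
                       blank=blank, zero_infinity=True)
    ctc_b = F.ctc_loss(log_probs_b, targets, input_lengths, target_lengths,
                       blank=blank, zero_infinity=True)
    # Symmetric frame-level KL between the two posteriors; each view's
    # prediction is treated as a detached target for the other.
    kl_ab = F.kl_div(log_probs_a, log_probs_b.detach(), reduction="batchmean",
                     log_target=True)
    kl_ba = F.kl_div(log_probs_b, log_probs_a.detach(), reduction="batchmean",
                     log_target=True)
    return 0.5 * (ctc_a + ctc_b) + alpha * 0.5 * (kl_ab + kl_ba)
```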
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- Self-distillation Regularized Connectionist Temporal Classification Loss for Text Recognition: A Simple Yet Effective Approach [14.69981874614434]
We show how to better optimize a text recognition model from the perspective of loss functions.
CTC-based methods, widely used in practice due to their good balance between performance and inference speed, still grapple with accuracy degradation.
We propose a self-distillation scheme for CTC-based models to address this issue (one possible instantiation is sketched after this entry).
arXiv Detail & Related papers (2023-08-17T06:32:57Z)
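The abstract above does not spell out the distillation scheme, so the sketch below shows just one common way to realize self-distillation for a CTC model: an EMA copy of the student serves as its own teacher and provides framewise soft targets via a KL term. The class, the EMA decay, and the weight `beta` are assumptions for illustration, not the paper's method; the student model is assumed to return (B, T, C) logits.

```python
import copy
import torch
import torch.nn.functional as F

class SelfDistilledCTC:
    """One possible self-distillation setup for a CTC model (an assumption)."""

    def __init__(self, student, ema_decay=0.999, beta=0.1):
        self.student = student
        self.teacher = copy.deepcopy(student)      # EMA "self" teacher
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.ema_decay = ema_decay
        self.beta = beta

    @torch.no_grad()
    def update_teacher(self):
        # Exponential moving average of the student's weights.
        for ps, pt in zip(self.student.parameters(), self.teacher.parameters()):
            pt.mul_(self.ema_decay).add_(ps, alpha=1.0 - self.ema_decay)

    def loss(self, images, targets, input_lengths, target_lengths, blank=0):
        # targets: 1-D concatenation of the unpadded label sequences.
        log_probs = self.student(images).log_softmax(-1).transpose(0, 1)  # (T, B, C)
        with torch.no_grad():
            teacher_log_probs = self.teacher(images).log_softmax(-1).transpose(0, 1)
        ctc = F.ctc_loss(log_probs, targets, input_lengths, target_lengths,
                         blank=blank, zero_infinity=True)
        # Distill the teacher's framewise posterior into the student.
        distill = F.kl_div(log_probs, teacher_log_probs, reduction="batchmean",
                           log_target=True)
        return ctc + self.beta * distill
```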
- Improving CTC-AED model with integrated-CTC and auxiliary loss regularization [6.214966465876013]
Joint training of connectionist temporal classification (CTC) and an attention-based encoder-decoder (AED) has been widely applied in automatic speech recognition (ASR).
In this paper, we employ two fusion methods, namely direct addition of logits (DAL) and preserving the maximum probability (PMP).
We achieve dimensional consistency by adaptively affine-transforming the attention results to match the dimensions of CTC (both fusions are sketched after this entry).
arXiv Detail & Related papers (2023-08-15T03:31:47Z)
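A hedged sketch of the two fusion operations named above, applied to CTC and attention logits that have already been mapped to a common shape. The affine alignment module and, in particular, the reading of PMP as a per-frame peak-probability selection are assumptions inferred from the abstract, not the paper's definitions.

```python
import torch
import torch.nn as nn

class AffineAlign(nn.Module):
    """Learned affine map projecting attention-decoder outputs onto the CTC
    branch's dimensionality (the abstract's 'adaptive affine transform')."""
    def __init__(self, attn_dim, ctc_dim):
        super().__init__()
        self.proj = nn.Linear(attn_dim, ctc_dim)

    def forward(self, attn_states):          # (B, T, attn_dim) -> (B, T, ctc_dim)
        return self.proj(attn_states)

def fuse_dal(ctc_logits, attn_logits):
    """Direct addition of logits (DAL): element-wise sum of the two branches."""
    return ctc_logits + attn_logits

def fuse_pmp(ctc_logits, attn_logits):
    """'Preserving the maximum probability' (PMP), read here as: for each frame,
    keep the branch whose distribution has the higher peak probability."""
    ctc_peak = ctc_logits.softmax(-1).amax(-1, keepdim=True)    # (B, T, 1)
    attn_peak = attn_logits.softmax(-1).amax(-1, keepdim=True)
    return torch.where(ctc_peak >= attn_peak, ctc_logits, attn_logits)
```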
- Scalable Learning of Latent Language Structure With Logical Offline Cycle Consistency [71.42261918225773]
Conceptually, LOCCO can be viewed as a form of self-learning where the semantic parser being trained is used to generate annotations for unlabeled text.
As an added bonus, the annotations produced by LOCCO can be trivially repurposed to train a neural text generation model.
arXiv Detail & Related papers (2023-05-31T16:47:20Z)
- Turning a CLIP Model into a Scene Text Detector [56.86413150091367]
Recently, pretraining approaches based on vision-language models have made effective progress in the field of text detection.
This paper proposes a new method, termed TCM, focused on Turning the CLIP Model directly into a text detector without a pretraining process.
arXiv Detail & Related papers (2023-02-28T06:06:12Z)
- CTC Alignments Improve Autoregressive Translation [145.90587287444976]
We argue that CTC does in fact make sense for translation if applied in a joint CTC/attention framework.
Our proposed joint CTC/attention models outperform pure-attention baselines across six benchmark translation tasks.
arXiv Detail & Related papers (2022-10-11T07:13:50Z)
- Distilling the Knowledge of BERT for CTC-based ASR [38.345330002791606]
We propose to distill the knowledge of BERT for CTC-based ASR.
CTC-based ASR learns the knowledge of BERT during training and does not use BERT during testing.
We show that our method improves the performance of CTC-based ASR without sacrificing inference speed.
arXiv Detail & Related papers (2022-09-05T16:08:35Z)
- Supervision-Guided Codebooks for Masked Prediction in Speech Pre-training [102.14558233502514]
Masked prediction pre-training has seen remarkable progress in self-supervised learning (SSL) for speech recognition.
We propose two supervision-guided codebook generation approaches to improve automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2022-06-21T06:08:30Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus (a minimal sketch of the auxiliary loss follows this entry).
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
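A minimal sketch of the intermediate-loss idea described above, assuming the encoder exposes CTC log-probabilities from both its final layer and one intermediate layer; the mixing weight `w` is an assumption.

```python
import torch
import torch.nn.functional as F

def intermediate_ctc_loss(final_log_probs, inter_log_probs, targets,
                          input_lengths, target_lengths, blank=0, w=0.3):
    """Auxiliary (intermediate) CTC regularization, sketched.

    final_log_probs / inter_log_probs: (T, B, C) log-probabilities from the
    final and an intermediate encoder layer, each with its own output head.
    targets: 1-D concatenation of the unpadded target sequences.
    """
    main = F.ctc_loss(final_log_probs, targets, input_lengths, target_lengths,
                      blank=blank, zero_infinity=True)
    aux = F.ctc_loss(inter_log_probs, targets, input_lengths, target_lengths,
                     blank=blank, zero_infinity=True)
    # Interpolate the main and auxiliary CTC terms.
    return (1.0 - w) * main + w * aux
```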
- Focus on the present: a regularization method for the ASR source-target attention layer [45.73441417132897]
This paper introduces a novel method to diagnose the source-target attention in state-of-the-art end-to-end speech recognition models.
Our method is based on the fact that both CTC and source-target attention act on the same encoder representations.
We found that the source-target attention heads are able to predict several tokens ahead of the current one.
arXiv Detail & Related papers (2020-11-02T18:56:33Z)