Easter2.0: Improving convolutional models for handwritten text
recognition
- URL: http://arxiv.org/abs/2205.14879v1
- Date: Mon, 30 May 2022 06:33:15 GMT
- Title: Easter2.0: Improving convolutional models for handwritten text
recognition
- Authors: Kartik Chaudhary, Raghav Bali
- Abstract summary: We propose a CNN based architecture that bridges this gap.
Easter2.0 is composed of multiple layers of 1D Convolution, Batch Normalization, ReLU, Dropout, Dense Residual connection, Squeeze-and-Excitation module.
Our work achieves state-of-the-art results on IAM handwriting database when trained using only publicly available training data.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolutional Neural Networks (CNN) have shown promising results for the task
of Handwritten Text Recognition (HTR) but they still fall behind Recurrent
Neural Networks (RNNs)/Transformer based models in terms of performance. In
this paper, we propose a CNN based architecture that bridges this gap. Our
work, Easter2.0, is composed of multiple layers of 1D Convolution, Batch
Normalization, ReLU, Dropout, Dense Residual connection, Squeeze-and-Excitation
module and make use of Connectionist Temporal Classification (CTC) loss. In
addition to the Easter2.0 architecture, we propose a simple and effective data
augmentation technique 'Tiling and Corruption (TACO)' relevant for the task of
HTR/OCR. Our work achieves state-of-the-art results on IAM handwriting database
when trained using only publicly available training data. In our experiments,
we also present the impact of TACO augmentations and Squeeze-and-Excitation
(SE) on text recognition accuracy. We further show that Easter2.0 is suitable
for few-shot learning tasks and outperforms current best methods including
Transformers when trained on limited amount of annotated data. Code and model
is available at: https://github.com/kartikgill/Easter2
Related papers
- NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an expert'' of speech-to-text, language-to-text and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - Contrastive Transformer Learning with Proximity Data Generation for
Text-Based Person Search [60.626459715780605]
Given a descriptive text query, text-based person search aims to retrieve the best-matched target person from an image gallery.
Such a cross-modal retrieval task is quite challenging due to significant modality gap, fine-grained differences and insufficiency of annotated data.
In this paper, we propose a simple yet effective dual Transformer model for text-based person search.
arXiv Detail & Related papers (2023-11-15T16:26:49Z) - TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at
Scale [59.01246141215051]
We analyze the factor that leads to degradation from the perspective of language supervision.
We propose a tunable-free pre-training strategy to retain the generalization ability of the text encoder.
We produce a series of models, dubbed TVTSv2, with up to one billion parameters.
arXiv Detail & Related papers (2023-05-23T15:44:56Z) - A Likelihood Ratio based Domain Adaptation Method for E2E Models [10.510472957585646]
End-to-end (E2E) automatic speech recognition models like Recurrent Neural Networks Transducer (RNN-T) are becoming a popular choice for streaming ASR applications like voice assistants.
While E2E models are very effective at learning representation of the training data they are trained on, their accuracy on unseen domains remains a challenging problem.
In this work, we explore a contextual biasing approach using likelihood-ratio that leverages text data sources to adapt RNN-T model to new domains and entities.
arXiv Detail & Related papers (2022-01-10T21:22:39Z) - Handwritten text generation and strikethrough characters augmentation [0.04893345190925178]
We introduce two data augmentation techniques, which, used with a Resnet-BiLSTM-CTC network, significantly reduce Word Error Rate (WER) and Character Error Rate (CER)
We apply a novel augmentation that simulates strikethrough text (HandWritten Blots) and a handwritten text generation method based on printed text (StackMix)
Experiments on ten handwritten text datasets show that HandWritten Blots augmentation and StackMix significantly improve the quality of HTR models.
arXiv Detail & Related papers (2021-12-14T13:41:10Z) - On Addressing Practical Challenges for RNN-Transduce [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time stamping method can get less than 50ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z) - Train your classifier first: Cascade Neural Networks Training from upper
layers to lower layers [54.47911829539919]
We develop a novel top-down training method which can be viewed as an algorithm for searching for high-quality classifiers.
We tested this method on automatic speech recognition (ASR) tasks and language modelling tasks.
The proposed method consistently improves recurrent neural network ASR models on Wall Street Journal, self-attention ASR models on Switchboard, and AWD-LSTM language models on WikiText-2.
arXiv Detail & Related papers (2021-02-09T08:19:49Z) - EASTER: Efficient and Scalable Text Recognizer [0.0]
We present an Efficient And Scalable TExt Recognizer (EASTER) to perform optical character recognition on both machine printed and handwritten text.
Our model utilise 1-D convolutional layers without any recurrence which enables parallel training with considerably less volume of data.
We also showcase improvements over the current best results on offline handwritten text recognition task.
arXiv Detail & Related papers (2020-08-18T10:26:03Z) - Passive Batch Injection Training Technique: Boosting Network Performance
by Injecting Mini-Batches from a different Data Distribution [39.8046809855363]
This work presents a novel training technique for deep neural networks that makes use of additional data from a distribution that is different from that of the original input data.
To the best of our knowledge, this is the first work that makes use of different data distribution to aid the training of convolutional neural networks (CNNs)
arXiv Detail & Related papers (2020-06-08T08:17:32Z) - Lipreading using Temporal Convolutional Networks [57.41253104365274]
Current model for recognition of isolated words in-the-wild consists of a residual network and Bi-directional Gated Recurrent Unit layers.
We address the limitations of this model and we propose changes which further improve its performance.
Our proposed model results in an absolute improvement of 1.2% and 3.2%, respectively, in these datasets.
arXiv Detail & Related papers (2020-01-23T17:49:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.