Rethinking Text Line Recognition Models
- URL: http://arxiv.org/abs/2104.07787v1
- Date: Thu, 15 Apr 2021 21:43:13 GMT
- Title: Rethinking Text Line Recognition Models
- Authors: Daniel Hernandez Diaz, Siyang Qin, Reeve Ingle, Yasuhisa Fujii,
Alessandro Bissacco
- Abstract summary: We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
- Score: 57.47147190119394
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study the problem of text line recognition. Unlike most
approaches targeting specific domains such as scene-text or handwritten
documents, we investigate the general problem of developing a universal
architecture that can extract text from any image, regardless of source or
input modality. We consider two decoder families (Connectionist Temporal
Classification and Transformer) and three encoder modules (Bidirectional LSTMs,
Self-Attention, and GRCLs), and conduct extensive experiments to compare their
accuracy and performance on widely used public datasets of scene and
handwritten text. We find that a combination that so far has received little
attention in the literature, namely a Self-Attention encoder coupled with the
CTC decoder, when compounded with an external language model and trained on
both public and internal data, outperforms all the others in accuracy and
computational complexity. Unlike the more common Transformer-based models, this
architecture can handle inputs of arbitrary length, a requirement for universal
line recognition. Using an internal dataset collected from multiple sources, we
also expose the limitations of current public datasets in evaluating the
accuracy of line recognizers, as the relatively narrow image width and sequence
length distributions do not allow us to observe the quality degradation of the
Transformer approach when applied to the transcription of long lines.
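The best-performing combination described above, a Self-Attention encoder feeding a CTC decoder, can be sketched in a few lines of PyTorch. This is only an illustrative outline under assumed hyperparameters (feature dimension, layer count, vocabulary size); it omits the convolutional backbone, positional encodings, and the external language model mentioned in the abstract, and the class and variable names are hypothetical rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class SelfAttnCTCRecognizer(nn.Module):
    """Illustrative text line recognizer: self-attention encoder with a CTC output head."""

    def __init__(self, feat_dim=128, d_model=256, nhead=4, num_layers=6, num_classes=100):
        super().__init__()
        # Per-frame visual features (e.g. from a small CNN backbone, omitted here) -> model dim.
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, num_classes + 1)  # +1 for the CTC blank symbol

    def forward(self, frames):                  # frames: (batch, time, feat_dim)
        x = self.encoder(self.proj(frames))     # self-attention over the frame sequence
        return self.head(x).log_softmax(-1)     # per-frame log-probs, as required by CTC

# One training step with CTC loss (blank index = num_classes); shapes are illustrative.
model = SelfAttnCTCRecognizer()
ctc_loss = nn.CTCLoss(blank=100, zero_infinity=True)
frames = torch.randn(2, 120, 128)               # two padded variable-width lines
targets = torch.randint(0, 100, (2, 20))        # integer-encoded transcriptions (no blanks)
log_probs = model(frames).transpose(0, 1)       # CTCLoss expects (time, batch, classes)
loss = ctc_loss(log_probs, targets,
                torch.tensor([120, 95]),        # valid frames per line
                torch.tensor([20, 17]))         # valid labels per line
loss.backward()
```

Because the CTC head emits one prediction per encoder frame and decoding is non-autoregressive, nothing in this sketch ties the model to a fixed input width, which is the property the abstract highlights for universal line recognition.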
Related papers
- General Detection-based Text Line Recognition [15.761142324480165]
We introduce a general detection-based approach to text line recognition, be it printed (OCR) or handwritten (HTR).
Our approach builds on a completely different paradigm than state-of-the-art HTR methods, which rely on autoregressive decoding.
We improve state-of-the-art performances for Chinese script recognition on the CASIA v2 dataset, and for cipher recognition on the Borg and Copiale datasets.
arXiv Detail & Related papers (2024-09-25T17:05:55Z)
- UNIT: Unifying Image and Text Recognition in One Vision Encoder [51.140564856352825]
UNIT is a novel training framework aimed at UNifying Image and Text recognition within a single model.
We show that UNIT significantly outperforms existing methods on document-related tasks.
Notably, UNIT retains the original vision encoder architecture, making it cost-free in terms of inference and deployment.
arXiv Detail & Related papers (2024-09-06T08:02:43Z)
- Towards Unified Multi-granularity Text Detection with Interactive Attention [56.79437272168507]
"Detect Any Text" is an advanced paradigm that unifies scene text detection, layout analysis, and document page detection into a cohesive, end-to-end model.
A pivotal innovation in DAT is the across-granularity interactive attention module, which significantly enhances the representation learning of text instances.
Tests demonstrate that DAT achieves state-of-the-art performances across a variety of text-related benchmarks.
arXiv Detail & Related papers (2024-05-30T07:25:23Z)
- Character Queries: A Transformer-based Approach to On-Line Handwritten Character Segmentation [4.128716153761773]
We focus on the scenario where the transcription is known beforehand, in which case the character segmentation becomes an assignment problem.
Inspired by the $k$-means clustering algorithm, we view it from the perspective of cluster assignment and present a Transformer-based architecture.
In order to assess the quality of our approach, we create character segmentation ground truths for two popular on-line handwriting datasets.
arXiv Detail & Related papers (2023-09-06T15:19:04Z)
- Optimizing Factual Accuracy in Text Generation through Dynamic Knowledge Selection [71.20871905457174]
Language models (LMs) have revolutionized the way we interact with information, but they often generate nonfactual text.
Previous methods use external knowledge as references for text generation to enhance factuality but often struggle with the knowledge mix-up of irrelevant references.
We present DKGen, which divides text generation into an iterative process.
arXiv Detail & Related papers (2023-08-30T02:22:40Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- PARAGRAPH2GRAPH: A GNN-based framework for layout paragraph analysis [6.155943751502232]
We present a language-independent graph neural network (GNN)-based model that achieves competitive results on common document layout datasets.
Our model is suitable for industrial applications, particularly in multi-language scenarios.
arXiv Detail & Related papers (2023-04-24T03:54:48Z)
- TRIG: Transformer-Based Text Recognizer with Initial Embedding Guidance [15.72669617789124]
Scene text recognition (STR) is an important bridge between images and text.
Recent methods use a frozen initial embedding to guide the decoder to decode the features to text, leading to a loss of accuracy.
We propose a novel architecture for text recognition, named TRansformer-based text recognizer with Initial embedding Guidance (TRIG).
arXiv Detail & Related papers (2021-11-16T09:10:39Z)
- TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models [47.48019831416665]
We propose an end-to-end text recognition approach with pre-trained image Transformer and text Transformer models, namely TrOCR.
TrOCR is simple but effective, and can be pre-trained with large-scale synthetic data and fine-tuned with human-labeled datasets.
Experiments show that the TrOCR model outperforms the current state-of-the-art models on both printed and handwritten text recognition tasks.
arXiv Detail & Related papers (2021-09-21T16:01:56Z)
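For the TrOCR entry above, the released checkpoints can be exercised through the Hugging Face transformers library. The snippet below is a minimal usage sketch rather than the paper's training pipeline; the specific checkpoint name and the input file path are assumptions.

```python
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# Load a released TrOCR checkpoint (checkpoint name assumed; printed-text
# variants follow the same processor/model pattern).
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# Recognize a single cropped text-line image ("line.png" is a placeholder path).
image = Image.open("line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```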
This list is automatically generated from the titles and abstracts of the papers on this site.