DAN: a Segmentation-free Document Attention Network for Handwritten
Document Recognition
- URL: http://arxiv.org/abs/2203.12273v1
- Date: Wed, 23 Mar 2022 08:40:42 GMT
- Authors: Denis Coquenet, Clément Chatelain, and Thierry Paquet
- Abstract summary: We propose an end-to-end segmentation-free architecture for handwritten document recognition.
The model is trained to label text parts using begin and end tags in an XML-like fashion.
We achieve competitive results on the READ dataset at page level, as well as double-page level with a CER of 3.53% and 3.69%, respectively.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unconstrained handwritten document recognition is a challenging computer
vision task. It is traditionally handled by a two-step approach combining line
segmentation followed by text line recognition. For the first time, we propose
an end-to-end segmentation-free architecture for the task of handwritten
document recognition: the Document Attention Network. In addition to the text
recognition, the model is trained to label text parts using begin and end tags
in an XML-like fashion. This model is made up of an FCN encoder for feature
extraction and a stack of transformer decoder layers for a recurrent
token-by-token prediction process. It takes whole text documents as input and
sequentially outputs characters, as well as logical layout tokens. Contrary to
the existing segmentation-based approaches, the model is trained without using
any segmentation label. We achieve competitive results on the READ dataset at
page level, as well as double-page level with a CER of 3.53% and 3.69%,
respectively. We also provide results for the RIMES dataset at page level,
reaching a CER of 4.54%.
We provide all source code and pre-trained model weights at
https://github.com/FactoDeepLearning/DAN.
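The results above are reported as character error rate (CER). As a minimal sketch of how such a figure is commonly computed, the snippet below divides the total Levenshtein edit distance by the total number of reference characters. The function names `edit_distance` and `cer` are illustrative, and the sketch assumes no text normalization and no special handling of layout tokens; the authors' exact evaluation protocol is not described in this summary and may differ.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum number of character insertions,
    deletions, and substitutions needed to turn `hyp` into `ref`."""
    # prev[j] holds the distance between the processed prefix of ref
    # and the first j characters of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # character of ref missing from hyp
                curr[j - 1] + 1,     # extra character in hyp
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference length.
    Assumption: no normalization is applied to either side."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical ground truth and prediction for a single page.
    truth = ["handwritten document"]
    predicted = ["handwriten document"]
    print(f"CER: {cer(truth, predicted):.2%}")
```

Summing edits and lengths over the whole corpus, rather than averaging per-page rates, keeps short pages from dominating the score; this is a design choice of the sketch, not a statement about the paper.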
Related papers
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D is a visual-language task that segments all points of the specified object from a 3D point cloud described by a query sentence.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding [56.079013202051094]
We present SegVG, a novel method that transfers the box-level annotation as signals to provide additional pixel-level supervision for Visual Grounding.
This approach allows us to iteratively exploit the annotation as signals for both box-level regression and pixel-level segmentation.
arXiv Detail & Related papers (2024-07-03T15:30:45Z)
- Handwritten and Printed Text Segmentation: A Signature Case Study [0.0]
We develop novel approaches to address the challenges of handwritten and printed text segmentation.
Our objective is to recover text from different classes in their entirety, especially enhancing the segmentation performance on overlapping sections.
Our best configuration outperforms prior work on two different datasets by 17.9% and 7.3% on IoU scores.
arXiv Detail & Related papers (2023-07-15T21:49:22Z)
- Towards End-to-end Handwritten Document Recognition [0.0]
Handwritten text recognition has been widely studied in the last decades for its numerous applications.
In this thesis, we propose to tackle these issues by performing the handwritten text recognition of whole documents in an end-to-end way.
We reached state-of-the-art results at paragraph level on the RIMES 2011, IAM and READ 2016 datasets and outperformed the line-level state of the art on these datasets.
arXiv Detail & Related papers (2022-09-30T10:31:22Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
- Minimally-Supervised Structure-Rich Text Categorization via Learning on Text-Rich Networks [61.23408995934415]
We propose a novel framework for minimally supervised categorization by learning from the text-rich network.
Specifically, we jointly train two modules with different inductive biases -- a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning.
Our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%.
arXiv Detail & Related papers (2021-02-23T04:14:34Z)
- SPAN: a Simple Predict & Align Network for Handwritten Paragraph Recognition [2.277447144331876]
We propose an end-to-end recurrence-free Fully Convolutional Network performing OCR at paragraph level without any prior segmentation stage.
The framework is as simple as the one used for the recognition of isolated lines and we achieve competitive results on three popular datasets.
arXiv Detail & Related papers (2021-02-17T13:12:45Z)
- End-to-end Handwritten Paragraph Text Recognition Using a Vertical Attention Network [2.277447144331876]
We propose a unified end-to-end model using hybrid attention to tackle this task.
We achieve state-of-the-art character error rate at line and paragraph levels on three popular datasets.
arXiv Detail & Related papers (2020-12-07T17:31:20Z)
- OrigamiNet: Weakly-Supervised, Segmentation-Free, One-Step, Full Page Text Recognition by learning to unfold [6.09170287691728]
We take a step from segmentation-free single line recognition towards segmentation-free multi-line / full page recognition.
We propose a novel and simple neural network module, termed OrigamiNet, that can augment any CTC-trained, fully convolutional single line text recognizer.
We achieve state-of-the-art character error rate on both IAM & ICDAR 2017 HTR benchmarks for handwriting recognition, surpassing all other methods in the literature.
arXiv Detail & Related papers (2020-06-12T22:18:02Z)
- TextScanner: Reading Characters in Order for Robust Scene Text Recognition [60.04267660533966]
TextScanner is an alternative approach for scene text recognition.
It generates pixel-wise, multi-channel segmentation maps for character class, position and order.
It also adopts an RNN for context modeling and performs parallel prediction of character position and class.
arXiv Detail & Related papers (2019-12-28T07:52:00Z)