Faster DAN: Multi-target Queries with Document Positional Encoding for
End-to-end Handwritten Document Recognition
- URL: http://arxiv.org/abs/2301.10593v1
- Date: Wed, 25 Jan 2023 13:55:14 GMT
- Title: Faster DAN: Multi-target Queries with Document Positional Encoding for
End-to-end Handwritten Document Recognition
- Authors: Denis Coquenet and Clément Chatelain and Thierry Paquet
- Abstract summary: Faster DAN is a two-step strategy to speed up the recognition process at prediction time.
It is at least 4 times faster on whole single-page and double-page images of the RIMES 2009, READ 2016 and MAURDOR datasets.
- Score: 1.7875811547963403
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in handwritten text recognition have made it possible to
recognize whole documents in an end-to-end way: the Document Attention Network (DAN) recognizes
the characters one after the other through an attention-based prediction
process until reaching the end of the document. However, this autoregressive
process leads to inference that cannot benefit from any parallelization
optimization. In this paper, we propose Faster DAN, a two-step strategy to
speed up the recognition process at prediction time: the model predicts the
first character of each text line in the document, and then completes all the
text lines in parallel through multi-target queries and a specific document
positional encoding scheme. Faster DAN reaches competitive results compared to
standard DAN, while being at least 4 times faster on whole single-page and
double-page images of the RIMES 2009, READ 2016 and MAURDOR datasets. Source
code and trained model weights are available at
https://github.com/FactoDeepLearning/FasterDAN.
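The two-step strategy described in the abstract can be sketched in plain Python. This is a control-flow illustration only: `predict_next` is a hypothetical stand-in for the DAN transformer decoder (not the paper's API), and the toy oracle exists solely to make the sketch runnable.

```python
def faster_dan_decode(predict_next, num_lines, max_len):
    """Sketch of Faster DAN's two-step decoding.

    Step 1: predict the first character of each text line, one line
    after another. Step 2: complete all lines together -- each decoding
    step extends every unfinished line at once (multi-target queries).
    """
    # Step 1: first character of every line.
    lines = [[predict_next(i, [])] for i in range(num_lines)]

    # Step 2: complete all lines "in parallel".
    active = set(range(num_lines))
    for _ in range(max_len - 1):
        if not active:
            break
        for i in list(active):  # batched in a single forward pass in the real model
            ch = predict_next(i, lines[i])
            if ch == "<eol>":
                active.discard(i)
            else:
                lines[i].append(ch)
    return ["".join(chars) for chars in lines]

# Toy oracle that plays the role of the trained decoder.
truth = ["ab", "cde"]
def oracle(i, prefix):
    return truth[i][len(prefix)] if len(prefix) < len(truth[i]) else "<eol>"

print(faster_dan_decode(oracle, 2, 10))  # -> ['ab', 'cde']
```

The speed-up in the paper comes from step 2: one decoding step advances every line at once, so the number of sequential steps scales with the longest line rather than with the total character count of the document.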
Related papers
- Contextual Document Embeddings [77.22328616983417]
We propose two complementary methods for contextualized document embeddings.
First, an alternative contrastive learning objective that explicitly incorporates the document neighbors into the intra-batch contextual loss.
Second, a new contextual architecture that explicitly encodes neighbor document information into the encoded representation.
arXiv Detail & Related papers (2024-10-03T14:33:34Z)
- mPLUG-DocOwl2: High-resolution Compressing for OCR-free Multi-page Document Understanding [103.05835688963947]
We propose a High-resolution DocCompressor module to compress each high-resolution document image into 324 tokens.
DocOwl2 sets a new state-of-the-art across multi-page document understanding benchmarks and reduces first token latency by more than 50%.
Compared to single-image MLLMs trained on similar data, our DocOwl2 achieves comparable single-page understanding performance with less than 20% of the visual tokens.
arXiv Detail & Related papers (2024-09-05T11:09:00Z)
- Planning Ahead in Generative Retrieval: Guiding Autoregressive Generation through Simultaneous Decoding [23.061797784952855]
This paper introduces PAG, a novel optimization and decoding approach that guides autoregressive generation of document identifiers.
Experiments on MSMARCO and TREC Deep Learning Track data reveal that PAG outperforms the state-of-the-art generative retrieval model by a large margin.
arXiv Detail & Related papers (2024-04-22T21:50:01Z)
- REXEL: An End-to-end Model for Document-Level Relation Extraction and Entity Linking [11.374031643273941]
REXEL is a highly efficient and accurate model for the joint task of document-level cIE (DocIE).
It is on average 11 times faster than competitive existing approaches in a similar setting.
The combination of speed and accuracy makes REXEL an accurate, cost-efficient system for extracting structured information at web scale.
arXiv Detail & Related papers (2024-04-19T11:04:27Z)
- Context Perception Parallel Decoder for Scene Text Recognition [52.620841341333524]
Scene text recognition methods have struggled to attain high accuracy and fast inference speed.
We present an empirical study of AR decoding in STR, and discover that the AR decoder not only models linguistic context, but also provides guidance on visual context perception.
We construct a series of CPPD models and also plug the proposed modules into existing STR decoders. Experiments on both English and Chinese benchmarks demonstrate that the CPPD models achieve highly competitive accuracy while running approximately 8x faster than their AR-based counterparts.
arXiv Detail & Related papers (2023-07-23T09:04:13Z)
- Towards End-to-end Handwritten Document Recognition [0.0]
Handwritten text recognition has been widely studied in the last decades for its numerous applications.
In this thesis, we propose to tackle these issues by performing handwritten text recognition of whole documents in an end-to-end way.
We reached state-of-the-art results at paragraph level on the RIMES 2011, IAM and READ 2016 datasets and outperformed the line-level state of the art on these datasets.
arXiv Detail & Related papers (2022-09-30T10:31:22Z)
- Learning Diverse Document Representations with Deep Query Interactions for Dense Retrieval [79.37614949970013]
We propose a new dense retrieval model which learns diverse document representations with deep query interactions.
Our model encodes each document with a set of generated pseudo-queries to get query-informed, multi-view document representations.
arXiv Detail & Related papers (2022-08-08T16:00:55Z)
- Autoregressive Search Engines: Generating Substrings as Document Identifiers [53.0729058170278]
Autoregressive language models are emerging as the de facto standard for generating answers.
Previous work has explored ways to partition the search space into hierarchical structures.
In this work, we propose an alternative that does not force any structure on the search space: using all n-grams in a passage as its possible identifiers.
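The n-gram identifier idea can be illustrated in a few lines of Python; `all_ngrams` is a hypothetical helper written for this sketch, not code from the paper:

```python
def all_ngrams(passage, max_n=3):
    """Return every n-gram (up to max_n tokens) of a passage.

    Any of these n-grams can serve as an identifier pointing back to the
    passage, so retrieval is not forced into a fixed hierarchical structure.
    """
    tokens = passage.split()
    return {
        " ".join(tokens[i:i + n])
        for n in range(1, max_n + 1)
        for i in range(len(tokens) - n + 1)
    }

print(sorted(all_ngrams("the cat sat", 2)))
# -> ['cat', 'cat sat', 'sat', 'the', 'the cat']
```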
arXiv Detail & Related papers (2022-04-22T10:45:01Z)
- DAN: a Segmentation-free Document Attention Network for Handwritten Document Recognition [1.7875811547963403]
We propose an end-to-end segmentation-free architecture for handwritten document recognition.
The model is trained to label text parts using begin and end tags in an XML-like fashion.
We achieve competitive results on the READ dataset at page level, as well as double-page level with a CER of 3.53% and 3.69%, respectively.
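The begin/end tagging scheme can be illustrated with a small serializer. The tag names `<page>` and `<line>` are illustrative assumptions, not the paper's exact token set:

```python
def to_tagged_target(pages):
    """Serialize a document's text parts with begin and end tags,
    in the XML-like fashion the training targets are labeled.
    `pages` is a list of pages, each a list of text-line strings."""
    out = []
    for page in pages:
        out.append("<page>")
        for line in page:
            out.append("<line>" + line + "</line>")
        out.append("</page>")
    return "".join(out)

print(to_tagged_target([["hello", "world"]]))
# -> <page><line>hello</line><line>world</line></page>
```

Training on such serialized targets lets a single character-level decoder recover both the text and the document's layout structure, with no explicit segmentation step.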
arXiv Detail & Related papers (2022-03-23T08:40:42Z)
- SDR: Efficient Neural Re-ranking using Succinct Document Representation [4.9278175139681215]
We propose the Succinct Document Representation scheme that computes highly compressed intermediate document representations.
Our method is highly efficient, achieving 4x-11.6x better compression rates for the same ranking quality.
arXiv Detail & Related papers (2021-10-03T07:43:16Z)
- Rethinking Text Line Recognition Models [57.47147190119394]
We consider two decoder families (Connectionist Temporal Classification and Transformer) and three encoder modules (Bidirectional LSTMs, Self-Attention, and GRCLs).
We compare their accuracy and performance on widely used public datasets of scene and handwritten text.
Unlike the more common Transformer-based models, this architecture can handle inputs of arbitrary length.
arXiv Detail & Related papers (2021-04-15T21:43:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.