Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer
- URL: http://arxiv.org/abs/2105.02412v2
- Date: Sun, 9 May 2021 17:00:55 GMT
- Title: Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer
- Authors: Wenqi Zhao, Liangcai Gao, Zuoyu Yan, Shuai Peng, Lin Du, Ziyin Zhang
- Abstract summary: A transformer-based decoder is employed to replace RNN-based ones.
Experiments demonstrate that our model improves the ExpRate of current state-of-the-art methods on CROHME 2014 by 2.23%.
- Score: 2.952085248753861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Encoder-decoder models have made great progress on handwritten mathematical
expression recognition recently. However, it is still a challenge for existing
methods to assign attention to image features accurately. Moreover, those
encoder-decoder models usually adopt RNN-based models in their decoder part,
which makes them inefficient in processing long $\LaTeX{}$ sequences. In this
paper, a transformer-based decoder is employed to replace RNN-based ones, which
makes the whole model architecture very concise. Furthermore, a novel training
strategy is introduced to fully exploit the potential of the transformer in
bidirectional language modeling. Compared to several methods that do not use
data augmentation, experiments demonstrate that our model improves the ExpRate
of current state-of-the-art methods on CROHME 2014 by 2.23%. Similarly, on
CROHME 2016 and CROHME 2019, we improve the ExpRate by 1.92% and 2.28%
respectively.
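The bidirectional training strategy lends itself to a short illustration. The sketch below is a minimal, assumed reconstruction (token names and data handling are not taken from the paper): each LaTeX target is expanded into a left-to-right and a right-to-left training pair so that one transformer decoder is exposed to both reading directions.

```python
# Minimal sketch of bidirectional target construction (assumed tokens and
# formatting; not the authors' implementation).
from typing import List, Tuple

SOS, EOS = "<sos>", "<eos>"

def bidirectional_pairs(latex_tokens: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Build L2R and R2L (decoder_input, target) pairs for one LaTeX sequence."""
    l2r = [SOS] + latex_tokens + [EOS]
    r2l = [SOS] + list(reversed(latex_tokens)) + [EOS]
    pairs = []
    for seq in (l2r, r2l):
        decoder_input, target = seq[:-1], seq[1:]  # teacher-forcing shift by one token
        pairs.append((decoder_input, target))
    return pairs

# Example: the expression "x ^ { 2 }" yields one pair per reading direction.
print(bidirectional_pairs(["x", "^", "{", "2", "}"]))
```

If, as the abstract suggests, a single transformer decoder serves both directions, the strategy amounts to doubling the training targets per image rather than adding a second decoder.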
Related papers
- ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization [59.72782742378666]
We propose Reward-based Noise Optimization (ReNO) to enhance Text-to-Image models at inference.
Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models.
arXiv Detail & Related papers (2024-06-06T17:56:40Z)
- Arbitrary-Length Generalization for Addition in a Tiny Transformer [55.2480439325792]
This paper introduces a novel training methodology that enables a Transformer model to generalize the addition of two-digit numbers to numbers with unseen lengths of digits.
The proposed approach employs an autoregressive generation technique, processing from right to left, which mimics a common manual method for adding large numbers.
arXiv Detail & Related papers (2024-05-31T03:01:16Z)
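As a rough illustration of the right-to-left generation idea in this entry, the sketch below builds addition examples whose answer digits are emitted least-significant-digit first, mimicking manual column addition; the string format is an assumption, not the paper's actual data pipeline.

```python
# Hypothetical construction of right-to-left addition targets; the "a+b=" prompt
# format is an illustrative assumption.
def reversed_addition_example(a: int, b: int) -> str:
    """Keep the operands as written, but emit the answer digits reversed."""
    answer = str(a + b)[::-1]                    # e.g. 47 + 85 = 132 -> "231"
    return f"{a}+{b}={answer}"

print(reversed_addition_example(47, 85))         # 47+85=231 (read the answer right to left)
print(reversed_addition_example(12345, 67890))   # 12345+67890=53208
```

Emitting the least significant digit first lets the carry for each position be resolved before the next digit is produced, mirroring how manual column addition works.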
- Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition [51.66383337087724]
The Handwritten Mathematical Expression Recognition (HMER) task is a critical branch in the field of OCR.
Recent studies have demonstrated that incorporating bidirectional context information significantly improves the performance of HMER models.
We propose the Mirror-Flipped Symbol Layout Tree (MF-SLT) and Bidirectional Asynchronous Training (BAT) structure.
arXiv Detail & Related papers (2023-12-31T09:24:21Z)
- DenseBAM-GI: Attention Augmented DenseNet with momentum aided GRU for HMER [4.518012967046983]
It is difficult to accurately determine the length and complex spatial relationships among symbols in handwritten mathematical expressions.
In this study, we present a novel encoder-decoder architecture (DenseBAM-GI) for HMER.
The proposed model is an efficient and lightweight architecture with performance equivalent to state-of-the-art models in terms of Expression Recognition Rate (ExpRate).
arXiv Detail & Related papers (2023-06-28T18:12:23Z)
- Inflected Forms Are Redundant in Question Generation Models [27.49894653349779]
We propose an approach to enhance the performance of Question Generation using an encoder-decoder framework.
Firstly, we identify the inflected forms of words in the encoder input and replace them with the root words.
Secondly, we propose to adapt QG as a combination of the following actions in the encoder-decoder framework: generating a question word, copying a word from the source sequence, or generating a word transformation type.
arXiv Detail & Related papers (2023-01-01T13:08:11Z)
- CoMER: Modeling Coverage for Transformer-based Handwritten Mathematical Expression Recognition [4.812445272764651]
Transformer-based encoder-decoder architecture has recently made significant advances in recognizing handwritten mathematical expressions.
Coverage information, which records the alignment information of the past steps, has proven effective in the RNN models.
We propose CoMER, a model that adopts the coverage information in the transformer decoder.
arXiv Detail & Related papers (2022-07-10T07:59:23Z)
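To make "coverage information" concrete, the sketch below shows one generic way coverage is used: accumulate past attention weights and penalize positions that have already been attended. Shapes, the penalty form, and the function name are assumptions for illustration; this is not CoMER's actual refinement module.

```python
# Generic coverage-style attention refinement (assumed shapes and penalty;
# not CoMER's exact module).
import torch

def coverage_refined_attention(scores: torch.Tensor,
                               coverage: torch.Tensor,
                               penalty: float = 1.0) -> torch.Tensor:
    """
    scores:   (batch, num_positions) raw attention logits for the current step.
    coverage: (batch, num_positions) accumulated attention from previous steps.
    Returns attention weights that discount already-covered image positions.
    """
    refined = scores - penalty * coverage        # discourage re-attending covered regions
    return torch.softmax(refined, dim=-1)

# Toy usage: accumulate coverage over three decoding steps.
scores = torch.randn(2, 16)                      # batch of 2, 16 image positions
coverage = torch.zeros(2, 16)
for _ in range(3):
    attn = coverage_refined_attention(scores, coverage)
    coverage = coverage + attn                   # coverage records where we have looked
print(coverage.sum(dim=-1))                      # ~3.0 per sample: weights sum to 1 each step
```

The effect is the one the entry describes: alignment information from past steps steers the current step away from regions that have already been transcribed.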
- End-to-End Transformer Based Model for Image Captioning [1.4303104706989949]
The Transformer-based model integrates image captioning into one stage and realizes end-to-end training.
The model achieves new state-of-the-art performance of 138.2% (single model) and 141.0% (ensemble of 4 models).
arXiv Detail & Related papers (2022-03-29T08:47:46Z)
- Handwritten Mathematical Expression Recognition via Attention Aggregation based Bi-directional Mutual Learning [13.696706205837234]
We propose an Attention aggregation based Bi-directional Mutual learning Network (ABM).
In the inference phase, given that the model already learns knowledge from two inverse directions, we only use the L2R branch for inference.
Our proposed approach achieves recognition accuracies of 56.85% on CROHME 2014, 52.92% on CROHME 2016, and 53.96% on CROHME 2019 without data augmentation or model ensembling.
arXiv Detail & Related papers (2021-12-07T09:53:40Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
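The core MAE recipe, masking a high fraction of random patches and reconstructing the missing pixels, can be sketched as follows. The shapes, the 75% mask ratio, and the helper name are illustrative assumptions rather than the reference implementation.

```python
# Sketch of MAE-style random patch masking (assumed shapes; not the official code).
import torch

def random_mask_patches(patches: torch.Tensor, mask_ratio: float = 0.75):
    """
    patches: (batch, num_patches, patch_dim) flattened image patches.
    Returns the visible patches fed to the encoder and the indices of the masked
    patches whose pixels the lightweight decoder must reconstruct.
    """
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1.0 - mask_ratio))
    noise = torch.rand(batch, num_patches)           # one random score per patch
    ids_shuffle = noise.argsort(dim=1)               # random permutation of patch indices
    ids_keep = ids_shuffle[:, :num_keep]             # patches the encoder sees
    ids_masked = ids_shuffle[:, num_keep:]           # patches to reconstruct
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))
    return visible, ids_masked

imgs_as_patches = torch.randn(4, 196, 768)           # e.g. 14x14 patches of dimension 768
visible, masked_ids = random_mask_patches(imgs_as_patches)
print(visible.shape, masked_ids.shape)               # (4, 49, 768) and (4, 147)
```

Because the encoder only processes the small visible subset of patches, most of the compute is skipped, which is what lets the approach scale to large models.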
- TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition [69.68154370877615]
The non-autoregressive (NAR) models remove the temporal dependency between the output tokens and can predict the entire output sequence in as few as one step.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
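To illustrate the "image as a sequence of patches" idea behind SETR, the sketch below uses standard PyTorch layers to turn an image into patch tokens, run them through a transformer encoder, and map the result back to per-pixel class logits. Layer sizes, the omitted positional embeddings, and the naive upsampling head are simplifications assumed here, not SETR's configuration.

```python
# Minimal patch-sequence segmentation sketch in the spirit of SETR
# (layer sizes and the simple decoder head are illustrative assumptions;
# positional embeddings are omitted for brevity).
import torch
import torch.nn as nn

class TinyPatchSegmenter(nn.Module):
    def __init__(self, in_ch=3, patch=16, dim=256, num_classes=19, img_size=256):
        super().__init__()
        self.grid = img_size // patch                              # patches per side
        self.to_tokens = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # global context in every layer
        self.classify = nn.Conv2d(dim, num_classes, kernel_size=1) # simple per-token classifier
        self.upsample = nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False)

    def forward(self, x):
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2)      # (B, N, dim) patch sequence
        tokens = self.encoder(tokens)                              # pure transformer encoding
        feat = tokens.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        return self.upsample(self.classify(feat))                  # per-pixel class logits

logits = TinyPatchSegmenter()(torch.randn(1, 3, 256, 256))
print(logits.shape)                                                # torch.Size([1, 19, 256, 256])
```

Treating every patch as a token is what gives the encoder global context at every layer, which is the property the summary above emphasizes.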