PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer
- URL: http://arxiv.org/abs/2407.07764v1
- Date: Wed, 10 Jul 2024 15:42:58 GMT
- Title: PosFormer: Recognizing Complex Handwritten Mathematical Expression with Position Forest Transformer
- Authors: Tongkun Guan, Chengyu Lin, Wei Shen, Xiaokang Yang
- Abstract summary: Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios.
We propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition.
PosFormer consistently outperforms state-of-the-art methods, with gains of 2.03%/1.22%/2.00%, 1.83%, and 4.62% on the CROHME 2014/2016/2019, M2E, and MNE datasets, respectively.
- Score: 51.260384040953326
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Handwritten Mathematical Expression Recognition (HMER) has wide applications in human-machine interaction scenarios, such as digitized education and automated offices. Recently, sequence-based models with encoder-decoder architectures have been commonly adopted to address this task by directly predicting LaTeX sequences of expression images. However, these methods only implicitly learn the syntax rules provided by LaTeX, which may fail to describe the position and hierarchical relationship between symbols due to complex structural relations and diverse handwriting styles. To overcome this challenge, we propose a position forest transformer (PosFormer) for HMER, which jointly optimizes two tasks: expression recognition and position recognition, to explicitly enable position-aware symbol feature representation learning. Specifically, we first design a position forest that models the mathematical expression as a forest structure and parses the relative position relationships between symbols. Without requiring extra annotations, each symbol is assigned a position identifier in the forest to denote its relative spatial position. Second, we propose an implicit attention correction module to accurately capture attention for HMER in the sequence-based decoder architecture. Extensive experiments validate the superiority of PosFormer, which consistently outperforms the state-of-the-art methods 2.03%/1.22%/2.00%, 1.83%, and 4.62% gains on the single-line CROHME 2014/2016/2019, multi-line M2E, and complex MNE datasets, respectively, with no additional latency or computational cost. Code is available at https://github.com/SJTU-DeepVisionLab/PosFormer.
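The abstract's "position forest" idea, assigning each symbol an identifier for its relative spatial position without extra annotations, can be sketched roughly as follows. This is a simplified, hypothetical illustration (function names and position labels are not taken from the PosFormer code), assuming superscripts and subscripts are written with braced groups:

```python
# Hypothetical sketch of the "position forest" idea: walk a tokenized
# LaTeX sequence and tag each symbol with a path of relative-position
# labels (main line, superscript, subscript). Labels and the function
# name are illustrative, not from the PosFormer implementation.

def assign_position_ids(tokens):
    """Return a list of (symbol, position-path) pairs.

    Assumes scripts use braced groups, e.g. x^{2}, a_{i}^{2}.
    Nested scripts yield longer paths, e.g. ('main', 'sup', 'sup').
    """
    ids, path = [], ["main"]
    for tok in tokens:
        if tok == "^":
            path.append("sup")   # next group sits above the current symbol
        elif tok == "_":
            path.append("sub")   # next group sits below the current symbol
        elif tok == "{":
            pass                 # group start: label was already pushed
        elif tok == "}":
            path.pop()           # leave the superscript/subscript level
        else:
            ids.append((tok, tuple(path)))
    return ids
```

For example, `assign_position_ids(list("a_{i}^{2}"))` tags `a` as main-line, `i` as a subscript of the main line, and `2` as a superscript, so the decoder could be supervised on these paths alongside the LaTeX tokens.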
Related papers
- ICAL: Implicit Character-Aided Learning for Enhanced Handwritten Mathematical Expression Recognition [9.389169879626428]
This paper introduces a novel approach, Implicit Character-Aided Learning (ICAL), to mine the global expression information.
By modeling and utilizing implicit character information, ICAL achieves a more accurate and context-aware interpretation of handwritten mathematical expressions.
arXiv Detail & Related papers (2024-05-15T02:03:44Z)
- Semantic Graph Representation Learning for Handwritten Mathematical Expression Recognition [57.60390958736775]
We propose a simple but efficient method to enhance semantic interaction learning (SIL).
We first construct a semantic graph based on the statistical symbol co-occurrence probabilities.
Then we design a semantic aware module (SAM), which projects the visual and classification feature into semantic space.
Our method achieves better recognition performance than prior arts on both CROHME and HME100K datasets.
arXiv Detail & Related papers (2023-08-21T06:23:41Z)
- A Transformer Architecture for Online Gesture Recognition of Mathematical Expressions [0.0]
A Transformer architecture is shown to provide an end-to-end model for building expression trees from online handwritten gestures corresponding to glyph strokes.
The attention mechanism was successfully used to encode, learn and enforce the underlying syntax of expressions.
For the first time, the encoder is fed with unseen online-temporal data tokens potentially forming an infinitely large vocabulary.
arXiv Detail & Related papers (2022-11-04T17:55:55Z)
- When Counting Meets HMER: Counting-Aware Network for Handwritten Mathematical Expression Recognition [57.51793420986745]
We propose an unconventional network for handwritten mathematical expression recognition (HMER) named Counting-Aware Network (CAN).
We design a weakly-supervised counting module that can predict the number of each symbol class without the symbol-level position annotations.
Experiments on the benchmark datasets for HMER validate that both joint optimization and counting results are beneficial for correcting the prediction errors of encoder-decoder models.
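The weakly-supervised counting idea summarized above can be sketched as follows. This is a hedged illustration, not the CAN implementation: names, shapes, and the map-summation scheme are assumptions. The key point is that per-class spatial activation maps are summed to estimate symbol counts, and the only supervision is the tally derived from the LaTeX transcript, so no symbol-level position annotations are needed.

```python
import numpy as np

# Hypothetical sketch of weakly-supervised symbol counting: a per-class
# spatial map is globally summed to predict how often each symbol class
# appears; the training target is computed from the LaTeX label alone.
# Names and shapes are illustrative, not from the CAN code.

def predicted_counts(class_maps):
    """class_maps: (num_classes, H, W) non-negative activation maps.
    Summing each map over its spatial dimensions yields a per-class
    count estimate."""
    return class_maps.sum(axis=(1, 2))

def count_targets(latex_tokens, vocab):
    """Derive counting labels directly from the LaTeX transcript,
    ignoring structural tokens that are not in the symbol vocabulary."""
    counts = np.zeros(len(vocab))
    for tok in latex_tokens:
        if tok in vocab:
            counts[vocab[tok]] += 1
    return counts
```

Training would then add a regression loss between `predicted_counts` and `count_targets` on top of the usual sequence loss, which is what makes the counting branch annotation-free.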
arXiv Detail & Related papers (2022-07-23T08:39:32Z)
- Syntax-Aware Network for Handwritten Mathematical Expression Recognition [53.130826547287626]
Handwritten mathematical expression recognition (HMER) is a challenging task that has many potential applications.
Recent methods for HMER have achieved outstanding performance with an encoder-decoder architecture.
We propose a simple and efficient method for HMER, which is the first to incorporate syntax information into an encoder-decoder network.
arXiv Detail & Related papers (2022-03-03T09:57:19Z)
- CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition [87.3894423816705]
We propose a novel module called Multi-Domain Character Distance Perception (MDCDP) to establish a visually and semantically related position embedding.
MDCDP uses the position embedding to query both visual and semantic features following the cross-attention mechanism.
We develop CDistNet that stacks multiple MDCDPs to guide a gradually precise distance modeling.
arXiv Detail & Related papers (2021-11-22T06:27:29Z)
- Semantic Parsing in Task-Oriented Dialog with Recursive Insertion-based Encoder [6.507504084891086]
We introduce a Recursive INsertion-based entity recognition (RINE) approach for semantic parsing in task-oriented dialog.
RINE achieves state-of-the-art exact match accuracy on low- and high-resource versions of the conversational semantic parsing benchmark TOP.
Our approach is 2-3.5 times faster than the sequence-to-sequence model at inference time.
arXiv Detail & Related papers (2021-09-09T18:23:45Z)
- EDSL: An Encoder-Decoder Architecture with Symbol-Level Features for Printed Mathematical Expression Recognition [23.658113675853546]
We propose a new method named EDSL, short for encoder-decoder with symbol-level features, to identify printed mathematical expressions from images.
EDSL achieves 92.7% and 89.0% in evaluation, which are 3.47% and 4.04% higher than the state-of-the-art method.
arXiv Detail & Related papers (2020-07-06T03:53:52Z)
- Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE).
arXiv Detail & Related papers (2020-06-28T13:11:02Z)
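As a rough sketch of the untying described above (notation is assumed for illustration, not quoted from the paper): instead of adding position embeddings to word embeddings before attention, which mixes word-position correlation terms into the scores, the attention score is computed from separate word-to-word and position-to-position correlations with untied projections:

```latex
\alpha_{ij} \propto
  \underbrace{(x_i W^Q)(x_j W^K)^{\top}}_{\text{word-to-word}}
  \;+\;
  \underbrace{(p_i U^Q)(p_j U^K)^{\top}}_{\text{position-to-position}}
```

Here $x_i$ are word embeddings, $p_i$ are position embeddings, and $W^Q, W^K$ versus $U^Q, U^K$ are separate projection matrices, so no mixed word-position terms appear in the score.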
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.