MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition
- URL: http://arxiv.org/abs/2404.13667v1
- Date: Sun, 21 Apr 2024 14:03:34 GMT
- Title: MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition
- Authors: Felix M. Schmitt-Koopmann, Elaine M. Huang, Hans-Peter Hutter, Thilo Stadelmann, Alireza Darvishy
- Abstract summary: We present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce the real-world dataset realFormula, with MEs extracted from papers.
Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets.
- Score: 2.325171167252542
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Printed mathematical expression recognition (MER) models are usually trained and tested using LaTeX-generated mathematical expressions (MEs) as input and the LaTeX source code as ground truth. As the same ME can be generated by various different LaTeX source codes, this leads to unwanted variations in the ground truth data that bias test performance results and hinder efficient learning. In addition, the use of only one font to generate the MEs heavily limits the generalization of the reported results to realistic scenarios. We propose a data-centric approach to overcome this problem, and present convincing experimental results: Our main contribution is an enhanced LaTeX normalization to map any LaTeX ME to a canonical form. Based on this process, we developed an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one. Second, we introduce the real-world dataset realFormula, with MEs extracted from papers. Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets (im2latex-100k, im2latexv2, realFormula, and InftyMDB-1), outperforming the previous state of the art by up to 88.3%.
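To illustrate why the abstract treats normalization as central, here is a minimal Python sketch of mapping differently written LaTeX sources to one canonical token string. The rewrite rules, the `SYNONYMS` table, and the function names are hypothetical and chosen only for illustration; they are not MathNet's actual normalization rules, which the abstract does not detail.

```python
import re

# Hypothetical synonym table: command pairs that render identically.
# Illustrative only; not the rule set used by the paper.
SYNONYMS = {
    r"\le": r"\leq",
    r"\ge": r"\geq",
    r"\ne": r"\neq",
    r"\to": r"\rightarrow",
}

# A token is a control word (\alpha), a control symbol (\{), or any
# single non-whitespace character.
TOKEN = re.compile(r"\\[A-Za-z]+|\\.|[^\s]")


def tokenize(src):
    """Split a LaTeX math string into tokens, discarding whitespace."""
    return TOKEN.findall(src)


def normalize(src):
    """Return a canonical, space-separated token string (toy version)."""
    tokens = [SYNONYMS.get(tok, tok) for tok in tokenize(src)]
    out = []
    i = 0
    while i < len(tokens):
        tok = tokens[i]
        out.append(tok)
        # Wrap a bare single-token script argument in braces: x^2 -> x^{2}.
        if tok in ("^", "_") and i + 1 < len(tokens) and tokens[i + 1] != "{":
            out.extend(["{", tokens[i + 1], "}"])
            i += 2
            continue
        i += 1
    return " ".join(out)


if __name__ == "__main__":
    # Two different sources for the same rendered expression.
    print(normalize(r"x^2 \le y_i"))
    print(normalize(r"x^{2}\leq y_{i}"))
```

Both calls print the same string, so an exact-match comparison against the ground truth would no longer penalize a model for choosing one spelling over the other. The sketch only handles single-token script arguments; a realistic normalizer would need a proper parser to cover commands with arguments, such as `\frac`.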
Related papers
- MixTex: Unambiguous Recognition Should Not Rely Solely on Real Data [0.0]
This paper introduces MixTex, an end-to-end OCR model designed for low-bias multilingual recognition.
We identify specific recognition bias issues, such as the frequent misinterpretation of $e-t$ as $e^{-t}$.
We propose an innovative data augmentation method to mitigate this bias.
arXiv Detail & Related papers (2024-06-24T21:38:36Z)
- LLM Critics Help Catch Bugs in Mathematics: Towards a Better Mathematical Verifier with Natural Language Feedback [71.95402654982095]
We propose Math-Minos, a natural language feedback enhanced verifier.
Our experiments reveal that a small set (30k) of natural language feedbacks can significantly boost the performance of the verifier.
arXiv Detail & Related papers (2024-06-20T06:42:27Z)
- Fine-Tuning BERTs for Definition Extraction from Mathematical Text [0.0]
We fine-tuned three pre-trained BERT models on the task of "definition extraction".
This is presented as a binary classification problem, where either a sentence contains a definition of a mathematical term or it does not.
We found that a high-performance Sentence-BERT transformer model performed best based on overall accuracy, recall, and precision metrics.
arXiv Detail & Related papers (2024-06-19T20:47:23Z)
- ICAL: Implicit Character-Aided Learning for Enhanced Handwritten Mathematical Expression Recognition [9.389169879626428]
This paper introduces a novel approach, Implicit Character-Aided Learning (ICAL), to mine the global expression information.
By modeling and utilizing implicit character information, ICAL achieves a more accurate and context-aware interpretation of handwritten mathematical expressions.
arXiv Detail & Related papers (2024-05-15T02:03:44Z)
- Generative AI for Math: Part I -- MathPile: A Billion-Token-Scale Pretraining Corpus for Math [52.66190891388847]
We introduce MathPile, a diverse and high-quality math-centric corpus comprising about 9.5 billion tokens.
Our meticulous data collection and processing efforts included a complex suite of preprocessing.
We hope our MathPile can help to enhance the mathematical reasoning abilities of language models.
arXiv Detail & Related papers (2023-12-28T16:55:40Z)
- MathCoder: Seamless Code Integration in LLMs for Enhanced Mathematical Reasoning [52.97768001837269]
We present a method to fine-tune open-source language models, enabling them to use code for modeling and deriving math equations.
We propose a method of generating novel and high-quality datasets with math problems and their code-based solutions.
This approach yields the MathCoder models, a family of models capable of generating code-based solutions for solving challenging math problems.
arXiv Detail & Related papers (2023-10-05T17:52:09Z)
- WizardMath: Empowering Mathematical Reasoning for Large Language Models via Reinforced Evol-Instruct [128.89645483139236]
We present WizardMath, which enhances the mathematical reasoning abilities of Llama-2, by applying our proposed Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method to the domain of math.
Our model even surpasses ChatGPT-3.5, Claude Instant-1, PaLM-2 and Minerva on GSM8k, and also surpasses Text-davinci, PaLM-1 and GPT-3 on MATH.
arXiv Detail & Related papers (2023-08-18T14:23:21Z)
- GENIUS: Sketch-based Language Model Pre-training via Extreme and Selective Masking for Text Generation and Augmentation [76.7772833556714]
We introduce GENIUS: a conditional text generation model using sketches as input.
GENIUS is pre-trained on a large-scale textual corpus with a novel reconstruction from sketch objective.
We show that GENIUS can be used as a strong and ready-to-use data augmentation tool for various natural language processing (NLP) tasks.
arXiv Detail & Related papers (2022-11-18T16:39:45Z)
- Syntax-Aware Network for Handwritten Mathematical Expression Recognition [53.130826547287626]
Handwritten mathematical expression recognition (HMER) is a challenging task that has many potential applications.
Recent methods for HMER have achieved outstanding performance with an encoder-decoder architecture.
We propose a simple and efficient method for HMER, which is the first to incorporate syntax information into an encoder-decoder network.
arXiv Detail & Related papers (2022-03-03T09:57:19Z)
- ICDAR 2021 Competition on Scientific Table Image Recognition to LaTeX [1.149654395906819]
This paper discusses the dataset, tasks, participants' methods, and results of the ICDAR 2021 Competition on Scientific Table Image Recognition.
We propose two subtasks: reconstruct the structure code from an image, and reconstruct the content code from an image.
This report describes the datasets and ground truth specification, details the performance evaluation metrics used, presents the final results, and summarizes the participating methods.
arXiv Detail & Related papers (2021-05-30T04:17:55Z)
- Machine Translation of Mathematical Text [0.0]
We have implemented a machine translation system, the PolyMath Translator, for documents containing mathematical text.
The current implementation translates English to French, attaining a BLEU score of 53.5 on a held-out test corpus of mathematical sentences.
It produces documents that can be compiled to PDF without further editing.
arXiv Detail & Related papers (2020-10-11T11:59:40Z)
This list is automatically generated from the titles and abstracts of the papers on this site.