Towards Scalable Training for Handwritten Mathematical Expression Recognition
- URL: http://arxiv.org/abs/2508.09220v3
- Date: Thu, 11 Sep 2025 15:33:15 GMT
- Title: Towards Scalable Training for Handwritten Mathematical Expression Recognition
- Authors: Haoyang Li, Jiaqing Li, Jialun Cao, Zongyuan Yang, Yongping Xiong,
- Abstract summary: textbfHandwritten textbfMathematical textbfExpression textbfRecognition (HMER) has been impeded by the scarcity of data.<n>We develop a scalable data engine to generate complex and consistent sequences.<n>We propose textttTexTeller, the first HMER model trained at scale, by mix-training textttTex80M with a relatively small HME dataset.
- Score: 11.314986112609922
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large foundation models have achieved significant performance gains through scalable training on massive datasets. However, the field of \textbf{H}andwritten \textbf{M}athematical \textbf{E}xpression \textbf{R}ecognition (HMER) has been impeded by the scarcity of data, primarily due to the arduous and costly process of manual annotation. To bridge this gap, we propose a novel method integrating limited handwritten formulas with large-scale LaTeX-rendered formulas by developing a scalable data engine to generate complex and consistent LaTeX sequences. With this engine, we built the largest formula dataset to date, termed \texttt{Tex80M}, comprising over 80 million high-quality training instances. Then we propose \texttt{TexTeller}, the first HMER model trained at scale, by mix-training \texttt{Tex80M} with a relatively small HME dataset. The expansive training dataset and our refined pipeline have equipped \texttt{TexTeller} with state-of-the-art (SOTA) performance across nearly all benchmarks. To advance the field, we will openly release our complete model, entire dataset, and full codebase, enabling further research building upon our contributions.
Related papers
- Generative Data Transformation: From Mixed to Unified Data [57.84692191369066]
textscTaesar is a emphdata-centric framework for textbftarget-textbfal textbfregeneration.<n>It encodes cross-domain context into target sequences, enabling standard models to learn intricate dependencies without complex fusion architectures.
arXiv Detail & Related papers (2026-02-26T08:30:09Z) - Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models [53.03670032402846]
We address the task of table image to code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs.<n>A central challenge of this task lies in accurately handling complex tables -- those with large sizes, deeply nested structures, and semantically rich or irregular cell content.<n>We propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-La dataset.
arXiv Detail & Related papers (2025-09-22T11:13:48Z) - SPaRFT: Self-Paced Reinforcement Fine-Tuning for Large Language Models [51.74498855100541]
Large language models (LLMs) have shown strong reasoning capabilities when fine-tuned with reinforcement learning (RL)<n>We propose textbfSPaRFT, a self-paced learning framework that enables efficient learning based on the capability of the model being trained.
arXiv Detail & Related papers (2025-08-07T03:50:48Z) - TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation [50.23504065567638]
This paper introduces textbfTD3, a novel textbfDataset textbfDistillation method within a meta-learning framework.<n> TD3 distills a fully expressive emphsynthetic sequence summary from original data.<n>An augmentation technique allows the learner to closely fit the synthetic summary, ensuring an accurate update of it in the emphouter-loop.
arXiv Detail & Related papers (2025-02-05T03:13:25Z) - DreamMask: Boosting Open-vocabulary Panoptic Segmentation with Synthetic Data [61.62554324594797]
We propose DreamMask, which explores how to generate training data in the open-vocabulary setting, and how to train the model with both real and synthetic data.<n>In general, DreamMask significantly simplifies the collection of large-scale training data, serving as a plug-and-play enhancement for existing methods.<n>For instance, when trained on COCO and tested on ADE20K, the model equipped with DreamMask outperforms the previous state-of-the-art by a substantial margin of 2.1% mIoU.
arXiv Detail & Related papers (2025-01-03T19:00:00Z) - MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition [2.325171167252542]
We present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce the real-world dataset realFormula, with MEs extracted from papers.
Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets.
arXiv Detail & Related papers (2024-04-21T14:03:34Z) - MathWriting: A Dataset For Handwritten Mathematical Expression Recognition [0.9012198585960439]
MathWriting is the largest online handwritten mathematical expression dataset to date.<n>One MathWriting sample consists of a formula written on a touch screen and a corresponding expression.<n>This dataset can also be used in its rendered form for offline HME recognition.
arXiv Detail & Related papers (2024-04-16T16:10:23Z) - Hypertext Entity Extraction in Webpage [112.56734676713721]
We introduce a textbfMoE-based textbfEntity textbfExtraction textbfFramework (textitMoEEF), which integrates multiple features to enhance model performance.
We also analyze the effectiveness of hypertext features in textitHEED and several model components in textitMoEEF.
arXiv Detail & Related papers (2024-03-04T03:21:40Z) - How to Choose Pretrained Handwriting Recognition Models for Single
Writer Fine-Tuning [23.274139396706264]
Recent advancements in Deep Learning-based Handwritten Text Recognition (HTR) have led to models with remarkable performance on modern and historical manuscripts.
Those models struggle to obtain the same performance when applied to manuscripts with peculiar characteristics, such as language, paper support, ink, and author handwriting.
In this paper, we take into account large, real benchmark datasets and synthetic ones obtained with a styled Handwritten Text Generation model.
We give a quantitative indication of the most relevant characteristics of such data for obtaining an HTR model able to effectively transcribe manuscripts in small collections with as little as five real fine-tuning lines
arXiv Detail & Related papers (2023-05-04T07:00:28Z) - Efficient Training of Language Models to Fill in the Middle [17.118891860985123]
We show that autoregressive language models can learn to infill text after we apply a straightforward transformation to the dataset.
We use these ablations to prescribe strong default settings and best practices to train FIM models.
We have released our best infilling model trained with best practices in our API, and release our infilling benchmarks to aid future research.
arXiv Detail & Related papers (2022-07-28T17:40:47Z) - Unsupervised Training Data Generation of Handwritten Formulas using
Generative Adversarial Networks with Self-Attention [3.785514121306353]
We introduce a system that creates a large set of synthesized training examples of mathematical expressions which are derived from documents.
For this purpose, we propose a novel attention-based generative adversarial network to translate rendered equations to handwritten formulas.
The datasets generated by this approach contain hundreds of thousands of formulas, making it ideal for pretraining or the design of more complex models.
arXiv Detail & Related papers (2021-06-17T12:27:18Z) - Partially-Aligned Data-to-Text Generation with Distant Supervision [69.15410325679635]
We propose a new generation task called Partially-Aligned Data-to-Text Generation (PADTG)
It is more practical since it utilizes automatically annotated data for training and thus considerably expands the application domains.
Our framework outperforms all baseline models as well as verify the feasibility of utilizing partially-aligned data.
arXiv Detail & Related papers (2020-10-03T03:18:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.