Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer
- URL: http://arxiv.org/abs/2412.03853v2
- Date: Sat, 07 Dec 2024 10:55:44 GMT
- Title: Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer
- Authors: Jayaprakash Sundararaj, Akhil Vyas, Benjamin Gonzalez-Maldonado
- Abstract summary: We examine the application of advanced transformer-based architectures to address the task of converting mathematical expression images into corresponding LaTeX code. As a baseline, we utilize the current state-of-the-art CNN encoder and LSTM decoder. We also explore enhancements to the CNN-RNN architecture by replacing the CNN encoder with a pretrained ResNet50 model modified to suit the grayscale input.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Transforming mathematical expressions into LaTeX poses a significant challenge. In this paper, we examine the application of advanced transformer-based architectures to address the task of converting handwritten or digital mathematical expression images into corresponding LaTeX code. As a baseline, we utilize the current state-of-the-art CNN encoder and LSTM decoder. Additionally, we explore enhancements to the CNN-RNN architecture by replacing the CNN encoder with a pretrained ResNet50 model modified to suit the grayscale input. Further, we experiment with a vision transformer model and compare it with the baseline and CNN-LSTM models. Our findings reveal that the vision transformer architectures outperform the baseline CNN-RNN framework, delivering higher overall accuracy and BLEU scores while achieving lower Levenshtein distances. Moreover, these results highlight the potential for further improvement through fine-tuning of model parameters. To encourage open research, we also provide the model implementation, enabling reproduction of our results and facilitating further research in this domain.
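The ResNet50 modification the abstract mentions is not spelled out in this listing; a minimal PyTorch sketch of one common way to adapt a pretrained ResNet50 to single-channel input, assuming the new first convolution is initialized by averaging the pretrained RGB filters (not necessarily the authors' exact scheme):

```python
import torch
import torch.nn as nn
from torchvision import models

def resnet50_grayscale_encoder():
    # ImageNet-pretrained ResNet50; the weights enum requires torchvision >= 0.13.
    backbone = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    rgb_conv = backbone.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
    gray_conv = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
    with torch.no_grad():
        # Assumption: initialize the 1-channel filter as the mean of the RGB filters.
        gray_conv.weight.copy_(rgb_conv.weight.mean(dim=1, keepdim=True))
    backbone.conv1 = gray_conv
    backbone.fc = nn.Identity()  # expose 2048-d pooled features to the decoder
    return backbone

encoder = resnet50_grayscale_encoder()
features = encoder(torch.randn(2, 1, 224, 224))  # -> shape (2, 2048)
```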
Related papers
- Video Prediction Transformers without Recurrence or Convolution [65.93130697098658]
We propose PredFormer, a framework entirely based on Gated Transformers.
We provide a comprehensive analysis of 3D Attention in the context of video prediction.
The significant improvements in both accuracy and efficiency highlight the potential of PredFormer.
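The summary does not specify PredFormer's block design; as a rough illustration only, a GLU-style gated feed-forward block of the kind often used in "gated Transformer" architectures could look like:

```python
import torch
import torch.nn as nn

class GatedFeedForward(nn.Module):
    """GLU-style gated feed-forward block, a common ingredient of gated
    Transformer designs. Purely illustrative; not PredFormer's exact block."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.value = nn.Linear(dim, hidden)
        self.gate = nn.Linear(dim, hidden)
        self.out = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        return x + self.out(self.value(h) * torch.sigmoid(self.gate(h)))

block = GatedFeedForward(dim=256, hidden=1024)
tokens = torch.randn(4, 196, 256)   # (batch, space-time tokens, dim)
print(block(tokens).shape)          # torch.Size([4, 196, 256])
```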
arXiv Detail & Related papers (2024-10-07T03:52:06Z) - TeXBLEU: Automatic Metric for Evaluate LaTeX Format [4.337656290539519]
We propose TeXBLEU, a metric for evaluating mathematical expressions in LaTeX format, built on the n-gram-based BLEU metric.
The proposed TeXBLEU consists of a tokenizer trained on an arXiv paper dataset and a fine-tuned embedding model with positional encoding.
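For context, the vanilla n-gram BLEU that TeXBLEU builds on can be computed over tokenized LaTeX strings; a small NLTK sketch using a naive whitespace tokenizer (TeXBLEU itself uses a trained tokenizer and embeddings instead):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Naive whitespace tokenization; TeXBLEU replaces this with a tokenizer
# trained on arXiv papers plus a fine-tuned embedding model.
reference = r"\frac { a } { b } + \sqrt { x }".split()
hypothesis = r"\frac { a } { b } + \sqrt { y }".split()

score = sentence_bleu([reference], hypothesis,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU: {score:.3f}")
```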
arXiv Detail & Related papers (2024-09-10T16:54:32Z) - Image-to-LaTeX Converter for Mathematical Formulas and Text [0.0]
We build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas.
We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen.
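Low-Rank Adaptation wraps frozen weight matrices with a small trainable low-rank update; a generic PyTorch sketch (rank and scaling are illustrative, not the paper's configuration):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update. Generic LoRA
    sketch; hyperparameters here are placeholders."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # freeze the pretrained weights
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + (x @ self.lora_a @ self.lora_b) * self.scale

layer = LoRALinear(nn.Linear(768, 768))
out = layer(torch.randn(2, 10, 768))  # only lora_a / lora_b receive gradients
```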
arXiv Detail & Related papers (2024-08-07T18:04:01Z) - Combined CNN and ViT features off-the-shelf: Another astounding baseline for recognition [49.14350399025926]
We apply pre-trained architectures, originally developed for the ImageNet Large Scale Visual Recognition Challenge, to periocular recognition.
Middle-layer features from CNNs and ViTs are a suitable way to recognize individuals based on periocular images.
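Extracting such middle-layer features off the shelf is straightforward with forward hooks; an illustrative PyTorch sketch (the choice of `layer3` as the "middle" layer is arbitrary here):

```python
import torch
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2).eval()
captured = {}

def hook(module, inputs, output):
    # Global-average-pool the middle-layer feature map into a descriptor.
    captured["feat"] = output.mean(dim=(2, 3))

handle = model.layer3.register_forward_hook(hook)
with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))
handle.remove()
print(captured["feat"].shape)  # torch.Size([1, 1024])
```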
arXiv Detail & Related papers (2024-07-28T11:52:36Z) - MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition [2.325171167252542]
We present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce the real-world dataset realFormula, with mathematical expressions (MEs) extracted from papers.
Third, we developed a MER model, MathNet, based on a convolutional vision transformer, with superior results on all four test sets.
arXiv Detail & Related papers (2024-04-21T14:03:34Z) - Self-Supervised Pre-Training for Table Structure Recognition Transformer [25.04573593082671]
We propose a self-supervised pre-training (SSP) method for table structure recognition transformers.
We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model.
arXiv Detail & Related papers (2024-02-23T19:34:06Z) - Dynamic Semantic Compression for CNN Inference in Multi-access Edge Computing: A Graph Reinforcement Learning-based Autoencoder [82.8833476520429]
We propose a novel semantic compression method, autoencoder-based CNN architecture (AECNN) for effective semantic extraction and compression in partial offloading.
In the semantic encoder, we introduce a feature compression module based on the channel attention mechanism in CNNs, to compress intermediate data by selecting the most informative features.
In the semantic decoder, we design a lightweight decoder to reconstruct the intermediate data through learning from the received compressed data to improve accuracy.
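A channel-attention module of the squeeze-and-excitation flavor can score per-channel informativeness and keep only the top channels, the kind of selection the semantic encoder performs; a hedged sketch, not the paper's exact AECNN module:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Squeeze-and-excitation style channel scoring; illustrative stand-in
    for the paper's channel-attention compression module."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, keep: int):
        scores = self.fc(x.mean(dim=(2, 3)))            # (B, C) channel scores
        topk = scores.topk(keep, dim=1).indices         # most informative channels
        kept = torch.gather(
            x, 1, topk[:, :, None, None].expand(-1, -1, x.size(2), x.size(3)))
        return kept, topk                               # compressed intermediate data

att = ChannelAttention(64)
compressed, idx = att(torch.randn(2, 64, 28, 28), keep=16)
print(compressed.shape)  # torch.Size([2, 16, 28, 28])
```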
arXiv Detail & Related papers (2024-01-19T15:19:47Z) - Enhancing Diffusion Models with Text-Encoder Reinforcement Learning [63.41513909279474]
Text-to-image diffusion models are typically trained to optimize the log-likelihood objective.
Recent research addresses this issue by refining the diffusion U-Net using human rewards through reinforcement learning or direct backpropagation.
We demonstrate that by finetuning the text encoder through reinforcement learning, we can enhance the text-image alignment of the results.
arXiv Detail & Related papers (2023-11-27T09:39:45Z) - High-Performance Transformers for Table Structure Recognition Need Early Convolutions [25.04573593082671]
Existing approaches use classic convolutional neural network (CNN) backbones for the visual encoder and transformers for the textual decoder.
We design a lightweight visual encoder for table structure recognition (TSR) without sacrificing expressive power.
We discover that a convolutional stem can match classic CNN backbone performance, with a much simpler model.
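A convolutional stem of the kind the paper finds sufficient is just a short stack of strided convolutions that tokenizes the image before the transformer; an illustrative sketch (channel widths are placeholders):

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """A short stack of strided 3x3 convolutions replacing a heavy CNN
    backbone as the visual tokenizer. Widths here are illustrative."""
    def __init__(self, dim: int = 256):
        super().__init__()
        chans = [3, 32, 64, 128, dim]
        layers = []
        for c_in, c_out in zip(chans[:-1], chans[1:]):
            layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                       nn.BatchNorm2d(c_out), nn.ReLU()]
        self.stem = nn.Sequential(*layers)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.stem(images)                   # (B, dim, H/16, W/16)
        return feats.flatten(2).transpose(1, 2)     # (B, tokens, dim)

tokens = ConvStem()(torch.randn(1, 3, 224, 224))
print(tokens.shape)  # torch.Size([1, 196, 256])
```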
arXiv Detail & Related papers (2023-11-09T18:20:52Z) - Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval [68.61855682218298]
Cross-modal retrieval methods employ two-stream encoders with different architectures for images and texts.
Inspired by recent advances of Transformers in vision tasks, we propose to unify the encoder architectures with Transformers for both modalities.
We design a cross-modal retrieval framework purely based on two-stream Transformers, dubbed Hierarchical Alignment Transformers (HAT), which consists of an image Transformer, a text Transformer, and a hierarchical alignment module.
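At a high level, a two-stream Transformer retrieval setup embeds each modality with its own Transformer and aligns them in a shared space; a generic sketch (HAT's hierarchical alignment module is not reproduced here):

```python
import torch
import torch.nn as nn

class TwoStreamRetrieval(nn.Module):
    """Two Transformer encoders (image patches, text tokens) projected into
    a shared space for cosine-similarity retrieval. Generic illustration."""
    def __init__(self, dim: int = 256):
        super().__init__()
        def encoder():
            layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)
        self.image_encoder, self.text_encoder = encoder(), encoder()
        self.image_proj = nn.Linear(dim, dim)
        self.text_proj = nn.Linear(dim, dim)

    def forward(self, image_tokens, text_tokens):
        img = self.image_proj(self.image_encoder(image_tokens).mean(dim=1))
        txt = self.text_proj(self.text_encoder(text_tokens).mean(dim=1))
        img = nn.functional.normalize(img, dim=-1)
        txt = nn.functional.normalize(txt, dim=-1)
        return img @ txt.T  # (B_img, B_txt) similarity matrix for retrieval

model = TwoStreamRetrieval()
sims = model(torch.randn(4, 196, 256), torch.randn(4, 32, 256))
```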
arXiv Detail & Related papers (2023-08-08T15:43:59Z) - T-former: An Efficient Transformer for Image Inpainting [50.43302925662507]
A class of attention-based network architectures, called transformers, has shown significant performance in natural language processing.
In this paper, we design a novel attention linearly related to the resolution according to Taylor expansion, and based on this attention, a network called $T$-former is designed for image inpainting.
Experiments on several benchmark datasets demonstrate that our proposed method achieves state-of-the-art accuracy while maintaining a relatively low number of parameters and computational complexity.
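The key idea is replacing softmax attention, whose cost is quadratic in the number of pixels, with a kernelized form whose cost grows linearly with resolution; a generic linear-attention sketch (using a simple positive feature map, not necessarily the paper's exact Taylor-derived kernel):

```python
import torch

def linear_attention(q, k, v, eps: float = 1e-6):
    """O(N * d^2) attention via a positive feature map, instead of the
    O(N^2 * d) softmax form. The feature map is illustrative; T-former
    derives its kernel from a Taylor expansion of the softmax."""
    q = torch.nn.functional.elu(q) + 1          # positive feature map
    k = torch.nn.functional.elu(k) + 1
    kv = torch.einsum("bnd,bne->bde", k, v)     # summarize keys/values once
    z = 1.0 / (q @ k.sum(dim=1).unsqueeze(-1) + eps)  # per-query normalizer
    return torch.einsum("bnd,bde->bne", q, kv) * z

q = k = v = torch.randn(2, 4096, 64)            # 4096 'pixels' stay tractable
out = linear_attention(q, k, v)
print(out.shape)  # torch.Size([2, 4096, 64])
```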
arXiv Detail & Related papers (2023-05-12T04:10:42Z) - Comparative study of Transformer and LSTM Network with attention mechanism on Image Captioning [0.0]
This study compares Transformer and LSTM with attention block model on MS-COCO dataset.
Both the Transformer and the LSTM-with-attention models are discussed with respect to state-of-the-art accuracy.
arXiv Detail & Related papers (2023-03-05T11:45:53Z) - Image Captioning In the Transformer Age [71.06437715212911]
Image Captioning (IC) has achieved astonishing developments by incorporating various techniques into the CNN-RNN encoder-decoder architecture.
This paper analyzes the connections between IC with some popular self-supervised learning paradigms.
arXiv Detail & Related papers (2022-04-15T08:13:39Z) - End-to-End Transformer Based Model for Image Captioning [1.4303104706989949]
The Transformer-based model integrates image captioning into one stage and realizes end-to-end training.
The model achieves new state-of-the-art performances of 138.2% (single model) and 141.0% (ensemble of 4 models).
arXiv Detail & Related papers (2022-03-29T08:47:46Z) - Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer [2.952085248753861]
A transformer decoder is employed to replace RNN-based ones.
Experiments demonstrate that our model improves the ExpRate of current state-of-the-art methods on CROHME 2014 by 2.23%.
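Swapping an RNN decoder for a transformer decoder amounts to attending over the encoder's feature tokens while autoregressively emitting LaTeX tokens; a minimal PyTorch sketch (vocabulary size and dimensions are placeholders):

```python
import torch
import torch.nn as nn

class LatexTransformerDecoder(nn.Module):
    """Autoregressive transformer decoder over encoder image features.
    Vocabulary size and dimensions are illustrative placeholders."""
    def __init__(self, vocab: int = 500, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True),
            num_layers=3)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tgt_tokens, memory):
        T = tgt_tokens.size(1)
        # Causal mask so each position only attends to earlier tokens.
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tgt_tokens), memory, tgt_mask=causal)
        return self.head(h)  # next-token logits over the LaTeX vocabulary

dec = LatexTransformerDecoder()
logits = dec(torch.randint(0, 500, (2, 20)), torch.randn(2, 196, 256))
print(logits.shape)  # torch.Size([2, 20, 500])
```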
arXiv Detail & Related papers (2021-05-06T03:11:54Z) - Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
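Taking image patches as inputs means flattening the image into a token sequence before the transformer propagates global context; a standard ViT-style patch-embedding sketch (generic, not VST's exact tokenizer):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping patches and project each to a
    token, the standard input format for ViT-style models."""
    def __init__(self, patch: int = 16, dim: int = 256, in_chans: int = 3):
        super().__init__()
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch, stride=patch)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.proj(images).flatten(2).transpose(1, 2)  # (B, patches, dim)

patches = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(patches.shape)  # torch.Size([1, 196, 256])
```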
arXiv Detail & Related papers (2021-04-25T08:24:06Z) - Exploring Deep Hybrid Tensor-to-Vector Network Architectures for Regression Based Speech Enhancement [53.47564132861866]
We find that a hybrid architecture, namely CNN-TT, is capable of maintaining good performance with a reduced model parameter size.
CNN-TT is composed of several convolutional layers at the bottom for feature extraction to improve speech quality.
arXiv Detail & Related papers (2020-07-25T22:21:05Z) - Hyperparameter Analysis for Image Captioning [0.0]
We perform a thorough sensitivity analysis on state-of-the-art image captioning approaches using two different architectures: CNN+LSTM and CNN+Transformer.
The biggest takeaway from the experiments is that fine-tuning the CNN encoder outperforms the baseline.
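Fine-tuning the CNN encoder rather than freezing it is typically done by unfreezing its parameters and giving them a smaller learning rate than the decoder; an illustrative sketch (the learning rates and the LSTM decoder here are assumptions, not the paper's settings):

```python
import torch
from torchvision import models

encoder = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
decoder = torch.nn.LSTM(input_size=2048, hidden_size=512, batch_first=True)

for p in encoder.parameters():      # unfreeze: fine-tune instead of fixing
    p.requires_grad_(True)

# Common heuristic: smaller learning rate for the pretrained encoder.
optimizer = torch.optim.Adam([
    {"params": encoder.parameters(), "lr": 1e-5},
    {"params": decoder.parameters(), "lr": 1e-4},
])
```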
arXiv Detail & Related papers (2020-06-19T01:49:37Z)