Image-to-LaTeX Converter for Mathematical Formulas and Text
- URL: http://arxiv.org/abs/2408.04015v1
- Date: Wed, 7 Aug 2024 18:04:01 GMT
- Title: Image-to-LaTeX Converter for Mathematical Formulas and Text
- Authors: Daniil Gurgurov, Aleksey Morshnev
- Abstract summary: We build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas.
We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this project, we train a vision encoder-decoder model to generate LaTeX code from images of mathematical formulas and text. Utilizing a diverse collection of image-to-LaTeX data, we build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas. We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen. Through this project, we contribute open-source models for converting images to LaTeX and provide from-scratch code for building these models with distributed training and GPU optimizations.
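The base architecture described above maps directly onto standard open-source tooling. Below is a minimal sketch using Hugging Face transformers and peft; the checkpoint names, LoRA rank, and target modules are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a Swin-encoder / GPT-2-decoder model with LoRA fine-tuning.
# Checkpoint names and LoRA hyperparameters are illustrative assumptions.
from transformers import VisionEncoderDecoderModel, AutoImageProcessor, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Pair a Swin Transformer encoder with a GPT-2 decoder; cross-attention
# layers are added to the decoder automatically.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224",  # assumed encoder checkpoint
    "gpt2",                                    # assumed decoder checkpoint
)

image_processor = AutoImageProcessor.from_pretrained(
    "microsoft/swin-base-patch4-window7-224"
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Decoding-related config that generate() needs.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA fine-tuning for the handwritten domain: wrap attention projections
# with low-rank adapters so only the adapter weights are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection (assumed choice)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```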
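The BLEU comparison can be reproduced with an off-the-shelf scorer. A small sketch with sacrebleu follows, where the prediction and reference strings are placeholders rather than data from the paper.

```python
# Sketch of a corpus-level BLEU comparison for LaTeX output.
# Prediction/reference strings are placeholders, not data from the paper.
import sacrebleu

predictions = [r"\frac { a } { b } + c", r"\sum _ { i = 1 } ^ { n } x _ i"]
references  = [r"\frac { a } { b } + c", r"\sum _ { i = 1 } ^ { n } x _ { i }"]

# sacrebleu expects one list of reference streams per prediction set.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")
```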
Related papers
- Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer [0.0]
We examine the application of advanced transformer-based architectures to the task of converting mathematical expression images into corresponding LaTeX code.
As a baseline, we utilize the current state-of-the-art CNN encoder and LSTM decoder.
We also explore enhancements to the CNN-RNN architecture by replacing the CNN encoder with a pretrained ResNet50 model, modified to suit grayscale input (a sketch of this modification follows this entry).
arXiv Detail & Related papers (2024-12-05T03:58:13Z)
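The grayscale modification mentioned above is a common recipe worth making concrete. A minimal PyTorch sketch, assuming the standard torchvision ResNet50 and the usual trick of summing the pretrained RGB filters; the paper's exact modification is not specified here.

```python
# One common way to adapt a pretrained ResNet50 to grayscale input:
# replace the first conv with a 1-channel version initialized by summing
# the pretrained RGB filters. A generic recipe, not necessarily the
# exact modification used in the paper above.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)

old_conv = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
new_conv = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    # Collapse the RGB dimension so pretrained features still respond
    # sensibly to single-channel intensity images.
    new_conv.weight.copy_(old_conv.weight.sum(dim=1, keepdim=True))
model.conv1 = new_conv

x = torch.randn(2, 1, 224, 224)  # batch of grayscale images
print(model(x).shape)            # torch.Size([2, 1000])
```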
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models while using only 70.9% of their parameters, and it significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition [2.325171167252542]
First, we present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce the real-world dataset realFormula, with MEs extracted from papers.
Third, we develop a MER model, MathNet, based on a convolutional vision transformer, which achieves superior results on all four test sets.
arXiv Detail & Related papers (2024-04-21T14:03:34Z)
- MathWriting: A Dataset For Handwritten Mathematical Expression Recognition [0.9012198585960439]
MathWriting is the largest online handwritten mathematical expression dataset to date.
One MathWriting sample consists of a formula written on a touch screen and a corresponding LaTeX expression.
The dataset can also be used in its rendered form for offline HME recognition (a minimal rendering sketch follows this entry).
arXiv Detail & Related papers (2024-04-16T16:10:23Z)
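To illustrate how an online dataset can serve offline recognition, here is a minimal sketch that rasterizes ink strokes into an image. The stroke data and canvas size are invented for illustration; MathWriting's actual storage format (InkML) differs.

```python
# Minimal sketch: turn online ink (lists of (x, y) points per stroke)
# into an offline raster image. Stroke data and canvas size are made up.
from PIL import Image, ImageDraw

strokes = [
    [(10, 40), (30, 10), (50, 40)],          # e.g. the two strokes of an "x"
    [(10, 10), (30, 40), (50, 10)],
]

img = Image.new("L", (64, 64), color=255)    # white grayscale canvas
draw = ImageDraw.Draw(img)
for stroke in strokes:
    draw.line(stroke, fill=0, width=2)       # connect consecutive points
img.save("rendered_expression.png")
```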
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the text encoder and image decoder in current T2I models makes it challenging to replace or upgrade either component.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model (a toy alignment sketch follows this entry).
arXiv Detail & Related papers (2023-03-17T15:37:07Z)
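A toy sketch of the alignment objective summarized above: train a small network to map a new encoder's features into the frozen T2I model's text-feature space. The MLP shape and dimensions are guesses, and random tensors stand in for real paired encoder outputs.

```python
# Toy feature-alignment loop: learn a mapping from a new encoder's
# features into a frozen T2I text-feature space. Architecture and
# dimensions are illustrative guesses; random tensors replace real data.
import torch
import torch.nn as nn

new_dim, t2i_dim, seq_len = 1024, 768, 77

glue_net = nn.Sequential(
    nn.Linear(new_dim, 2048), nn.GELU(), nn.Linear(2048, t2i_dim)
)
optimizer = torch.optim.Adam(glue_net.parameters(), lr=1e-4)

for step in range(100):
    # Stand-ins for paired features of the same caption from both encoders.
    new_feats = torch.randn(8, seq_len, new_dim)   # e.g. multilingual encoder
    t2i_feats = torch.randn(8, seq_len, t2i_dim)   # frozen T2I text encoder

    aligned = glue_net(new_feats)
    loss = nn.functional.mse_loss(aligned, t2i_feats)  # alignment objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```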
- Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation [25.14323931233249]
We propose a text-to-image diffusion model based on a Hierarchical Visual Transformer and a Scene Graph incorporating a semantic layout.
In the proposed model, feature vectors of entities and relationships are extracted from the scene graph and incorporated into the diffusion process.
We also introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which addresses problems stemming from CNN convolution operations.
arXiv Detail & Related papers (2022-10-18T02:50:34Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate descriptive text for a given image at inference time (a simplified scoring sketch follows this entry).
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
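The paper steers generation with the matching model during decoding; the sketch below illustrates only the underlying scoring idea in a much simpler re-ranking form, using CLIP to pick the best of a few hypothetical candidate captions. This is not the paper's actual decoding procedure, and the image path and candidate strings are placeholders.

```python
# Simplified re-ranking illustration of CLIP-based caption scoring.
# Not the paper's method; image path and candidates are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path
candidates = ["a cat on a sofa", "a plate of food", "a mountain at sunset"]

inputs = processor(text=candidates, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # similarity score per caption
print(candidates[logits.argmax().item()])     # highest-scoring caption
```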
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first work that trains text-to-image generation models without any text data.
Our method leverages the well-aligned multimodal semantic space of the powerful pre-trained CLIP model (a sketch of the pseudo text-feature construction follows this entry).
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
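A minimal sketch of the language-free trick this relies on: synthesize a pseudo text feature by perturbing a CLIP image feature with scaled Gaussian noise, exploiting CLIP's aligned image-text embedding space. The noise scale xi is a hypothetical value, not taken from the paper.

```python
# Sketch: build a pseudo text feature from a CLIP image feature by adding
# norm-scaled Gaussian noise. The noise scale xi is a hypothetical value.
import torch
import torch.nn.functional as F

def pseudo_text_feature(img_feat: torch.Tensor, xi: float = 0.1) -> torch.Tensor:
    """Perturb a normalized CLIP image embedding into a fake text embedding."""
    noise = torch.randn_like(img_feat)
    perturbed = img_feat + xi * noise * img_feat.norm() / noise.norm()
    return F.normalize(perturbed, dim=-1)

img_feat = F.normalize(torch.randn(512), dim=-1)  # stand-in CLIP image feature
txt_feat = pseudo_text_feature(img_feat)
print(txt_feat.shape)  # torch.Size([512])
```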
- Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer [2.952085248753861]
A bidirectionally trained transformer decoder is employed to replace RNN-based decoders (see the target-construction sketch after this entry).
Experiments demonstrate that our model improves the ExpRate of current state-of-the-art methods on CROHME 2014 by 2.23%.
arXiv Detail & Related papers (2021-05-06T03:11:54Z)
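A small sketch of what bidirectional target construction can look like: each LaTeX token sequence is supervised in both reading directions, so a single decoder learns both. The marker tokens and token list are illustrative, not the paper's exact scheme.

```python
# Sketch: supervise each LaTeX target in both reading directions.
# Marker tokens and the example token list are illustrative.
SOS, EOS = "<sos>", "<eos>"

def bidirectional_targets(tokens: list[str]) -> tuple[list[str], list[str]]:
    l2r = [SOS] + tokens + [EOS]
    r2l = [EOS] + tokens[::-1] + [SOS]  # reversed sequence, swapped markers
    return l2r, r2l

l2r, r2l = bidirectional_targets([r"\frac", "{", "a", "}", "{", "b", "}"])
print(l2r)  # ['<sos>', '\\frac', '{', 'a', '}', '{', 'b', '}', '<eos>']
print(r2l)  # ['<eos>', '}', 'b', '{', '}', 'a', '{', '\\frac', '<sos>']
```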
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)