Image-to-LaTeX Converter for Mathematical Formulas and Text
- URL: http://arxiv.org/abs/2408.04015v1
- Date: Wed, 7 Aug 2024 18:04:01 GMT
- Title: Image-to-LaTeX Converter for Mathematical Formulas and Text
- Authors: Daniil Gurgurov, Aleksey Morshnev
- Abstract summary: We build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas.
We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this project, we train a vision encoder-decoder model to generate LaTeX code from images of mathematical formulas and text. Utilizing a diverse collection of image-to-LaTeX data, we build two models: a base model with a Swin Transformer encoder and a GPT-2 decoder, trained on machine-generated images, and a fine-tuned version enhanced with Low-Rank Adaptation (LoRA) trained on handwritten formulas. We then compare the BLEU performance of our specialized model on a handwritten test set with other similar models, such as Pix2Text, TexTeller, and Sumen. Through this project, we contribute open-source models for converting images to LaTeX and provide from-scratch code for building these models with distributed training and GPU optimizations.
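The base architecture described above maps directly onto standard open-source tooling. Below is a minimal sketch using Hugging Face transformers and peft; the checkpoint names, LoRA rank, and target modules are illustrative assumptions, not the authors' exact configuration.

```python
# Sketch of a Swin-encoder / GPT-2-decoder model with LoRA fine-tuning.
# Checkpoint names and LoRA hyperparameters are illustrative assumptions.
from transformers import VisionEncoderDecoderModel, AutoImageProcessor, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Pair a Swin Transformer encoder with a GPT-2 decoder; cross-attention
# layers are added to the decoder automatically.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "microsoft/swin-base-patch4-window7-224",  # assumed encoder checkpoint
    "gpt2",                                    # assumed decoder checkpoint
)

image_processor = AutoImageProcessor.from_pretrained(
    "microsoft/swin-base-patch4-window7-224"
)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

# Decoding-related config that generate() needs.
model.config.decoder_start_token_id = tokenizer.bos_token_id
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA fine-tuning for the handwritten domain: wrap attention projections
# with low-rank adapters so only the adapter weights are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection (assumed choice)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```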
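The BLEU comparison can be reproduced with an off-the-shelf scorer. A small sketch with sacrebleu follows, where the prediction and reference strings are placeholders rather than data from the paper.

```python
# Sketch of a corpus-level BLEU comparison for LaTeX output.
# Prediction/reference strings are placeholders, not data from the paper.
import sacrebleu

predictions = [r"\frac { a } { b } + c", r"\sum _ { i = 1 } ^ { n } x _ i"]
references  = [r"\frac { a } { b } + c", r"\sum _ { i = 1 } ^ { n } x _ { i }"]

# sacrebleu expects one list of reference streams per prediction set.
bleu = sacrebleu.corpus_bleu(predictions, [references])
print(f"BLEU: {bleu.score:.2f}")
```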
Related papers
- Automated LaTeX Code Generation from Handwritten Math Expressions Using Vision Transformer [0.0]
We examine the application of advanced transformer-based architectures to the task of converting mathematical expression images into corresponding LaTeX code.
As a baseline, we utilize the current state-of-the-art CNN encoder and LSTM decoder.
We also explore enhancements to the CNN-RNN architecture by replacing the CNN encoder with a pretrained ResNet50 model, modified to suit grayscale input (a sketch of this modification follows this entry).
arXiv Detail & Related papers (2024-12-05T03:58:13Z)
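The grayscale modification mentioned above is a common recipe worth making concrete. A minimal PyTorch sketch, assuming the standard torchvision ResNet50 and the usual trick of summing the pretrained RGB filters; the paper's exact modification is not specified here.

```python
# One common way to adapt a pretrained ResNet50 to grayscale input:
# replace the first conv with a 1-channel version initialized by summing
# the pretrained RGB filters. A generic recipe, not necessarily the
# exact modification used in the paper above.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.DEFAULT)

old_conv = model.conv1  # Conv2d(3, 64, kernel_size=7, stride=2, padding=3)
new_conv = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
with torch.no_grad():
    # Collapse the RGB dimension so pretrained features still respond
    # sensibly to single-channel intensity images.
    new_conv.weight.copy_(old_conv.weight.sum(dim=1, keepdim=True))
model.conv1 = new_conv

x = torch.randn(2, 1, 224, 224)  # batch of grayscale images
print(model(x).shape)            # torch.Size([2, 1000])
```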
- Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves competitive performance compared to cascaded models while using only 70.9% of their parameters, and it significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z)
- MathNet: A Data-Centric Approach for Printed Mathematical Expression Recognition [2.325171167252542]
First, we present an improved version of the benchmark dataset im2latex-100k, featuring 30 fonts instead of one.
Second, we introduce the real-world dataset realFormula, with MEs extracted from papers.
Third, we develop a MER model, MathNet, based on a convolutional vision transformer, which achieves superior results on all four test sets.
arXiv Detail & Related papers (2024-04-21T14:03:34Z)
- MathWriting: A Dataset For Handwritten Mathematical Expression Recognition [0.9012198585960439]
MathWriting is the largest online handwritten mathematical expression dataset to date.
One MathWriting sample consists of a formula written on a touch screen and a corresponding LaTeX expression.
The dataset can also be used in its rendered form for offline HME recognition (a minimal rendering sketch follows this entry).
arXiv Detail & Related papers (2024-04-16T16:10:23Z)
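To illustrate how an online dataset can serve offline recognition, here is a minimal sketch that rasterizes ink strokes into an image. The stroke data and canvas size are invented for illustration; MathWriting's actual storage format (InkML) differs.

```python
# Minimal sketch: turn online ink (lists of (x, y) points per stroke)
# into an offline raster image. Stroke data and canvas size are made up.
from PIL import Image, ImageDraw

strokes = [
    [(10, 40), (30, 10), (50, 40)],          # e.g. the two strokes of an "x"
    [(10, 10), (30, 40), (50, 10)],
]

img = Image.new("L", (64, 64), color=255)    # white grayscale canvas
draw = ImageDraw.Draw(img)
for stroke in strokes:
    draw.line(stroke, fill=0, width=2)       # connect consecutive points
img.save("rendered_expression.png")
```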
- BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing [73.74570290836152]
BLIP-Diffusion is a new subject-driven image generation model that supports multimodal control.
Unlike other subject-driven generation models, BLIP-Diffusion introduces a new multimodal encoder which is pre-trained to provide subject representation.
arXiv Detail & Related papers (2023-05-24T04:51:04Z)
- GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation [143.81719619351335]
Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions.
The tight coupling between the text encoder and image decoder in current T2I models makes it challenging to replace or upgrade either component.
We propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model (a toy alignment sketch follows this entry).
arXiv Detail & Related papers (2023-03-17T15:37:07Z)
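A toy sketch of the alignment objective summarized above: train a small network to map a new encoder's features into the frozen T2I model's text-feature space. The MLP shape and dimensions are guesses, and random tensors stand in for real paired encoder outputs.

```python
# Toy feature-alignment loop: learn a mapping from a new encoder's
# features into a frozen T2I text-feature space. Architecture and
# dimensions are illustrative guesses; random tensors replace real data.
import torch
import torch.nn as nn

new_dim, t2i_dim, seq_len = 1024, 768, 77

glue_net = nn.Sequential(
    nn.Linear(new_dim, 2048), nn.GELU(), nn.Linear(2048, t2i_dim)
)
optimizer = torch.optim.Adam(glue_net.parameters(), lr=1e-4)

for step in range(100):
    # Stand-ins for paired features of the same caption from both encoders.
    new_feats = torch.randn(8, seq_len, new_dim)   # e.g. multilingual encoder
    t2i_feats = torch.randn(8, seq_len, t2i_dim)   # frozen T2I text encoder

    aligned = glue_net(new_feats)
    loss = nn.functional.mse_loss(aligned, t2i_feats)  # alignment objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```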
- Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation [25.14323931233249]
We propose a text-to-image diffusion model based on a Hierarchical Visual Transformer and a Scene Graph incorporating a semantic layout.
In the proposed model, feature vectors of entities and relationships are extracted from the scene graph and incorporated into the diffusion process.
We also introduce a Swin-Transformer-based UNet architecture, called Swinv2-Unet, which addresses problems stemming from CNN convolution operations.
arXiv Detail & Related papers (2022-10-18T02:50:34Z)
- Zero-Shot Image-to-Text Generation for Visual-Semantic Arithmetic [72.60554897161948]
Recent text-to-image matching models apply contrastive learning to large corpora of uncurated pairs of images and sentences.
In this work, we repurpose such models to generate descriptive text for a given image at inference time (a simplified scoring sketch follows this entry).
The resulting captions are much less restrictive than those obtained by supervised captioning methods.
arXiv Detail & Related papers (2021-11-29T11:01:49Z)
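The paper steers generation with the matching model during decoding; the sketch below illustrates only the underlying scoring idea in a much simpler re-ranking form, using CLIP to pick the best of a few hypothetical candidate captions. This is not the paper's actual decoding procedure, and the image path and candidate strings are placeholders.

```python
# Simplified re-ranking illustration of CLIP-based caption scoring.
# Not the paper's method; image path and candidates are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path
candidates = ["a cat on a sofa", "a plate of food", "a mountain at sunset"]

inputs = processor(text=candidates, images=image,
                   return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # similarity score per caption
print(candidates[logits.argmax().item()])     # highest-scoring caption
```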
- LAFITE: Towards Language-Free Training for Text-to-Image Generation [83.2935513540494]
We present the first work that trains text-to-image generation models without any text data.
Our method leverages the well-aligned multimodal semantic space of the powerful pre-trained CLIP model (a sketch of the pseudo text-feature construction follows this entry).
We obtain state-of-the-art results in the standard text-to-image generation tasks.
arXiv Detail & Related papers (2021-11-27T01:54:45Z)
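A minimal sketch of the language-free trick this relies on: synthesize a pseudo text feature by perturbing a CLIP image feature with scaled Gaussian noise, exploiting CLIP's aligned image-text embedding space. The noise scale xi is a hypothetical value, not taken from the paper.

```python
# Sketch: build a pseudo text feature from a CLIP image feature by adding
# norm-scaled Gaussian noise. The noise scale xi is a hypothetical value.
import torch
import torch.nn.functional as F

def pseudo_text_feature(img_feat: torch.Tensor, xi: float = 0.1) -> torch.Tensor:
    """Perturb a normalized CLIP image embedding into a fake text embedding."""
    noise = torch.randn_like(img_feat)
    perturbed = img_feat + xi * noise * img_feat.norm() / noise.norm()
    return F.normalize(perturbed, dim=-1)

img_feat = F.normalize(torch.randn(512), dim=-1)  # stand-in CLIP image feature
txt_feat = pseudo_text_feature(img_feat)
print(txt_feat.shape)  # torch.Size([512])
```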
- Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer [2.952085248753861]
A bidirectionally trained transformer decoder is employed to replace RNN-based decoders (see the target-construction sketch after this entry).
Experiments demonstrate that our model improves the ExpRate of current state-of-the-art methods on CROHME 2014 by 2.23%.
arXiv Detail & Related papers (2021-05-06T03:11:54Z)
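A small sketch of what bidirectional target construction can look like: each LaTeX token sequence is supervised in both reading directions, so a single decoder learns both. The marker tokens and token list are illustrative, not the paper's exact scheme.

```python
# Sketch: supervise each LaTeX target in both reading directions.
# Marker tokens and the example token list are illustrative.
SOS, EOS = "<sos>", "<eos>"

def bidirectional_targets(tokens: list[str]) -> tuple[list[str], list[str]]:
    l2r = [SOS] + tokens + [EOS]
    r2l = [EOS] + tokens[::-1] + [SOS]  # reversed sequence, swapped markers
    return l2r, r2l

l2r, r2l = bidirectional_targets([r"\frac", "{", "a", "}", "{", "b", "}"])
print(l2r)  # ['<sos>', '\\frac', '{', 'a', '}', '{', 'b', '}', '<eos>']
print(r2l)  # ['<eos>', '}', 'b', '{', '}', 'a', '{', '\\frac', '<sos>']
```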
XGPT is a new method of Cross-modal Generative Pre-Training for Image Captioning.
It is designed to pre-train image caption generators through three novel generation tasks.
XGPT can be fine-tuned without any task-specific architecture modifications.
arXiv Detail & Related papers (2020-03-03T12:13:06Z)