Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer
- URL: http://arxiv.org/abs/2105.02412v2
- Date: Sun, 9 May 2021 17:00:55 GMT
- Title: Handwritten Mathematical Expression Recognition with Bidirectionally Trained Transformer
- Authors: Wenqi Zhao, Liangcai Gao, Zuoyu Yan, Shuai Peng, Lin Du, Ziyin Zhang
- Abstract summary: A transformer-based decoder is employed to replace RNN-based ones.
Experiments demonstrate that our model improves the ExpRate of current state-of-the-art methods on CROHME 2014 by 2.23%.
- Score: 2.952085248753861
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Encoder-decoder models have made great progress on handwritten mathematical
expression recognition recently. However, it is still a challenge for existing
methods to assign attention to image features accurately. Moreover, those
encoder-decoder models usually adopt RNN-based models in their decoder part,
which makes them inefficient in processing long $\LaTeX{}$ sequences. In this
paper, a transformer-based decoder is employed to replace RNN-based ones, which
makes the whole model architecture very concise. Furthermore, a novel training
strategy is introduced to fully exploit the potential of the transformer in
bidirectional language modeling. Compared to several methods that do not use
data augmentation, experiments demonstrate that our model improves the ExpRate
of current state-of-the-art methods on CROHME 2014 by 2.23%. Similarly, on
CROHME 2016 and CROHME 2019, we improve the ExpRate by 1.92% and 2.28%
respectively.
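The bidirectional training strategy lends itself to a short illustration. The sketch below is a minimal, assumed reconstruction (token names and data handling are not taken from the paper): each LaTeX target is expanded into a left-to-right and a right-to-left training pair so that one transformer decoder is exposed to both reading directions.

```python
# Minimal sketch of bidirectional target construction (assumed tokens and
# formatting; not the authors' implementation).
from typing import List, Tuple

SOS, EOS = "<sos>", "<eos>"

def bidirectional_pairs(latex_tokens: List[str]) -> List[Tuple[List[str], List[str]]]:
    """Build L2R and R2L (decoder_input, target) pairs for one LaTeX sequence."""
    l2r = [SOS] + latex_tokens + [EOS]
    r2l = [SOS] + list(reversed(latex_tokens)) + [EOS]
    pairs = []
    for seq in (l2r, r2l):
        decoder_input, target = seq[:-1], seq[1:]  # teacher-forcing shift by one token
        pairs.append((decoder_input, target))
    return pairs

# Example: the expression "x ^ { 2 }" yields one pair per reading direction.
print(bidirectional_pairs(["x", "^", "{", "2", "}"]))
```

If, as the abstract suggests, a single transformer decoder serves both directions, the strategy amounts to doubling the training targets per image rather than adding a second decoder.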
Related papers
- ReNO: Enhancing One-step Text-to-Image Models through Reward-based Noise Optimization [59.72782742378666]
We propose Reward-based Noise Optimization (ReNO) to enhance Text-to-Image models at inference.
Within a computational budget of 20-50 seconds, ReNO-enhanced one-step models consistently surpass the performance of all current open-source Text-to-Image models.
arXiv Detail & Related papers (2024-06-06T17:56:40Z)
- Arbitrary-Length Generalization for Addition in a Tiny Transformer [55.2480439325792]
This paper introduces a novel training methodology that enables a Transformer model to generalize the addition of two-digit numbers to numbers with unseen lengths of digits.
The proposed approach employs an autoregressive generation technique, processing from right to left, which mimics a common manual method for adding large numbers.
arXiv Detail & Related papers (2024-05-31T03:01:16Z)
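As a rough illustration of the right-to-left generation idea in this entry, the sketch below builds addition examples whose answer digits are emitted least-significant-digit first, mimicking manual column addition; the string format is an assumption, not the paper's actual data pipeline.

```python
# Hypothetical construction of right-to-left addition targets; the "a+b=" prompt
# format is an illustrative assumption.
def reversed_addition_example(a: int, b: int) -> str:
    """Keep the operands as written, but emit the answer digits reversed."""
    answer = str(a + b)[::-1]                    # e.g. 47 + 85 = 132 -> "231"
    return f"{a}+{b}={answer}"

print(reversed_addition_example(47, 85))         # 47+85=231 (read the answer right to left)
print(reversed_addition_example(12345, 67890))   # 12345+67890=53208
```

Emitting the least significant digit first lets the carry for each position be resolved before the next digit is produced, mirroring how manual column addition works.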
- Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition [51.66383337087724]
The Handwritten Mathematical Expression Recognition (HMER) task is a critical branch in the field of OCR.
Recent studies have demonstrated that incorporating bidirectional context information significantly improves the performance of HMER models.
We propose the Mirror-Flipped Symbol Layout Tree (MF-SLT) and Bidirectional Asynchronous Training (BAT) structure.
arXiv Detail & Related papers (2023-12-31T09:24:21Z)
- DenseBAM-GI: Attention Augmented DenseNet with momentum aided GRU for HMER [4.518012967046983]
It is difficult to accurately determine the length and complex spatial relationships among symbols in handwritten mathematical expressions.
In this study, we present a novel encoder-decoder architecture (DenseBAM-GI) for HMER.
The proposed model is an efficient and lightweight architecture with performance equivalent to state-of-the-art models in terms of Expression Recognition Rate (ExpRate).
arXiv Detail & Related papers (2023-06-28T18:12:23Z)
- Inflected Forms Are Redundant in Question Generation Models [27.49894653349779]
We propose an approach to enhance the performance of Question Generation using an encoder-decoder framework.
Firstly, we identify the inflected forms of words in the encoder input and replace them with the root words.
Secondly, we propose to adapt QG as a combination of the following actions in the encoder-decoder framework: generating a question word, copying a word from the source sequence, or generating a word transformation type.
arXiv Detail & Related papers (2023-01-01T13:08:11Z)
- CoMER: Modeling Coverage for Transformer-based Handwritten Mathematical Expression Recognition [4.812445272764651]
Transformer-based encoder-decoder architecture has recently made significant advances in recognizing handwritten mathematical expressions.
Coverage information, which records the alignment information of the past steps, has proven effective in the RNN models.
We propose CoMER, a model that adopts the coverage information in the transformer decoder.
arXiv Detail & Related papers (2022-07-10T07:59:23Z)
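To make "coverage information" concrete, the sketch below shows one generic way coverage is used: accumulate past attention weights and penalize positions that have already been attended. Shapes, the penalty form, and the function name are assumptions for illustration; this is not CoMER's actual refinement module.

```python
# Generic coverage-style attention refinement (assumed shapes and penalty;
# not CoMER's exact module).
import torch

def coverage_refined_attention(scores: torch.Tensor,
                               coverage: torch.Tensor,
                               penalty: float = 1.0) -> torch.Tensor:
    """
    scores:   (batch, num_positions) raw attention logits for the current step.
    coverage: (batch, num_positions) accumulated attention from previous steps.
    Returns attention weights that discount already-covered image positions.
    """
    refined = scores - penalty * coverage        # discourage re-attending covered regions
    return torch.softmax(refined, dim=-1)

# Toy usage: accumulate coverage over three decoding steps.
scores = torch.randn(2, 16)                      # batch of 2, 16 image positions
coverage = torch.zeros(2, 16)
for _ in range(3):
    attn = coverage_refined_attention(scores, coverage)
    coverage = coverage + attn                   # coverage records where we have looked
print(coverage.sum(dim=-1))                      # ~3.0 per sample: weights sum to 1 each step
```

The effect is the one the entry describes: alignment information from past steps steers the current step away from regions that have already been transcribed.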
- End-to-End Transformer Based Model for Image Captioning [1.4303104706989949]
The Transformer-based model integrates image captioning into one stage and realizes end-to-end training.
The model achieves new state-of-the-art performance of 138.2% (single model) and 141.0% (ensemble of 4 models).
arXiv Detail & Related papers (2022-03-29T08:47:46Z)
- Handwritten Mathematical Expression Recognition via Attention Aggregation based Bi-directional Mutual Learning [13.696706205837234]
We propose an Attention aggregation based Bi-directional Mutual learning Network (ABM).
In the inference phase, given that the model already learns knowledge from two inverse directions, we only use the L2R branch for inference.
Our proposed approach achieves recognition accuracies of 56.85% on CROHME 2014, 52.92% on CROHME 2016, and 53.96% on CROHME 2019 without data augmentation or model ensembling.
arXiv Detail & Related papers (2021-12-07T09:53:40Z)
- Masked Autoencoders Are Scalable Vision Learners [60.97703494764904]
Masked autoencoders (MAE) are scalable self-supervised learners for computer vision.
Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels.
Coupling these two designs enables us to train large models efficiently and effectively.
arXiv Detail & Related papers (2021-11-11T18:46:40Z)
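The core MAE recipe, masking a high fraction of random patches and reconstructing the missing pixels, can be sketched as follows. The shapes, the 75% mask ratio, and the helper name are illustrative assumptions rather than the reference implementation.

```python
# Sketch of MAE-style random patch masking (assumed shapes; not the official code).
import torch

def random_mask_patches(patches: torch.Tensor, mask_ratio: float = 0.75):
    """
    patches: (batch, num_patches, patch_dim) flattened image patches.
    Returns the visible patches fed to the encoder and the indices of the masked
    patches whose pixels the lightweight decoder must reconstruct.
    """
    batch, num_patches, dim = patches.shape
    num_keep = int(num_patches * (1.0 - mask_ratio))
    noise = torch.rand(batch, num_patches)           # one random score per patch
    ids_shuffle = noise.argsort(dim=1)               # random permutation of patch indices
    ids_keep = ids_shuffle[:, :num_keep]             # patches the encoder sees
    ids_masked = ids_shuffle[:, num_keep:]           # patches to reconstruct
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, dim))
    return visible, ids_masked

imgs_as_patches = torch.randn(4, 196, 768)           # e.g. 14x14 patches of dimension 768
visible, masked_ids = random_mask_patches(imgs_as_patches)
print(visible.shape, masked_ids.shape)               # (4, 49, 768) and (4, 147)
```

Because the encoder only processes the small visible subset of patches, most of the compute is skipped, which is what lets the approach scale to large models.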
- TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition [69.68154370877615]
The non-autoregressive (NAR) models remove the temporal dependency between the output tokens and can predict the entire output sequence in as few as one step.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
- Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers [149.78470371525754]
We treat semantic segmentation as a sequence-to-sequence prediction task. Specifically, we deploy a pure transformer to encode an image as a sequence of patches.
With the global context modeled in every layer of the transformer, this encoder can be combined with a simple decoder to provide a powerful segmentation model, termed SEgmentation TRansformer (SETR).
SETR achieves new state of the art on ADE20K (50.28% mIoU), Pascal Context (55.83% mIoU) and competitive results on Cityscapes.
arXiv Detail & Related papers (2020-12-31T18:55:57Z)
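To illustrate the "image as a sequence of patches" idea behind SETR, the sketch below uses standard PyTorch layers to turn an image into patch tokens, run them through a transformer encoder, and map the result back to per-pixel class logits. Layer sizes, the omitted positional embeddings, and the naive upsampling head are simplifications assumed here, not SETR's configuration.

```python
# Minimal patch-sequence segmentation sketch in the spirit of SETR
# (layer sizes and the simple decoder head are illustrative assumptions;
# positional embeddings are omitted for brevity).
import torch
import torch.nn as nn

class TinyPatchSegmenter(nn.Module):
    def __init__(self, in_ch=3, patch=16, dim=256, num_classes=19, img_size=256):
        super().__init__()
        self.grid = img_size // patch                              # patches per side
        self.to_tokens = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)  # global context in every layer
        self.classify = nn.Conv2d(dim, num_classes, kernel_size=1) # simple per-token classifier
        self.upsample = nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False)

    def forward(self, x):
        tokens = self.to_tokens(x).flatten(2).transpose(1, 2)      # (B, N, dim) patch sequence
        tokens = self.encoder(tokens)                              # pure transformer encoding
        feat = tokens.transpose(1, 2).reshape(x.size(0), -1, self.grid, self.grid)
        return self.upsample(self.classify(feat))                  # per-pixel class logits

logits = TinyPatchSegmenter()(torch.randn(1, 3, 256, 256))
print(logits.shape)                                                # torch.Size([1, 19, 256, 256])
```

Treating every patch as a token is what gives the encoder global context at every layer, which is the property the summary above emphasizes.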