Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention
- URL: http://arxiv.org/abs/2508.06107v2
- Date: Thu, 28 Aug 2025 17:12:05 GMT
- Title: Mask & Match: Learning to Recognize Handwritten Math with Self-Supervised Attention
- Authors: Shree Mitra, Ritabrata Chakraborty, Nilkanta Sahu,
- Abstract summary: We present a self-supervised learning framework for handwritten mathematical expression recognition (HMER). Our approach begins by pretraining an image encoder using a combination of global and local contrastive losses. A key contribution of this work is a novel self-supervised attention network, trained with a progressive spatial masking strategy.
- Score: 0.19116784879310025
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Handwritten mathematical expression recognition (HMER) is a challenging task due to the inherent two-dimensional structure, varying symbol scales, and complex spatial relationships among symbols. In this paper, we present a self-supervised learning (SSL) framework for HMER that eliminates the need for expensive labeled data. Our approach begins by pretraining an image encoder using a combination of global and local contrastive losses, enabling the model to learn both holistic and fine-grained representations. A key contribution of this work is a novel self-supervised attention network, trained with a progressive spatial masking strategy. This attention mechanism is designed to learn semantically meaningful focus regions, such as operators, exponents, and nested mathematical notation, without requiring any supervision. The progressive masking curriculum encourages the network to become increasingly robust to missing or occluded visual information, ultimately improving structural understanding. Our complete pipeline consists of (1) self-supervised pretraining of the encoder, (2) self-supervised attention learning, and (3) supervised fine-tuning with a transformer decoder to generate LaTeX sequences. Extensive experiments on CROHME benchmarks demonstrate that our method outperforms existing SSL and fully supervised baselines, validating the effectiveness of our progressive attention mechanism in enhancing HMER performance. Our codebase can be found here.
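The first two pipeline stages above can be sketched in a few lines. The following is a minimal, illustrative NumPy mock-up, not the paper's implementation: `info_nce` stands in for a generic (global or local) contrastive objective, and `progressive_mask` shows a masking curriculum whose masked fraction grows with training progress. All function names, patch sizes, and ratio schedules here are assumptions.

```python
import numpy as np

def info_nce(z1, z2, tau=0.1):
    """Contrastive (InfoNCE/NT-Xent-style) loss between two views' embeddings.
    Matching rows of z1 and z2 are positives; all other rows are negatives."""
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / tau                       # pairwise cosine similarities
    sim = sim - sim.max(axis=1, keepdims=True)  # numerical stability
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))             # positives sit on the diagonal

def progressive_mask(image, step, total_steps, patch=16,
                     start_ratio=0.1, end_ratio=0.5, rng=None):
    """Zero out randomly chosen square patches; the masked fraction grows
    linearly with training progress, i.e. a progressive masking curriculum."""
    rng = rng or np.random.default_rng(0)
    ratio = start_ratio + (end_ratio - start_ratio) * step / total_steps
    h, w = image.shape[:2]
    gh, gw = h // patch, w // patch
    n_mask = int(round(ratio * gh * gw))
    masked = image.copy()
    for i in rng.choice(gh * gw, size=n_mask, replace=False):
        r, c = divmod(i, gw)
        masked[r*patch:(r+1)*patch, c*patch:(c+1)*patch] = 0.0
    return masked, ratio
```

In a real setup the encoder would be trained on (view, masked view) pairs with the contrastive loss, with the schedule making occlusion progressively harder.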
Related papers
- Robust Representation Learning in Masked Autoencoders [2.599882743586164]
Masked Autoencoders (MAEs) achieve impressive performance in image classification tasks, yet the internal representations they learn remain less understood. This work started as an attempt to understand the strong downstream classification performance of MAEs.
arXiv Detail & Related papers (2026-02-03T13:48:34Z)
- Revisiting Multi-Task Visual Representation Learning [52.93947931352643]
We introduce MTV, a principled multi-task visual pretraining framework. We leverage high-capacity "expert" models to synthesize dense, structured pseudo-labels at scale. Our results demonstrate that MTV achieves "best-of-both-worlds" performance.
arXiv Detail & Related papers (2026-01-20T11:59:19Z)
- Self-Supervised Learning on Molecular Graphs: A Systematic Investigation of Masking Design [11.43518417965958]
Self-supervised learning plays a central role in molecular representation learning. Recent innovations in masking-based pretraining are often introduced heuristically and lack principled evaluation. This work casts the entire pretrain-finetune workflow into a unified probabilistic framework.
arXiv Detail & Related papers (2025-12-08T00:52:46Z)
- HERO: Rethinking Visual Token Early Dropping in High-Resolution Large Vision-Language Models [60.028070589466445]
We propose HERO, a framework that integrates content-adaptive token budget allocation with function-aware token selection. This study provides both empirical insights and practical solutions toward efficient inference in HR-LVLMs.
arXiv Detail & Related papers (2025-09-16T13:22:08Z)
- Pay Attention to What and Where? Interpretable Feature Extractor in Vision-based Deep Reinforcement Learning [2.713322720372114]
Current approaches in Explainable Deep Reinforcement Learning suffer from a displacement between the attention mask and the objects in the visual input. We propose the Interpretable Feature Extractor architecture, aimed at generating an accurate attention mask to illustrate both "what" and "where" the agent concentrates on in the spatial domain. The resulting attention mask is consistent, highly understandable by humans, spatially accurate, and effectively highlights important objects or locations in the visual input.
arXiv Detail & Related papers (2025-04-14T10:18:34Z)
- Harmonizing Visual Representations for Unified Multimodal Understanding and Generation [53.01486796503091]
We present Harmon, a unified autoregressive framework that harmonizes understanding and generation tasks with a shared MAR encoder. Harmon achieves state-of-the-art image generation results on the GenEval, MJHQ30K and WISE benchmarks.
arXiv Detail & Related papers (2025-03-27T20:50:38Z)
- SemiHMER: Semi-supervised Handwritten Mathematical Expression Recognition using pseudo-labels [0.0]
We study semi-supervised Handwritten Mathematical Expression Recognition (HMER) by exploiting both labeled data and extra unlabeled data. We propose a novel consistency regularization framework, termed SemiHMER, which introduces dual-branch semi-supervised learning. The experimental results demonstrate that our work achieves significant performance improvements.
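Dual-branch consistency training of this kind is commonly implemented with cross pseudo-labels: each branch is trained toward the other branch's confident hard predictions on unlabeled data. The sketch below is a generic illustration of that idea, not SemiHMER's actual code; the confidence threshold and function names are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def consistency_loss(logits_a, logits_b, conf_threshold=0.8):
    """Cross-pseudo-label loss: each branch is supervised by the other
    branch's confident hard predictions (cross-entropy on the argmax
    labels, masked by a confidence threshold)."""
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    loss, count = 0.0, 0
    for src, tgt in ((p_a, p_b), (p_b, p_a)):
        conf = src.max(axis=-1)            # confidence of pseudo-labels
        labels = src.argmax(axis=-1)       # hard pseudo-labels
        mask = conf >= conf_threshold      # keep only confident samples
        if mask.any():
            loss += -np.log(tgt[mask, labels[mask]] + 1e-12).mean()
            count += 1
    return loss / max(count, 1)
```

When the two branches agree confidently the loss is near zero; low-confidence predictions are simply ignored, which is the usual guard against noisy pseudo-labels.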
arXiv Detail & Related papers (2025-02-11T01:39:11Z)
- MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition [94.56755080185732]
We propose a Motion-Aware masked autoencoder with Semantic Alignment (MASA) that integrates rich motion cues and global semantic information.
Our framework can simultaneously learn local motion cues and global semantic features for comprehensive sign language representation.
arXiv Detail & Related papers (2024-05-31T08:06:05Z)
- UGMAE: A Unified Framework for Graph Masked Autoencoders [67.75493040186859]
We propose UGMAE, a unified framework for graph masked autoencoders.
We first develop an adaptive feature mask generator to account for the unique significance of nodes.
We then design a ranking-based structure reconstruction objective, jointly with feature reconstruction, to capture holistic graph information.
arXiv Detail & Related papers (2024-02-12T19:39:26Z)
- Bidirectional Trained Tree-Structured Decoder for Handwritten Mathematical Expression Recognition [51.66383337087724]
The Handwritten Mathematical Expression Recognition (HMER) task is a critical branch in the field of OCR.
Recent studies have demonstrated that incorporating bidirectional context information significantly improves the performance of HMER models.
We propose the Mirror-Flipped Symbol Layout Tree (MF-SLT) and Bidirectional Asynchronous Training (BAT) structure.
arXiv Detail & Related papers (2023-12-31T09:24:21Z)
- Investigating Power laws in Deep Representation Learning [4.996066540156903]
We propose a framework to evaluate the quality of representations in unlabelled datasets.
We estimate the coefficient of the power law, $\alpha$, across three key attributes which influence representation learning. Notably, $\alpha$ is computable from the representations without knowledge of any labels, thereby offering a framework to evaluate the quality of representations in unlabelled datasets.
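A common label-free way to obtain such an $\alpha$ is to fit a power law $\lambda_i \propto i^{-\alpha}$ to the eigenspectrum of the representation covariance by least squares on log-log axes. The sketch below illustrates that recipe; it is a plausible reconstruction of the general technique, not necessarily this paper's exact estimator, and the cutoff `k` is an illustrative choice.

```python
import numpy as np

def spectral_alpha(reps, k=20):
    """Estimate the power-law exponent alpha of the eigenspectrum of the
    representation covariance (lambda_i ~ i^{-alpha}), using a linear fit
    of log(eigenvalue) against log(rank) over the top-k eigenvalues."""
    reps = reps - reps.mean(axis=0)              # center the representations
    cov = reps.T @ reps / len(reps)              # sample covariance matrix
    eig = np.sort(np.linalg.eigvalsh(cov))[::-1][:k]  # top-k eigenvalues
    ranks = np.arange(1, len(eig) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(eig + 1e-12), 1)
    return -slope                                # alpha is minus the slope
```

No labels enter the computation, which is what makes this usable as a quality probe on unlabelled datasets.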
arXiv Detail & Related papers (2022-02-11T18:11:32Z)
- Semi-supervised Left Atrium Segmentation with Mutual Consistency Training [60.59108570938163]
We propose a novel Mutual Consistency Network (MC-Net) for semi-supervised left atrium segmentation from 3D MR images.
Our MC-Net consists of one encoder and two slightly different decoders, and the prediction discrepancies between the two decoders are transformed into an unsupervised loss.
We evaluate our MC-Net on the public Left Atrium (LA) database and it obtains impressive performance gains by exploiting the unlabeled data effectively.
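The discrepancy-to-loss idea can be illustrated with a small sketch: each decoder's probability output is sharpened into a soft pseudo-label that supervises the other decoder. This is a hedged illustration of mutual consistency training; the sharpening temperature and the mean-squared penalty are illustrative choices, not necessarily MC-Net's exact formulation.

```python
import numpy as np

def sharpen(p, T=0.5):
    """Temperature-sharpen a probability distribution (T < 1 peaks it)."""
    q = p ** (1.0 / T)
    return q / q.sum(axis=-1, keepdims=True)

def mutual_consistency_loss(p1, p2, T=0.5):
    """Turn the discrepancy between two decoders' predictions into an
    unsupervised loss: each decoder regresses onto the sharpened output
    of the other, so the branches pull each other toward agreement."""
    return (np.mean((p1 - sharpen(p2, T)) ** 2)
            + np.mean((p2 - sharpen(p1, T)) ** 2))
```

When the decoders already agree the loss is small; large disagreements on unlabeled voxels produce a strong training signal, which is how the unlabeled data gets exploited.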
arXiv Detail & Related papers (2021-03-04T09:34:32Z)
- A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-Identification [65.18004601366066]
Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages.
This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID.
arXiv Detail & Related papers (2020-09-05T16:06:04Z)
- Structural Deep Clustering Network [45.370272344031285]
We propose a Structural Deep Clustering Network (SDCN) to integrate the structural information into deep clustering.
Specifically, we design a delivery operator to transfer the representations learned by the autoencoder to the corresponding GCN layer.
In this way, the multiple structures of the data, from low-order to high-order, are naturally combined with the multiple representations learned by the autoencoder.
arXiv Detail & Related papers (2020-02-05T04:33:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.