$A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement
- URL: http://arxiv.org/abs/2507.20890v1
- Date: Mon, 28 Jul 2025 14:41:57 GMT
- Title: $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement
- Authors: Zhecheng Li, Guoxian Song, Yiwei Wang, Zhen Xiong, Junsong Yuan, Yujun Cai
- Abstract summary: Vision-language models (VLMs) have demonstrated strong performance across a variety of visual understanding tasks. We propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement. For effective evaluation, we introduce a new dataset, Img2LaTex-Hard-1K, consisting of 1,100 carefully curated and challenging examples.
- Score: 48.856390495568114
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Img2LaTeX is a practically significant task that involves converting mathematical expressions or tabular data from images into LaTeX code. In recent years, vision-language models (VLMs) have demonstrated strong performance across a variety of visual understanding tasks, owing to their generalization capabilities. While some studies have explored the use of VLMs for the Img2LaTeX task, their performance often falls short of expectations. Empirically, VLMs sometimes struggle with fine-grained visual elements, leading to inaccurate LaTeX predictions. To address this challenge, we propose $A^2R^2$: Advancing Img2LaTeX Conversion via Visual Reasoning with Attention-Guided Refinement, a framework that effectively integrates attention localization and iterative refinement within a visual reasoning framework, enabling VLMs to perform self-correction and progressively improve prediction quality. For effective evaluation, we introduce a new dataset, Img2LaTex-Hard-1K, consisting of 1,100 carefully curated and challenging examples designed to rigorously evaluate the capabilities of VLMs within this task domain. Extensive experimental results demonstrate that: (1) $A^2R^2$ significantly improves model performance across six evaluation metrics spanning both textual and visual levels, consistently outperforming other baseline methods; (2) Increasing the number of inference rounds yields notable performance gains, underscoring the potential of $A^2R^2$ in test-time scaling scenarios; (3) Ablation studies and human evaluations validate the practical effectiveness of our approach, as well as the strong synergy among its core components during inference.
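To make the described loop concrete, below is a minimal, hypothetical sketch of an attention-guided iterative self-correction cycle in the spirit of the abstract; the `vlm` object, its `generate(image, prompt)` method, the prompts, and the function name are illustrative assumptions, not the authors' implementation or API.

```python
# Hypothetical sketch of an attention-guided iterative refinement loop,
# loosely following the A^2R^2 abstract. The `vlm.generate(image, prompt)`
# interface is an assumed placeholder, not an API from the paper.

def img2latex_with_refinement(vlm, image, max_rounds=3):
    """Draft a LaTeX transcription, then iteratively self-correct it."""
    latex = vlm.generate(image, prompt="Transcribe this expression into LaTeX.")
    for _ in range(max_rounds):
        # Attention localization (stand-in): ask the model which fine-grained
        # regions of the image the current draft may have gotten wrong.
        suspects = vlm.generate(
            image,
            prompt=f"Draft LaTeX:\n{latex}\nList visual regions or symbols "
                   "that may be transcribed incorrectly.",
        )
        # Visual reasoning + refinement: condition the next draft on the
        # previous prediction and the localized suspect regions.
        revised = vlm.generate(
            image,
            prompt=f"Draft LaTeX:\n{latex}\nSuspect regions:\n{suspects}\n"
                   "Produce a corrected LaTeX transcription.",
        )
        if revised == latex:  # no further change; stop early
            break
        latex = revised
    return latex
```

In this reading, increasing the number of inference rounds simply corresponds to a larger `max_rounds`, which matches the abstract's observation that additional rounds yield further gains in test-time scaling.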
Related papers
- Text2Vis: A Challenging and Diverse Benchmark for Generating Multimodal Visualizations from Text [30.74255946385862]
We introduce Text2Vis, a benchmark designed to assess text-to-visualization models. It comprises 1,985 samples, each with a data table, natural language query, short answer, visualization code, and annotated charts. It reveals significant performance gaps, highlights key challenges, and offers insights for future advancements.
arXiv Detail & Related papers (2025-07-26T14:59:04Z) - Simple Semi-supervised Knowledge Distillation from Vision-Language Models via $\mathbf{\texttt{D}}$ual-$\mathbf{\texttt{H}}$ead $\mathbf{\texttt{O}}$ptimization [49.2338910653152]
Vision-language models (VLMs) have achieved remarkable success across diverse tasks by leveraging rich textual information with minimal labeled data. Knowledge distillation (KD) offers a well-established solution to this problem; however, recent KD approaches from VLMs often involve multi-stage training or additional tuning. We propose $\mathbf{\texttt{DHO}}$, a simple yet effective KD framework that transfers knowledge from VLMs to compact, task-specific models in semi-supervised settings.
arXiv Detail & Related papers (2025-05-12T15:39:51Z) - IGDA: Interactive Graph Discovery through Large Language Model Agents [6.704529554100875]
Large language models ($\textbf{LLMs}$) have emerged as a powerful method for discovery. We propose $\textbf{IGDA}$, a powerful method for graph discovery complementary to existing numerically driven approaches.
arXiv Detail & Related papers (2025-02-24T14:24:27Z) - Why Vision Language Models Struggle with Visual Arithmetic? Towards Enhanced Chart and Geometry Understanding [94.64781599202882]
Vision Language Models (VLMs) have achieved remarkable progress in multimodal tasks. However, they often struggle with visual arithmetic, including seemingly simple capabilities such as object counting and length comparison. We propose CogAlign, a novel post-training strategy inspired by Piaget's theory of cognitive development.
arXiv Detail & Related papers (2025-02-17T06:54:49Z) - Open Eyes, Then Reason: Fine-grained Visual Mathematical Understanding in MLLMs [62.875934732547435]
Current multimodal large language models (MLLMs) often underperform on mathematical problem-solving tasks that require fine-grained visual understanding. In this paper, we evaluate the visual grounding capabilities of state-of-the-art MLLMs and reveal a significant negative correlation between visual grounding accuracy and problem-solving performance. We propose a novel approach, SVE-Math, featuring a geometric-grounded vision encoder and a feature router that dynamically adjusts the contribution of hierarchical visual feature maps.
arXiv Detail & Related papers (2025-01-11T04:08:44Z) - NeKo: Toward Post Recognition Generative Correction Large Language Models with Task-Oriented Experts [57.53692236201343]
We propose a Multi-Task Correction MoE, where we train the experts to become an "expert" in speech-to-text, language-to-text, and vision-to-text datasets.
NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
arXiv Detail & Related papers (2024-11-08T20:11:24Z) - SimTeG: A Frustratingly Simple Approach Improves Textual Graph Learning [131.04781590452308]
We present SimTeG, a frustratingly Simple approach for Textual Graph learning.
We first perform supervised parameter-efficient fine-tuning (PEFT) on a pre-trained LM on the downstream task.
We then generate node embeddings using the last hidden states of the fine-tuned LM (a minimal sketch of this step appears after this entry).
arXiv Detail & Related papers (2023-08-03T07:00:04Z)
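As a rough illustration of the second SimTeG step summarized above, here is a minimal sketch of extracting node embeddings, assuming a language model already parameter-efficiently fine-tuned on the downstream task; the checkpoint path, batching, and mean-pooling choice are assumptions for illustration, not details taken from the paper.

```python
# Minimal sketch: encode node texts with a fine-tuned LM and use pooled
# last hidden states as node embeddings. "path/to/finetuned-lm" is a
# placeholder, not a released checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("path/to/finetuned-lm")
model = AutoModel.from_pretrained("path/to/finetuned-lm").eval()

@torch.no_grad()
def node_embeddings(texts, batch_size=32):
    """Return one embedding per node text via masked mean pooling."""
    embs = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, return_tensors="pt")
        hidden = model(**batch).last_hidden_state        # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1)     # (B, T, 1)
        pooled = (hidden * mask).sum(1) / mask.sum(1)    # masked mean pooling
        embs.append(pooled)
    return torch.cat(embs)  # feed these embeddings to a downstream GNN
```

The resulting embeddings would then serve as node features for whatever graph model handles the downstream task, which is the point of the pipeline's simplicity.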
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.