Simple Vision-Language Math Reasoning via Rendered Text
- URL: http://arxiv.org/abs/2511.11704v1
- Date: Wed, 12 Nov 2025 15:04:44 GMT
- Title: Simple Vision-Language Math Reasoning via Rendered Text
- Authors: Matvey Skripkin, Elizaveta Goncharova, Andrey Kuznetsov,
- Abstract summary: We present a lightweight yet effective pipeline for training vision-language models to solve math problems. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy.
- Score: 7.237955967317942
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a lightweight yet effective pipeline for training vision-language models to solve math problems by rendering LaTeX-encoded equations into images and pairing them with structured chain-of-thought prompts. This simple text-to-vision augmentation enables compact multimodal architectures to achieve state-of-the-art reasoning accuracy. Through systematic ablations, we find that rendering fidelity and prompt design are the primary drivers of performance. Despite its simplicity, our approach consistently matches or surpasses both open-source and proprietary math-focused vision-language solvers on widely used benchmarks, while preserving broad general-domain competence, with gains of up to 20% on tasks such as MMMU, ChartQA, and DocVQA.
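As a rough illustration of the text-to-vision augmentation described above, the sketch below renders a LaTeX equation to an image with matplotlib's mathtext and pairs it with a structured chain-of-thought target. The prompt template, the think/answer tags, and the helper names are illustrative assumptions, not the authors' exact pipeline.

```python
# Minimal sketch: render a LaTeX equation to an image and pair it with a
# structured chain-of-thought prompt. Uses matplotlib's mathtext renderer;
# the prompt template and field names are assumptions, not the paper's code.
import io
import matplotlib.pyplot as plt
from PIL import Image

def render_latex(latex: str, dpi: int = 200) -> Image.Image:
    """Rasterize a LaTeX math expression into a PIL image."""
    fig = plt.figure(figsize=(4, 1))
    fig.text(0.05, 0.5, f"${latex}$", fontsize=20, va="center")
    buf = io.BytesIO()
    fig.savefig(buf, format="png", dpi=dpi, bbox_inches="tight")
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def build_sample(latex: str, question: str, rationale: str, answer: str) -> dict:
    """Pair the rendered equation with a structured chain-of-thought target."""
    return {
        "image": render_latex(latex),
        "prompt": f"Solve the problem shown in the image. {question}\nThink step by step.",
        "target": f"<think>{rationale}</think>\n<answer>{answer}</answer>",
    }

# Example usage: one synthetic training instance.
sample = build_sample(
    latex=r"\int_0^1 3x^2\,dx",
    question="Evaluate the integral.",
    rationale="The antiderivative of 3x^2 is x^3; evaluating from 0 to 1 gives 1 - 0 = 1.",
    answer="1",
)
```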
Related papers
- ViSS-R1: Self-Supervised Reinforcement Video Reasoning [84.1180294023835]
We introduce a novel self-supervised reinforcement learning GRPO algorithm (Pretext-GRPO) within the standard R1 pipeline. We also propose the ViSS-R1 framework, which streamlines and integrates pretext-task-based self-supervised learning directly into the MLLM's R1 post-training paradigm.
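Pretext-GRPO's pretext-task specifics are not given in this summary, but the group-relative advantage at the core of standard GRPO can be sketched as follows (a minimal illustration, not the ViSS-R1 implementation):

```python
# Minimal sketch of the group-relative advantage used by standard GRPO:
# rewards for a group of sampled completions are normalized within the group.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_groups, group_size) scalar reward per sampled completion."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled answers each, reward = 1 if correct else 0.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
print(grpo_advantages(rewards))
```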
arXiv Detail & Related papers (2025-11-17T07:00:42Z)
- VEHME: A Vision-Language Model For Evaluating Handwritten Mathematics Expressions [11.210768330027674]
We introduce VEHME, a Vision-Language Model for evaluating handwritten math responses with high accuracy and interpretable reasoning traces. VEHME integrates a two-phase training pipeline: (i) supervised fine-tuning using structured reasoning data, and (ii) reinforcement learning that aligns model outputs with multi-dimensional grading objectives. VEHME achieves state-of-the-art performance among open-source models and approaches the accuracy of proprietary systems.
arXiv Detail & Related papers (2025-10-26T19:03:27Z)
- Table2LaTeX-RL: High-Fidelity LaTeX Code Generation from Table Images via Reinforced Multimodal Language Models [53.03670032402846]
We address the task of table image to code generation, with the goal of automating the reconstruction of high-quality, publication-ready tables from visual inputs. A central challenge of this task lies in accurately handling complex tables -- those with large sizes, deeply nested structures, and semantically rich or irregular cell content. We propose a reinforced multimodal large language model (MLLM) framework, where a pre-trained MLLM is fine-tuned on a large-scale table-to-LaTeX dataset.
arXiv Detail & Related papers (2025-09-22T11:13:48Z)
- ChartMind: A Comprehensive Benchmark for Complex Real-world Multimodal Chart Question Answering [14.468507852394923]
Chart question answering (CQA) has become a critical multimodal task for evaluating the reasoning capabilities of vision-language models. We introduce ChartMind, a new benchmark designed for complex CQA tasks in real-world settings. We propose a context-aware yet model-agnostic framework, ChartLLM, that focuses on extracting key contextual elements.
arXiv Detail & Related papers (2025-05-29T08:46:03Z)
- Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning [122.81815833343026]
We introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: first, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. On ChartQA, our approach improves accuracy from 70.88% (language-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT.
arXiv Detail & Related papers (2025-05-26T08:54:14Z)
- Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs). This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
- DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment [20.953645420787527]
We train a CLIP-like model with only a fraction of the computational cost compared to CLIP. We achieve state-of-the-art results in zero-shot classification and open-vocabulary semantic segmentation.
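For reference, the symmetric contrastive objective that "training a CLIP-like model" typically implies can be sketched as below; the embedding shapes and temperature value are assumptions, not details from that paper.

```python
# Minimal sketch of a CLIP-style symmetric contrastive loss: matching
# image-text pairs should score highest along both rows and columns of the
# batch similarity matrix.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb: torch.Tensor,
                          txt_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """img_emb, txt_emb: (batch, dim) embeddings from the two encoders."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0))         # matching pairs on the diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2
```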
arXiv Detail & Related papers (2024-12-20T20:46:48Z)
- VEGA: Learning Interleaved Image-Text Comprehension in Vision-Language Large Models [76.94378391979228]
We introduce a new, more demanding task known as Interleaved Image-Text Comprehension (IITC).
This task challenges models to discern and disregard superfluous elements in both images and text to accurately answer questions.
In support of this task, we further craft a new VEGA dataset, tailored for the IITC task on scientific content, and devise a subtask, Image-Text Association (ITA).
arXiv Detail & Related papers (2024-06-14T17:59:40Z)
- mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections [104.14624185375897]
mPLUG is a new vision-language foundation model for both cross-modal understanding and generation.
It achieves state-of-the-art results on a wide range of vision-language downstream tasks, such as image captioning, image-text retrieval, visual grounding and visual question answering.
arXiv Detail & Related papers (2022-05-24T11:52:06Z)
- Interpretable Neural Computation for Real-World Compositional Visual Question Answering [4.3668650778541895]
We build an interpretable framework for real-world compositional VQA.
In our framework, images and questions are disentangled into scene graphs and programs, and a symbolic program runs on them with full transparency to select the attention regions.
Experiments conducted on the GQA benchmark demonstrate that our framework performs on par with compositional prior arts and achieves competitive accuracy among monolithic models.
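A toy sketch of the scene-graph-plus-symbolic-program idea described above; the mini-DSL, object names, and attributes are invented for illustration and are not the paper's actual program vocabulary.

```python
# Toy sketch: a symbolic program runs over a scene graph, and each step's
# output is an explicit set of objects, so the reasoning trace is inspectable.
scene_graph = {
    "o1": {"name": "mug", "color": "red", "relations": {"on": ["o2"]}},
    "o2": {"name": "table", "color": "brown", "relations": {}},
}

def filter_name(objects, name):
    """Keep only objects with the given name."""
    return [o for o in objects if scene_graph[o]["name"] == name]

def related_to(targets, relation):
    """Find objects that stand in `relation` to any object in `targets`."""
    return [o for o, d in scene_graph.items()
            if any(t in d["relations"].get(relation, []) for t in targets)]

def query_attr(objects, attr):
    """Read an attribute off the (first) selected object."""
    return scene_graph[objects[0]][attr] if objects else "unknown"

# Program for "What color is the object on the table?"
tables = filter_name(scene_graph.keys(), "table")   # -> ["o2"]
on_table = related_to(tables, "on")                 # -> ["o1"]
print(query_attr(on_table, "color"))                # -> "red"
```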
arXiv Detail & Related papers (2020-10-10T05:46:22Z)