RECODE: Reasoning Through Code Generation for Visual Question Answering
- URL: http://arxiv.org/abs/2510.13756v1
- Date: Wed, 15 Oct 2025 17:05:37 GMT
- Title: RECODE: Reasoning Through Code Generation for Visual Question Answering
- Authors: Junhong Shen, Mu Cai, Bo Hu, Ameet Talwalkar, David A Ross, Cordelia Schmid, Alireza Fathi,
- Abstract summary: We propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) struggle with precise reasoning for structured visuals like charts and diagrams, as pixel-based perception lacks a mechanism for verification. To address this, we propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Specifically, we propose RECODE, an agentic framework that first generates multiple candidate programs to reproduce the input image. It then uses a critic to select the most faithful reconstruction and iteratively refines the code. This process not only transforms an ambiguous perceptual task into a verifiable, symbolic problem, but also enables precise calculations and logical inferences later on. On various visual reasoning benchmarks such as CharXiv, ChartQA, and Geometry3K, RECODE significantly outperforms methods that do not leverage code or only use code for drawing auxiliary lines or cropping. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
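The generate-select-refine loop described in the abstract can be sketched as follows. This is a minimal, hypothetical illustration of the agentic pattern, not the paper's implementation: `propose`, `render`, and `critic` are placeholder callables standing in for the MLLM's code generator, a plot/diagram renderer, and the faithfulness critic, and the candidate counts and round limits are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    code: str     # a program intended to reproduce the input image
    score: float  # critic's faithfulness estimate for its rendering


def derender(image, propose, render, critic, n_candidates=4, n_rounds=3):
    """RECODE-style sketch: propose several candidate programs that
    reproduce `image`, keep the most faithful one, then refine it
    iteratively. All three helpers are caller-supplied placeholders."""
    # Stage 1: sample multiple candidate programs and score each rendering.
    candidates = [Candidate(c, critic(image, render(c)))
                  for c in propose(image, n_candidates)]
    best = max(candidates, key=lambda c: c.score)

    # Stage 2: iterative refinement, keeping a revision only if the
    # critic judges its rendering more faithful than the current best.
    for _ in range(n_rounds):
        revised = propose(image, 1, feedback=best.code)[0]
        revised_score = critic(image, render(revised))
        if revised_score > best.score:
            best = Candidate(revised, revised_score)

    # The returned code is executable, so downstream calculations and
    # logical inferences can run on its symbolic output rather than pixels.
    return best
```

The key design point the abstract emphasizes is that the critic compares a *rendering* of the candidate code against the input image, which turns an ambiguous perceptual judgment into a verifiable reconstruction check.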
Related papers
- Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing [76.2602505940467]
Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors.
arXiv Detail & Related papers (2026-02-18T13:40:53Z)
- CodePlot-CoT: Mathematical Visual Reasoning by Thinking with Code-Driven Images [69.93976232543066]
We propose CodePlot-CoT, a code-driven Chain-of-Thought paradigm for "thinking with images" in mathematics. To achieve this, we first construct Math-VR, the first large-scale, bilingual dataset and benchmark for Mathematics problems with Visual Reasoning. Our model achieves up to a 21% increase over the base model on our new benchmark, fully validating the efficacy of our proposed code-driven reasoning paradigm.
arXiv Detail & Related papers (2025-10-13T17:59:55Z)
- PixelCraft: A Multi-Agent System for High-Fidelity Visual Reasoning on Structured Images [58.73779101355669]
PixelCraft is a novel multi-agent system for high-fidelity image processing and flexible visual reasoning on structured images. Building on this foundation, PixelCraft facilitates visual reasoning through a dynamic three-stage workflow of tool selection, agent discussion, and self-criticism.
arXiv Detail & Related papers (2025-09-29T17:59:49Z)
- Visual Programmability: A Guide for Code-as-Thought in Chart Understanding [37.44645754630439]
We propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Visual Programmability is a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. We implement this concept in an adaptive framework where a Vision-Language Model (VLM) learns to choose between the CaT pathway and a direct visual reasoning pathway.
arXiv Detail & Related papers (2025-09-11T09:22:16Z)
- Learning Only with Images: Visual Reinforcement Learning with Reasoning, Rendering, and Visual Feedback [33.127607245587576]
We introduce a framework that enables MLLMs to learn complex visual reasoning from only raw images. We demonstrate that this relative ease provides an ideal reward signal for optimization via Reinforcement Learning. The RRVF-trained model not only outperforms existing MLLMs and supervised fine-tuning baselines but also exhibits superior generalization.
arXiv Detail & Related papers (2025-07-28T12:21:19Z)
- VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning [95.89543460132413]
Vision-language models (VLMs) have improved performance by increasing the number of visual tokens. However, most real-world scenarios do not require such an extensive number of visual tokens. We present a new paradigm for visual token compression, namely, VisionThink.
arXiv Detail & Related papers (2025-07-17T17:59:55Z)
- VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation [69.35779796364413]
We present VisCode-200K, a large-scale instruction tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback.
arXiv Detail & Related papers (2025-06-04T13:24:44Z)
- LLM Code Customization with Visual Results: A Benchmark on TikZ [6.3303908500560615]
We introduce vTikZ, the first benchmark to evaluate the ability of Large Language Models to customize code while preserving coherent visual outcomes. Our benchmark consists of carefully curated vTikZ editing scenarios, parameterized ground truths, and a reviewing tool that leverages visual feedback to assess correctness.
arXiv Detail & Related papers (2025-05-07T08:26:54Z)
- Automated Visualization Code Synthesis via Multi-Path Reasoning and Feedback-Driven Optimization [13.178750787401263]
VisPath handles underspecified queries through structured, multi-stage processing. It begins by reformulating the user input via Chain-of-Thought prompting. VisPath generates targeted feedback that is aggregated to synthesize an optimal final result.
arXiv Detail & Related papers (2025-02-16T14:09:42Z)
- Scalable Image Tokenization with Index Backpropagation Quantization [74.15447383432262]
Index Backpropagation Quantization (IBQ) is a new VQ method for the joint optimization of all codebook embeddings and the visual encoder. IBQ enables scalable training of visual tokenizers and, for the first time, achieves a large-scale codebook with high dimension ($256$) and high utilization.
arXiv Detail & Related papers (2024-12-03T18:59:10Z)
- Chain of Images for Intuitively Reasoning [23.692458865558486]
We present a Chain of Images (CoI) approach to convert complex language reasoning problems to simple pattern recognition.
We have developed a CoI evaluation dataset encompassing 15 distinct domains where images can intuitively aid problem-solving.
In supporting our CoI reasoning, we introduce a symbolic multimodal large language model (SyMLLM) that generates images strictly based on language instructions.
arXiv Detail & Related papers (2023-11-09T11:14:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.