Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
- URL: http://arxiv.org/abs/2509.09286v1
- Date: Thu, 11 Sep 2025 09:22:16 GMT
- Title: Visual Programmability: A Guide for Code-as-Thought in Chart Understanding
- Authors: Bohao Tang, Yan Ma, Fei Zhang, Jiadi Su, Ethan Chern, Zhulin Hu, Zhixin Wang, Pengfei Liu, Ya Zhang
- Abstract summary: We propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Visual Programmability is a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. We implement this concept in an adaptive framework where a Vision-Language Model (VLM) learns to choose between the CaT pathway and a direct visual reasoning pathway.
- Score: 37.44645754630439
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Chart understanding presents a critical test of the reasoning capabilities of Vision-Language Models (VLMs). Prior approaches face key limitations: some rely on external tools, making them brittle and constrained by a predefined toolkit, while others fine-tune specialist models that often adopt a single reasoning strategy, such as text-based chain-of-thought (CoT). The intermediate steps of text-based reasoning are difficult to verify, which complicates the use of reinforcement-learning signals that reward factual accuracy. To address this, we propose a Code-as-Thought (CaT) approach to represent the visual information of a chart in a verifiable, symbolic format. Our key insight is that this strategy must be adaptive: a fixed, code-only implementation consistently fails on complex charts where symbolic representation is unsuitable. This finding leads us to introduce Visual Programmability: a learnable property that determines if a chart-question pair is better solved with code or direct visual analysis. We implement this concept in an adaptive framework where a VLM learns to choose between the CaT pathway and a direct visual reasoning pathway. The selection policy of the model is trained with reinforcement learning using a novel dual-reward system. This system combines a data-accuracy reward to ground the model in facts and prevent numerical hallucination, with a decision reward that teaches the model when to use each strategy, preventing it from defaulting to a single reasoning mode. Experiments demonstrate strong and robust performance across diverse chart-understanding benchmarks. Our work shows that VLMs can be taught not only to reason but also how to reason, dynamically selecting the optimal reasoning pathway for each task.
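The abstract's dual-reward system can be sketched as follows. This is a hypothetical illustration of the idea, not the authors' implementation: the function names, tolerance, matching rules, and reward weights are all assumptions.

```python
# Hypothetical sketch of a dual-reward signal combining a data-accuracy
# reward with a pathway-decision reward, as described in the abstract.
# All names and constants here are illustrative assumptions.

def data_accuracy_reward(pred, truth, rel_tol=0.05):
    """Ground the model in facts: 1.0 if the prediction matches the ground
    truth (numerically within a relative tolerance, or as a case-insensitive
    string match), else 0.0. Discourages numerical hallucination."""
    try:
        p, t = float(pred), float(truth)
        return 1.0 if abs(p - t) <= rel_tol * max(1.0, abs(t)) else 0.0
    except (TypeError, ValueError):
        return 1.0 if str(pred).strip().lower() == str(truth).strip().lower() else 0.0

def decision_reward(chose_code, code_correct, visual_correct):
    """Reward the pathway choice itself: 0 when both pathways succeed or
    both fail (the choice did not matter), +1 when the model picked the
    pathway that succeeds, -1 when it picked the failing one. This is what
    keeps the policy from collapsing to a single reasoning mode."""
    if code_correct == visual_correct:
        return 0.0
    # Exactly one pathway succeeds; the winning choice is "code" iff
    # the code pathway is the correct one.
    return 1.0 if chose_code == code_correct else -1.0

def dual_reward(pred, truth, chose_code, code_correct, visual_correct,
                w_acc=1.0, w_dec=0.5):
    """Weighted combination of the two reward terms (weights assumed)."""
    return (w_acc * data_accuracy_reward(pred, truth)
            + w_dec * decision_reward(chose_code, code_correct, visual_correct))
```

Under this sketch, a model that answers correctly via the code pathway when direct visual reasoning would have failed receives both reward terms, while a correct answer reached through the "wrong" pathway still earns the accuracy term but is penalized on the decision term.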
Related papers
- On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks [56.98385132295952]
We evaluate how well chain-of-thought approaches generalize on a simple planning task. We find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Purely text-based models consistently outperform those utilizing image-based inputs.
arXiv Detail & Related papers (2026-02-17T09:51:40Z) - ChartVerse: Scaling Chart Reasoning via Reliable Programmatic Synthesis from Scratch [57.01439313241121]
We introduce Rollout Posterior Entropy (RPE), a novel metric that quantifies chart complexity. We also develop truth-anchored inverse QA synthesis to guarantee reasoning rigor. To further elevate difficulty and reasoning depth, we filter samples based on model fail-rate and distill high-quality Chain-of-Thought (CoT) reasoning.
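A metric named "Rollout Posterior Entropy" plausibly measures the spread of answers a model produces across repeated rollouts on the same chart question. The sketch below is an assumption about what such a metric could look like (Shannon entropy over the empirical answer distribution); the actual definition in ChartVerse may differ.

```python
from collections import Counter
import math

def rollout_posterior_entropy(answers):
    """Hypothetical sketch: Shannon entropy (in bits) of the empirical
    distribution of final answers over N rollouts. If every rollout agrees,
    the entropy is 0 (an easy chart); answers spread evenly across many
    distinct values give high entropy (a hard, complex chart)."""
    counts = Counter(answers)
    n = len(answers)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For example, eight rollouts that all answer "42" yield entropy 0.0, while rollouts split evenly between two answers yield 1.0 bit, which would let a pipeline rank chart-question pairs by difficulty.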
arXiv Detail & Related papers (2026-01-20T05:11:44Z) - MiCo: Multi-image Contrast for Reinforcement Visual Reasoning [72.81576836419373]
Chain-of-Thought (CoT) reasoning can be used to link visual cues across multiple images. We adapt rule-based reinforcement learning for Vision-Language Models (VLMs). Our method achieves significant improvements on multi-image reasoning benchmarks and shows strong performance on general vision tasks.
arXiv Detail & Related papers (2025-06-27T17:59:27Z) - ChartReasoner: Code-Driven Modality Bridging for Long-Chain Reasoning in Chart Question Answering [12.285453136336507]
We propose a code-driven framework designed to enable precise, interpretable reasoning over charts. We first train a high-fidelity model to convert diverse chart images into structured ECharts codes. Then, we design a general chart reasoning data synthesis pipeline. Finally, we train the final multimodal model using a combination of supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-06-11T18:55:36Z) - ChartSketcher: Reasoning with Multimodal Feedback and Reflection for Chart Understanding [18.67532755744138]
Automated chart understanding poses significant challenges to existing multimodal large language models. Current step-by-step reasoning models primarily focus on text-based logical reasoning for chart understanding. We propose ChartSketcher, a multimodal feedback-driven step-by-step reasoning method designed to address these limitations.
arXiv Detail & Related papers (2025-05-25T10:21:29Z) - Compile Scene Graphs with Reinforcement Learning [69.36723767339001]
Next-token prediction is the fundamental principle for training large language models (LLMs). We introduce R1-SGG, a multimodal LLM (M-LLM) initially trained via supervised fine-tuning (SFT) on the scene graph dataset. We design a set of graph-centric rewards, including three recall-based variants: Hard Recall, Hard Recall+Relax, and Soft Recall.
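The contrast between hard and soft recall over scene-graph triplets can be illustrated with a short sketch. This is a guess at the general shape of such rewards, assuming (subject, predicate, object) triplets; the exact definitions in R1-SGG (including the "Hard Recall+Relax" variant) may differ.

```python
from difflib import SequenceMatcher

# Hypothetical sketch of recall-based graph rewards; definitions are
# assumptions in the spirit of the summary above, not the paper's own.

def hard_recall(pred_triplets, gt_triplets):
    """Fraction of ground-truth (subject, predicate, object) triplets
    matched verbatim by some predicted triplet."""
    pred = set(pred_triplets)
    return sum(t in pred for t in gt_triplets) / len(gt_triplets)

def soft_recall(pred_triplets, gt_triplets):
    """Relaxed variant: each ground-truth triplet is credited with its best
    string-similarity score against any predicted triplet, so near-misses
    still earn partial reward."""
    def sim(a, b):
        return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()
    return sum(max((sim(t, p) for p in pred_triplets), default=0.0)
               for t in gt_triplets) / len(gt_triplets)
```

Because soft recall credits partial matches, it is always at least as large as hard recall on the same prediction, which gives a denser learning signal early in RL training.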
arXiv Detail & Related papers (2025-04-18T10:46:22Z) - Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models [58.64449765678416]
We introduce landscape of thoughts (LoT) to inspect the reasoning trajectories with certain reasoning methods on any multi-choice dataset. LoT distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. We showcase this advantage by adapting LoT to a lightweight verifier that evaluates the correctness of trajectories.
arXiv Detail & Related papers (2025-03-28T06:09:51Z) - End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models [0.0]
This paper introduces End-to-End Visual Chain-of-Thought (V-CoT) for chart summarization. Our method directly trains an LVLM to process chart images and generate textual summaries in an end-to-end fashion. We incorporate a visual Chain-of-Thought mechanism through instruction fine-tuning, implicitly guiding the LVLM to perform visual reasoning steps.
arXiv Detail & Related papers (2025-02-24T19:13:45Z) - VProChart: Answering Chart Question through Visual Perception Alignment Agent and Programmatic Solution Reasoning [13.011899331656018]
VProChart is a novel framework designed to address the challenges of Chart Question Answering (CQA). It integrates a lightweight Visual Perception Alignment Agent (VPAgent) and a Programmatic Solution Reasoning approach. VProChart significantly outperforms existing methods, highlighting its capability in understanding and reasoning with charts.
arXiv Detail & Related papers (2024-09-03T07:19:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.