Related papers: Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding

Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding

URL: http://arxiv.org/abs/2504.09764v1
Date: Mon, 14 Apr 2025 00:07:39 GMT
Title: Socratic Chart: Cooperating Multiple Agents for Robust SVG Chart Understanding
Authors: Yuyang Ji, Haohan Wang,
Abstract summary: Existing benchmarks reveal reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning.<n>We propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics representations.<n>Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance.
Score: 14.75820681491341
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLMs) have shown remarkable versatility but face challenges in demonstrating true visual understanding, particularly in chart reasoning tasks. Existing benchmarks like ChartQA reveal significant reliance on text-based shortcuts and probabilistic pattern-matching rather than genuine visual reasoning. To rigorously evaluate visual reasoning, we introduce a more challenging test scenario by removing textual labels and introducing chart perturbations in the ChartQA dataset. Under these conditions, models like GPT-4o and Gemini-2.0 Pro experience up to a 30% performance drop, underscoring their limitations. To address these challenges, we propose Socratic Chart, a new framework that transforms chart images into Scalable Vector Graphics (SVG) representations, enabling MLLMs to integrate textual and visual modalities for enhanced chart understanding. Socratic Chart employs a multi-agent pipeline with specialized agent-generators to extract primitive chart attributes (e.g., bar heights, line coordinates) and an agent-critic to validate results, ensuring high-fidelity symbolic representations. Our framework surpasses state-of-the-art models in accurately capturing chart primitives and improving reasoning performance, establishing a robust pathway for advancing MLLM visual understanding.

Related papers

ChartPoint: Guiding MLLMs with Grounding Reflection for Chart Reasoning [54.86473583610112]
We propose PointCoT, which integrates reflective interaction into chain-of-thought reasoning in charts.<n>By prompting MLLMs to generate bounding boxes and re-render charts based on location annotations, we establish connections between textual reasoning steps and visual grounding regions.<n>We develop two instruction-tuned models, ChartPointQ2 and ChartPointQ2.5, which outperform state-of-the-art across several chart benchmarks.
arXiv Detail & Related papers (2025-11-29T04:01:55Z)
ChartAgent: A Multimodal Agent for Visually Grounded Reasoning in Complex Chart Question Answering [23.455587605758396]
We introduce ChartAgent, a novel agentic framework that explicitly performs visual reasoning directly within the chart's spatial domain.<n>Our work is among the first to demonstrate visually grounded reasoning for chart understanding using tool-augmented multimodal agents.
arXiv Detail & Related papers (2025-10-06T06:05:36Z)
ChartMaster: Advancing Chart-to-Code Generation with Real-World Charts and Chart Similarity Reinforcement Learning [64.4193334712998]
The chart-to-code generation task requires MLLMs to convert chart images into executable code.<n>This task faces two main challenges: limited data diversity and the difficulty of maintaining visual consistency between generated charts and the original ones.<n>We propose ReChartPrompt, leveraging real-world, human-designed charts extracted from arXiv papers as prompts.<n>We also propose ChartSimRL, a GRPO-based reinforcement learning algorithm guided by a novel chart similarity reward.
arXiv Detail & Related papers (2025-08-25T02:32:56Z)
ChartLens: Fine-grained Visual Attribution in Charts [106.44872805609673]
Post-Hoc Visual Attribution for Charts identifies fine-grained chart elements that validate a given chart-associated response.<n>We propose ChartLens, a novel chart attribution algorithm that uses segmentation-based techniques to identify chart objects.<n>Our evaluations show that ChartLens improves fine-grained attributions by 26-66%.
arXiv Detail & Related papers (2025-05-25T23:17:32Z)
RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding.<n>Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z)
METAL: A Multi-Agent Framework for Chart Generation with Test-Time Scaling [100.33658998796064]
We build a vision-language model (VLM) based multi-agent framework for effective automatic chart generation.<n>We propose METAL, a multi-agent framework that decomposes the task of chart generation into the iterative collaboration among specialized agents.
arXiv Detail & Related papers (2025-02-24T21:01:39Z)
Graph-Based Multimodal Contrastive Learning for Chart Question Answering [11.828192162922436]
This work introduces a novel joint multimodal scene graph framework that explicitly models the relationships among chart components and their underlying structures.<n>The framework integrates both visual and textual graphs to capture structural and semantic characteristics.<n>A graph contrastive learning strategy aligns node representations across modalities enabling their seamless incorporation into a transformer decoder as soft prompts.
arXiv Detail & Related papers (2025-01-08T06:27:07Z)
VProChart: Answering Chart Question through Visual Perception Alignment Agent and Programmatic Solution Reasoning [13.011899331656018]
VProChart is a novel framework designed to address the challenges of Chart Question Answering (CQA) It integrates a lightweight Visual Perception Alignment Agent (VPAgent) and a Programmatic Solution Reasoning approach. VProChart significantly outperforms existing methods, highlighting its capability in understanding and reasoning with charts.
arXiv Detail & Related papers (2024-09-03T07:19:49Z)
MSG-Chart: Multimodal Scene Graph for ChartQA [11.828192162922436]
Automatic Chart Question Answering (ChartQA) is challenging due to the complex distribution of chart elements with patterns of the underlying data not explicitly displayed in charts. We design a joint multimodal scene graph for charts to explicitly represent the relationships between chart elements and their patterns. Our proposed multimodal scene graph includes a visual graph and a textual graph to jointly capture the structural and semantical knowledge from the chart.
arXiv Detail & Related papers (2024-08-09T04:11:23Z)
Advancing Chart Question Answering with Robust Chart Component Recognition [18.207819321127182]
We introduce a unified framework that enhances chart component recognition by accurately identifying and classifying components such as bars, lines, pies, titles, legends, and axes. We also propose a novel Question-guided Deformable Co-Attention mechanism, which fuses chart features encoded by Chartformer with the given question, leveraging the question's guidance to ground the correct answer.
arXiv Detail & Related papers (2024-07-19T20:55:06Z)
On Pre-training of Multimodal Language Models Customized for Chart Understanding [83.99377088129282]
This paper explores the training processes necessary to improve MLLMs' comprehension of charts. We introduce CHOPINLLM, an MLLM tailored for in-depth chart comprehension.
arXiv Detail & Related papers (2024-07-19T17:58:36Z)
TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning [83.58521787193293]
We present TinyChart, an efficient MLLM for chart understanding with only 3B parameters. TinyChart overcomes two key challenges in efficient chart understanding: (1) reduce the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, and (2) reduce lengthy vision feature sequences produced by the vision transformer for high-resolution images through a Vision Token Merging module.
arXiv Detail & Related papers (2024-04-25T14:23:24Z)
StructChart: On the Schema, Metric, and Augmentation for Visual Chart Understanding [54.45681512355684]
Current chart-related tasks focus on either chart perception that extracts information from the visual charts, or chart reasoning given the extracted data. We introduce StructChart, a novel framework that leverages Structured Triplet Representations (STR) to achieve a unified and label-efficient approach.
arXiv Detail & Related papers (2023-09-20T12:51:13Z)
ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules [89.75395046894809]
We present ChartReader, a unified framework that seamlessly integrates chart derendering and comprehension tasks. Our approach includes a transformer-based chart component detection module and an extended pre-trained vision-language model for chart-to-X tasks. Our proposed framework can significantly reduce the manual effort involved in chart analysis, providing a step towards a universal chart understanding model.
arXiv Detail & Related papers (2023-04-05T00:25:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.