Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
- URL: http://arxiv.org/abs/2602.16455v1
- Date: Wed, 18 Feb 2026 13:40:53 GMT
- Title: Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing
- Authors: Jinsong Li, Xiaoyi Dong, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Dahua Lin
- Abstract summary: Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a ``visual anchor'' to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors.
- Score: 76.2602505940467
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities for reasoning and self-correction at the textual level, these strengths provide minimal benefits for complex tasks centered on visual perception, such as Chart Parsing. Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a ``visual anchor'' to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors. We instantiate the VSR paradigm in the domain of Chart Parsing by proposing ChartVSR. This model decomposes the parsing process into two stages: a Refine Stage, where it iteratively uses visual feedback to ensure the accuracy of all data points' Pixel-level Localizations, and a Decode Stage, where it uses these verified localizations as precise visual anchors to parse the final structured data. To address the limitations of existing benchmarks, we also construct ChartP-Bench, a new and highly challenging benchmark for chart parsing. Our work also highlights VSR as a general-purpose visual feedback mechanism, offering a promising new direction for enhancing accuracy on a wide range of vision-centric tasks.
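The refine-then-decode loop is concrete enough to sketch. In the toy version below, the model proposes pixel-level anchors for every data point, the anchors are rendered back onto the chart, and the annotated image is fed back for inspection until the model confirms them; decoding is then conditioned on the verified anchors. The `lvlm` object, its `generate` method, and all prompt strings are hypothetical stand-ins, not the authors' released interface.

```python
# A minimal sketch of the VSR refine-then-decode loop, assuming a
# hypothetical `lvlm.generate(image, prompt=...)` interface.
import json
from PIL import Image, ImageDraw

def draw_anchors(chart: Image.Image, points: list[tuple[float, float]]) -> Image.Image:
    """Render the model's pixel-level localizations back onto the chart."""
    annotated = chart.copy()
    draw = ImageDraw.Draw(annotated)
    for x, y in points:
        draw.ellipse([x - 4, y - 4, x + 4, y + 4], outline="red", width=2)
    return annotated

def chart_vsr(lvlm, chart: Image.Image, max_rounds: int = 3) -> dict:
    # Refine Stage: propose anchors, visualize them, and let the model
    # inspect its own rendering to catch omissions and misalignments.
    points = lvlm.generate(chart, prompt="Locate every data point as pixel (x, y).")
    for _ in range(max_rounds):
        annotated = draw_anchors(chart, points)
        feedback = lvlm.generate(
            annotated,
            prompt="Red markers show your localizations. Reply OK or a corrected list.",
        )
        if feedback == "OK":
            break  # anchors verified; stop refining
        points = feedback  # adopt the corrected localizations
    # Decode Stage: parse the structured data using the verified anchors.
    table = lvlm.generate(
        draw_anchors(chart, points),
        prompt="Using the marked anchors, output the chart's data table as JSON.",
    )
    return json.loads(table)
```

The rendering step is what turns the model's own output into a checkable visual artifact, mirroring the finger-as-anchor strategy the abstract describes.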
Related papers
- Rethinking Prompt Design for Inference-time Scaling in Text-to-Visual Generation [63.042451267669485]
We propose Prompt Redesign for Inference-time Scaling, a framework that adaptively revises the prompt during inference in response to scaled visual generations. We introduce a new verifier, element-level factual correction, which evaluates the alignment between prompt attributes and generated visuals at a fine-grained level. Experiments on both text-to-image and text-to-video benchmarks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2025-12-03T07:54:05Z)
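As a rough illustration of the adaptive revision loop this entry describes (generate candidates, score per-attribute alignment with a verifier, rewrite the prompt around whatever fails), here is a minimal sketch; `generator`, `verifier`, and `reviser` are hypothetical callables, not the paper's released interface.

```python
# Hedged sketch of inference-time prompt redesign. All three callables are
# hypothetical placeholders: `generator` maps a prompt to an image/video,
# `verifier` returns per-attribute alignment scores in [0, 1], and
# `reviser` rewrites the prompt to stress the failing attributes.
def prompt_redesign_loop(prompt, generator, verifier, reviser,
                         n_samples=4, rounds=3, threshold=0.9):
    best = None
    for _ in range(rounds):
        candidates = [generator(prompt) for _ in range(n_samples)]
        scores = [verifier(prompt, c) for c in candidates]  # list of dicts
        # Keep the candidate whose weakest attribute scores highest.
        best = max(zip(candidates, scores), key=lambda cs: min(cs[1].values()))
        failing = [attr for attr, s in best[1].items() if s < threshold]
        if not failing:
            break  # every prompt attribute is faithfully rendered
        prompt = reviser(prompt, failing)
    return best[0]
```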
- ChartAB: A Benchmark for Chart Grounding & Dense Alignment [17.16234793106]
We introduce a novel "ChartAlign Benchmark (ChartAB)" to provide a comprehensive evaluation of vision-language models (VLMs). By incorporating a novel two-stage inference workflow, the benchmark can further evaluate VLMs' capability to align and compare elements/attributes across two charts. Our analysis of the evaluations reveals new insights into their perception biases, weaknesses, robustness, and hallucinations in chart understanding.
arXiv Detail & Related papers (2025-10-30T17:56:31Z)
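The two-stage workflow lends itself to a short sketch: ground elements in each chart separately, then condition a cross-chart comparison on those grounded outputs. The `vlm.ask` interface and the prompt wording below are assumptions for illustration, not ChartAB's actual harness.

```python
# Illustrative two-stage inference for cross-chart alignment, assuming a
# hypothetical `vlm.ask(images, prompt)` single-call interface.
def two_stage_compare(vlm, chart_a, chart_b, attribute="bar heights"):
    # Stage 1: dense grounding, one chart at a time.
    elems_a = vlm.ask(chart_a, f"List each of the {attribute} with its bounding box.")
    elems_b = vlm.ask(chart_b, f"List each of the {attribute} with its bounding box.")
    # Stage 2: alignment and comparison, conditioned on Stage-1 outputs.
    prompt = (
        f"Chart A elements: {elems_a}\n"
        f"Chart B elements: {elems_b}\n"
        f"Match corresponding elements and report how the {attribute} differ."
    )
    return vlm.ask([chart_a, chart_b], prompt)
```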
- BigCharts-R1: Enhanced Chart Reasoning with Visual Reinforcement Finetuning [51.472854950300416]
We propose BigCharts, a dataset creation pipeline that generates visually diverse chart images. Unlike purely synthetic datasets, BigCharts incorporates real-world data, ensuring authenticity and visual diversity. By introducing novel reward signals specifically designed for chart reasoning, our approach enhances model robustness and generalization.
arXiv Detail & Related papers (2025-08-13T13:39:17Z)
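The entry does not spell out what the reward signals look like, so the sketch below is only one plausible shape (format validity plus tolerance-banded numeric correctness), not the rewards BigCharts-R1 actually uses.

```python
# Hypothetical chart-reasoning reward: zero for unparseable output, full
# credit within a relative tolerance of the ground truth, partial credit in
# a wider band. The bands and weights are assumptions, not the paper's design.
def chart_reward(prediction: str, target: float, tol: float = 0.05) -> float:
    try:
        value = float(prediction.strip())
    except ValueError:
        return 0.0  # malformed answer earns nothing
    rel_err = abs(value - target) / max(abs(target), 1e-8)
    if rel_err <= tol:
        return 1.0
    if rel_err <= 2 * tol:
        return 0.2  # near miss: small partial credit
    return 0.0
```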
- CAPE: A CLIP-Aware Pointing Ensemble of Complementary Heatmap Cues for Embodied Reference Understanding [56.30142869506262]
Embodied Reference Understanding involves predicting the object that a person in the scene is referring to through both a pointing gesture and language. We propose a dual-model framework, where one model learns from the head-to-fingertip direction and the other from the wrist-to-fingertip direction. We validate our approach through extensive experiments and analysis on the benchmark YouRefIt dataset, achieving an improvement of approximately 4 mAP at the 0.25 IoU threshold.
arXiv Detail & Related papers (2025-07-29T15:00:21Z)
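A simple way to combine the two directional cues is to blend the models' heatmaps and take the fused peak; the sketch below shows that baseline, with the weighting and array shapes as assumptions rather than CAPE's actual ensemble.

```python
# Baseline fusion of two complementary pointing heatmaps (both HxW,
# normalized). The 50/50 blend is an assumption, not CAPE's learned scheme.
import numpy as np

def fuse_pointing_heatmaps(head_to_tip: np.ndarray,
                           wrist_to_tip: np.ndarray,
                           alpha: float = 0.5) -> tuple[int, int]:
    """Blend the two cues and return the (row, col) of the fused peak."""
    fused = alpha * head_to_tip + (1.0 - alpha) * wrist_to_tip
    row, col = np.unravel_index(int(np.argmax(fused)), fused.shape)
    return int(row), int(col)
```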
- RefChartQA: Grounding Visual Answer on Chart Images through Instruction Tuning [63.599057862999]
RefChartQA is a novel benchmark that integrates Chart Question Answering (ChartQA) with visual grounding. Our experiments demonstrate that incorporating spatial awareness via grounding improves response accuracy by over 15%.
arXiv Detail & Related papers (2025-03-29T15:50:08Z)
- On the Perception Bottleneck of VLMs for Chart Understanding [35.2285781015848]
Chart understanding requires models to analyze and reason about numerical data, textual elements, and complex visual components. Our observations reveal that the perception capabilities of existing large vision-language models (LVLMs) constitute a critical bottleneck in this process. In this study, we delve into this perception bottleneck by decomposing it into two components: the vision encoder bottleneck and the extraction bottleneck.
arXiv Detail & Related papers (2025-03-24T08:33:58Z)
- End-to-End Chart Summarization via Visual Chain-of-Thought in Vision-Language Models [0.0]
This paper introduces End-to-End Visual Chain-of-Thought (V-CoT) for chart summarization. Our method directly trains an LVLM to process chart images and generate textual summaries in an end-to-end fashion. We incorporate a visual Chain-of-Thought mechanism through instruction fine-tuning, implicitly guiding the LVLM to perform visual reasoning steps.
arXiv Detail & Related papers (2025-02-24T19:13:45Z)
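One way to picture the instruction fine-tuning is as training records whose responses walk through the visual evidence before the summary; the field names and contents below are hypothetical, not the paper's released data format.

```python
# A hypothetical V-CoT-style instruction-tuning record: the target response
# verbalizes visual reasoning steps before committing to the summary.
vcot_sample = {
    "image": "charts/quarterly_revenue.png",  # hypothetical file path
    "instruction": "Summarize this chart. Reason over the visuals first.",
    "response": (
        "Reasoning: the x-axis lists quarters Q1-Q4; the y-axis is revenue "
        "in $M; the tallest bar is Q4 at roughly 12; Q2 dips below Q1.\n"
        "Summary: revenue grew over the year, dipping in Q2 before peaking "
        "at about $12M in Q4."
    ),
}
```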
- Semantic Object-level Modeling for Robust Visual Camera Relocalization [14.998133272060695]
We propose a novel method of automatic object-level voxel modeling for accurate ellipsoidal representations of objects. All of these modules are fully integrated into a visual SLAM system.
arXiv Detail & Related papers (2024-02-10T13:39:44Z)
- Exploring Part-Informed Visual-Language Learning for Person Re-Identification [52.92511980835272]
We propose Part-Informed Visual-language Learning ($\pi$-VL) to enhance fine-grained visual features with part-informed language supervision for ReID tasks. $\pi$-VL introduces a human parsing-guided prompt tuning strategy and a hierarchical visual-language alignment paradigm to ensure within-part feature semantic consistency. As a plug-and-play and inference-free solution, our $\pi$-VL achieves performance comparable to or better than state-of-the-art methods on four commonly used ReID benchmarks.
arXiv Detail & Related papers (2023-08-04T23:13:49Z)