SVRepair: Structured Visual Reasoning for Automated Program Repair
- URL: http://arxiv.org/abs/2602.06090v1
- Date: Thu, 05 Feb 2026 06:26:46 GMT
- Title: SVRepair: Structured Visual Reasoning for Automated Program Repair
- Authors: Xiaoxuan Tang, Jincheng Wang, Liwei Luo, Jingxuan Xu, Sheng Zhou, Dajun Chen, Wei Jiang, Yong Li
- Abstract summary: Large language models (LLMs) have recently shown strong potential for Automated Program Repair (APR). We propose SVRepair, a multimodal APR framework with structured visual representation. SVRepair first fine-tunes a vision-language model, Structured Visual Representation (SVR), to uniformly transform heterogeneous visual artifacts into a semantic scene graph.
- Score: 17.545585659174773
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have recently shown strong potential for Automated Program Repair (APR), yet most existing approaches remain unimodal and fail to leverage the rich diagnostic signals contained in visual artifacts such as screenshots and control-flow graphs. In practice, many bug reports convey critical information visually (e.g., layout breakage or missing widgets), but directly using such dense visual inputs often causes context loss and noise, making it difficult for MLLMs to ground visual observations into precise fault localization and executable patches. To bridge this semantic gap, we propose SVRepair, a multimodal APR framework with structured visual representation. SVRepair first fine-tunes a vision-language model, Structured Visual Representation (SVR), to uniformly transform heterogeneous visual artifacts into a semantic scene graph that captures GUI elements and their structural relations (e.g., hierarchy), providing normalized, code-relevant context for downstream repair. Building on the graph, SVRepair drives a coding agent to localize faults and synthesize patches, and further introduces an iterative visual-artifact segmentation strategy that progressively narrows the input to bug-centered regions to suppress irrelevant context and reduce hallucinations. Extensive experiments across multiple benchmarks demonstrate state-of-the-art performance: SVRepair achieves 36.47% accuracy on SWE-Bench M, 38.02% on MMCode, and 95.12% on CodeVision, validating the effectiveness of SVRepair for multimodal program repair.
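The abstract describes a semantic scene graph of GUI elements and an iterative narrowing toward bug-centered regions. The following is a minimal Python sketch of that general idea, not the authors' implementation: the `Node` and `SceneGraph` classes, the keyword-based `score` heuristic, and `narrow_to_bug_region` are all hypothetical names introduced for illustration, and the real system uses a fine-tuned vision-language model rather than keyword matching.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """One GUI element in a (hypothetical) semantic scene graph."""
    node_id: str
    kind: str                      # e.g. "panel", "button", "input"
    bbox: tuple                    # (x0, y0, x1, y1) in screenshot pixels
    parent: str | None = None      # id of the enclosing element, if any
    text: str = ""                 # visible label or caption


@dataclass
class SceneGraph:
    nodes: dict = field(default_factory=dict)

    def add(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def children(self, node_id: str) -> list:
        return [n for n in self.nodes.values() if n.parent == node_id]

    def subtree(self, root_id: str) -> list:
        """Return the root node and all of its descendants."""
        out, stack = [], [root_id]
        while stack:
            nid = stack.pop()
            out.append(self.nodes[nid])
            stack.extend(c.node_id for c in self.children(nid))
        return out


def score(node: Node, graph: SceneGraph, keywords: list) -> int:
    """Toy relevance heuristic (an assumption): keyword hits in the subtree text."""
    text = " ".join(n.text.lower() for n in graph.subtree(node.node_id))
    return sum(kw.lower() in text for kw in keywords)


def narrow_to_bug_region(graph, root_id, keywords, max_steps=5):
    """Descend from the root toward the most relevant child, one level per step."""
    path = [root_id]
    for _ in range(max_steps):
        kids = graph.children(path[-1])
        if not kids:
            break
        best = max(kids, key=lambda k: score(k, graph, keywords))
        if score(best, graph, keywords) == 0:
            break  # no child mentions the bug; stop narrowing
        path.append(best.node_id)
    return path


def build_demo_graph() -> SceneGraph:
    g = SceneGraph()
    g.add(Node("page", "panel", (0, 0, 800, 600)))
    g.add(Node("header", "panel", (0, 0, 800, 80), parent="page", text="Site title"))
    g.add(Node("form", "panel", (200, 150, 600, 450), parent="page", text="Login form"))
    g.add(Node("email", "input", (220, 200, 580, 240), parent="form", text="Email"))
    g.add(Node("submit", "button", (220, 380, 340, 420), parent="form", text="Submit"))
    return g


if __name__ == "__main__":
    graph = build_demo_graph()
    path = narrow_to_bug_region(graph, "page", ["submit", "button"])
    print(path, graph.nodes[path[-1]].bbox)
```

In this toy setup the bounding box of the last node on the path stands in for the cropped region that would be passed to the next round; how SVRepair actually selects and crops regions is not specified in the abstract.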
Related papers
- Visual Self-Refine: A Pixel-Guided Paradigm for Accurate Chart Parsing [76.2602505940467]
Existing models often struggle with visually dense charts, leading to errors like data omission, misalignment, and hallucination. Inspired by the human strategy of using a finger as a "visual anchor" to ensure accuracy when reading complex charts, we propose a new paradigm named Visual Self-Refine (VSR). The core idea of VSR is to enable a model to generate pixel-level localization outputs, visualize them, and then feed these visualizations back to itself, allowing it to intuitively inspect and correct its own potential visual perception errors.
arXiv Detail & Related papers (2026-02-18T13:40:53Z)
- ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization [62.03035862528452]
ForgeryVCR is a framework that materializes imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks.
arXiv Detail & Related papers (2026-02-15T11:14:47Z)
- Thinking with Deltas: Incentivizing Reinforcement Learning via Differential Visual Reasoning Policy [75.66913260900726]
Reinforcement Learning with Verifiable Rewards has significantly advanced reasoning capabilities in Large Language Models. Existing paradigms, driven by text-centric outcome rewards, encourage models to bypass visual perception. We propose Thinking with Deltas, a framework driven by a Differential Visual Reasoning Policy.
arXiv Detail & Related papers (2026-01-11T08:25:34Z)
- Towards Fine-Grained Vision-Language Alignment for Few-Shot Anomaly Detection [65.29550320117526]
We propose a novel framework named FineGrainedAD to improve anomaly localization performance. Experiments demonstrate that the proposed FineGrainedAD achieves superior overall performance in few-shot settings.
arXiv Detail & Related papers (2025-10-30T13:09:00Z)
- Visual CoT Makes VLMs Smarter but More Fragile [79.32638667101817]
Chain-of-Thought (CoT) techniques have significantly enhanced reasoning in Vision-Language Models (VLMs). Visual CoT integrates explicit visual edits, such as cropping or annotating regions of interest, into the reasoning process. We present the first systematic evaluation of Visual CoT robustness under visual perturbations.
arXiv Detail & Related papers (2025-09-28T10:19:59Z)
- Visual Structures Helps Visual Reasoning: Addressing the Binding Problem in VLMs [9.406760867809124]
This paper introduces Visual Input Structure for Enhanced Reasoning (VISER). VISER is a simple, effective method that augments visual inputs with low-level spatial structures. We empirically demonstrate substantial performance improvements across core visual reasoning tasks.
arXiv Detail & Related papers (2025-06-27T11:44:40Z)
- Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing [41.75392938686494]
Large language model (LLM)-based automated program repair (APR) techniques have shown promising results in resolving real-world GitHub issue tasks. These autonomous systems struggle to resolve multimodal problem scenarios due to limitations in interpreting and leveraging visual information. We propose GUIRepair, a cross-modal reasoning approach for resolving multimodal issue scenarios by understanding and capturing visual information.
arXiv Detail & Related papers (2025-06-19T08:42:11Z)
- Exploring Part-Informed Visual-Language Learning for Person Re-Identification [52.92511980835272]
We propose Part-Informed Visual-language Learning ($\pi$-VL) to enhance fine-grained visual features with part-informed language supervisions for ReID tasks. $\pi$-VL introduces a human parsing-guided prompt tuning strategy and a hierarchical visual-language alignment paradigm to ensure within-part feature semantic consistency. As a plug-and-play and inference-free solution, our $\pi$-VL achieves performance comparable to or better than state-of-the-art methods on four commonly used ReID benchmarks.
arXiv Detail & Related papers (2023-08-04T23:13:49Z)
- Good Visual Guidance Makes A Better Extractor: Hierarchical Visual Prefix for Multimodal Entity and Relation Extraction [88.6585431949086]
We propose a novel Hierarchical Visual Prefix fusion NeTwork (HVPNeT) for visual-enhanced entity and relation extraction.
We regard the visual representation as a pluggable visual prefix that guides the textual representation toward error-insensitive forecasting decisions.
Experiments on three benchmark datasets demonstrate the effectiveness of our method, which achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-05-07T02:10:55Z)
- Contrastive Visual-Linguistic Pretraining [48.88553854384866]
Contrastive Visual-Linguistic Pretraining constructs a visual self-supervised loss built upon contrastive learning.
We evaluate it on several down-stream tasks, including VQA, GQA and NLVR2.
arXiv Detail & Related papers (2020-07-26T14:26:18Z)