Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing
- URL: http://arxiv.org/abs/2506.16136v1
- Date: Thu, 19 Jun 2025 08:42:11 GMT
- Title: Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing
- Authors: Kai Huang, Jian Zhang, Xiaofei Xie, Chunyang Chen
- Abstract summary: Large language model (LLM)-based automated program repair (APR) techniques have shown promising results in resolving real-world GitHub issue tasks. These autonomous systems struggle to resolve multimodal problem scenarios due to limitations in interpreting and leveraging visual information. We propose GUIRepair, a cross-modal reasoning approach for resolving multimodal issue scenarios by understanding and capturing visual information.
- Score: 41.75392938686494
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language model (LLM)-based automated program repair (APR) techniques have shown promising results in resolving real-world GitHub issue tasks. Existing APR systems are primarily evaluated in unimodal settings (e.g., SWE-bench). However, these autonomous systems struggle to resolve multimodal problem scenarios (e.g., SWE-bench M) due to limitations in interpreting and leveraging visual information. In multimodal scenarios, LLMs need to rely on visual information in the graphical user interface (GUI) to understand bugs and generate fixes. To bridge this gap, we propose GUIRepair, a cross-modal reasoning approach for resolving multimodal issue scenarios by understanding and capturing visual information. Specifically, GUIRepair integrates two key components, Image2Code and Code2Image, to enhance fault comprehension and patch validation. Image2Code extracts relevant project documents based on the issue report, then applies this domain knowledge to generate the reproduced code responsible for the visual symptoms, effectively translating GUI images into executable context for better fault comprehension. Code2Image replays the visual issue scenario using the reproduced code and captures GUI renderings of the patched program to assess whether the fix visually resolves the issue, providing feedback for patch validation. We evaluate GUIRepair on SWE-bench M, and the approach demonstrates significant effectiveness. When utilizing GPT-4o as the base model, GUIRepair solves 157 instances, outperforming the best open-source baseline by 26 instances. Furthermore, when using o4-mini as the base model, GUIRepair can achieve even better results and solve 175 instances, outperforming the top commercial system by 22 instances. This emphasizes the success of our new perspective on incorporating cross-modal reasoning by understanding and capturing visual information to resolve multimodal issues.
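To make the Image2Code / Code2Image loop concrete, below is a minimal Python sketch of the cross-modal repair cycle described in the abstract. All names here (`image2code`, `code2image`, `visually_resolved`, and the `MLLM`/`Retriever`/`Renderer` callables) are hypothetical placeholders, not the paper's actual implementation; the real system's prompts, retrieval pipeline, and GUI rendering harness are not specified in this listing.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Issue:
    text: str          # natural-language issue report
    screenshot: bytes  # GUI image attached to the issue


# Placeholder interfaces (assumptions): a multimodal LLM client, a repository
# retriever, and a headless GUI renderer such as a browser harness.
MLLM = Callable[[str, bytes], str]       # (prompt, image) -> model response
Retriever = Callable[[str], str]         # issue text -> relevant project docs
Renderer = Callable[[str, str], bytes]   # (patch, repro code) -> screenshot


def image2code(issue: Issue, retrieve: Retriever, mllm: MLLM) -> str:
    """Image2Code step: translate the visual symptom into reproduction code."""
    docs = retrieve(issue.text)  # project documents relevant to the report
    prompt = (
        "Using the project context below, write code that reproduces the "
        f"visual bug shown in the screenshot.\n\n{docs}\n\nIssue: {issue.text}"
    )
    return mllm(prompt, issue.screenshot)


def code2image(patch: str, repro_code: str, render: Renderer) -> bytes:
    """Code2Image step: replay the scenario on the patched program and
    capture a GUI rendering of the result."""
    return render(patch, repro_code)


def visually_resolved(issue: Issue, patch: str, repro_code: str,
                      render: Renderer, mllm: MLLM) -> bool:
    """Patch validation: ask the MLLM whether the patched rendering still
    exhibits the reported visual bug."""
    patched_view = code2image(patch, repro_code, render)
    verdict = mllm(
        "Does this rendering still show the reported visual bug? "
        f"Answer yes or no.\nIssue: {issue.text}",
        patched_view,
    )
    return verdict.strip().lower().startswith("no")
```

The design point this sketch tries to capture is that validation closes the loop in the visual modality: the patched program is re-rendered and its screenshot, rather than the code diff alone, is judged against the original report.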
Related papers
- VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation [37.477428819390006]
We present VisCode-200K, a large-scale instruction tuning dataset for Python-based visualization and self-correction. It contains over 200K examples from two sources: (1) validated plotting code from open-source repositories, paired with natural language instructions and rendered plots; and (2) 45K multi-turn correction dialogues from Code-Feedback.
arXiv Detail & Related papers (2025-06-04T13:24:44Z) - GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents [93.49577107524176]
We propose GUI-Actor, a VLM-based method for coordinate-free GUI grounding. At its core, GUI-Actor introduces an attention-based action head that learns to align a dedicated <ACTOR> token with all relevant visual patch tokens. Experiments show that GUI-Actor outperforms prior state-of-the-art methods on multiple GUI action grounding benchmarks.
arXiv Detail & Related papers (2025-06-03T17:59:08Z) - Visualized Text-to-Image Retrieval [55.178938325324864]
We propose Visualize-then-Retrieve (VisRet), a new paradigm for Text-to-Image (T2I) retrieval. VisRet first projects textual queries into the image modality via T2I generation. It then performs retrieval within the image modality to bypass the weaknesses of cross-modal retrievers in recognizing subtle visual-spatial features.
arXiv Detail & Related papers (2025-05-26T17:59:33Z) - Enhancing Repository-Level Software Repair via Repository-Aware Knowledge Graphs [8.467850621024672]
Repository-level software repair faces challenges in bridging semantic gaps between issue descriptions and code patches. Existing approaches, which mostly depend on large language models (LLMs), suffer from semantic ambiguities, limited structural context understanding, and insufficient reasoning capability. We propose a novel repository-aware knowledge graph (KG) that accurately links repository artifacts (issues and pull requests) and entities.
arXiv Detail & Related papers (2025-03-27T17:21:47Z) - ShowUI: One Vision-Language-Action Model for GUI Visual Agent [80.50062396585004]
Building Graphical User Interface (GUI) assistants holds significant promise for enhancing human workflow productivity.
We develop ShowUI, a vision-language-action model for the digital world, which features the following innovations.
ShowUI, a lightweight 2B model trained on 256K data samples, achieves a strong 75.1% accuracy in zero-shot screenshot grounding.
arXiv Detail & Related papers (2024-11-26T14:29:47Z) - SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? [64.34184587727334]
We propose SWE-bench Multimodal to evaluate systems on their ability to fix bugs in visual, user-facing JavaScript software.
SWE-bench M features 617 task instances collected from 17 JavaScript libraries used for web interface design, diagramming, data visualization, syntax highlighting, and interactive mapping.
Our analysis finds that top-performing SWE-bench systems struggle with SWE-bench M, revealing limitations in visual problem-solving and cross-language generalization.
arXiv Detail & Related papers (2024-10-04T18:48:58Z) - Visual Haystacks: A Vision-Centric Needle-In-A-Haystack Benchmark [63.296342841358815]
Large Multimodal Models (LMMs) have made significant strides in visual question-answering for single images. The ability to process a large number of visual tokens does not guarantee effective retrieval and reasoning for multi-image question answering. We introduce MIRAGE, an open-source, lightweight visual-RAG framework that processes up to 10k images on a single 40G A100 GPU.
arXiv Detail & Related papers (2024-07-18T17:59:30Z) - Seeing is Believing: Vision-driven Non-crash Functional Bug Detection for Mobile Apps [26.96558418166514]
This paper proposes Trident, a novel vision-driven, multi-agent collaborative automated GUI testing approach for detecting non-crash functional bugs. Evaluated on 590 non-crash bugs against 12 baselines, Trident achieves a 14%-112% boost in average recall and a 108%-147% boost in average precision.
arXiv Detail & Related papers (2024-07-03T11:58:09Z) - VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning [6.035805925050573]
We introduce VGA, a fine-tuned model designed for comprehensive Graphical User Interface (GUI) understanding.
Our model aims to enhance the interpretation of visual data of GUI and reduce hallucinations.
Our dataset and fine-tuning script will be released soon.
arXiv Detail & Related papers (2024-06-20T07:24:43Z) - CogCoM: A Visual Language Model with Chain-of-Manipulations Reasoning [61.21923643289266]
Chain of Manipulations is a mechanism that enables Vision-Language Models to solve problems step-by-step with evidence. After training, models can solve various visual problems by actively eliciting intrinsic manipulations (e.g., grounding, zooming in) without involving external tools. Our trained model, CogCoM, achieves state-of-the-art performance across 9 benchmarks from 4 categories.
arXiv Detail & Related papers (2024-02-06T18:43:48Z) - Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models [55.508049882447395]
Large Multimodal Models (LMMs) have shown promise in vision-language tasks but struggle with high-resolution input and detailed scene understanding.
We introduce Monkey to enhance LMM capabilities.
Monkey processes input images by dividing them into uniform patches, each matching the size (e.g., 448x448) used to train the vision encoder.
It can handle higher resolutions up to 1344x896 pixels, enabling the detailed capture of complex visual information.
arXiv Detail & Related papers (2023-11-11T16:37:41Z) - LOIS: Looking Out of Instance Semantics for Visual Question Answering [17.076621453814926]
We propose a model framework without bounding boxes to understand the causal nexus of object semantics in images.
We implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information.
Our proposed attention model can further analyze salient image regions by focusing on important word-related questions.
arXiv Detail & Related papers (2023-07-26T12:13:00Z)