Related papers: LLM-CompDroid: Repairing Configuration Compatibility Bugs in Android Apps with Pre-trained Large Language Models

URL: http://arxiv.org/abs/2402.15078v1
Date: Fri, 23 Feb 2024 03:51:16 GMT
Title: LLM-CompDroid: Repairing Configuration Compatibility Bugs in Android Apps with Pre-trained Large Language Models
Authors: Zhijie Liu, Yutian Tang, Meiyun Li, Xin Jin, Yunfei Long, Liang Feng Zhang, Xiapu Luo
Abstract summary: We introduce the LLM-CompDroid framework, which combines the strengths of LLMs and traditional tools for bug resolution. Our experimental results demonstrate a significant enhancement in bug resolution performance by LLM-CompDroid. This innovative approach holds promise for advancing the reliability and robustness of Android applications.
Score: 34.23051590289707
License: http://creativecommons.org/licenses/by/4.0/
Abstract: XML configurations are integral to the Android development framework, particularly in the realm of UI display. However, these configurations can introduce compatibility issues (bugs), resulting in divergent visual outcomes and system crashes across various Android API versions (levels). In this study, we systematically investigate LLM-based approaches for detecting and repairing configuration compatibility bugs. Our findings highlight certain limitations of LLMs in effectively identifying and resolving these bugs, while also revealing their potential in addressing complex, hard-to-repair issues that traditional tools struggle with. Leveraging these insights, we introduce the LLM-CompDroid framework, which combines the strengths of LLMs and traditional tools for bug resolution. Our experimental results demonstrate a significant enhancement in bug resolution performance by LLM-CompDroid, with LLM-CompDroid-GPT-3.5 and LLM-CompDroid-GPT-4 surpassing the state-of-the-art tool, ConfFix, by at least 9.8% and 10.4% in both Correct and Correct@k metrics, respectively. This innovative approach holds promise for advancing the reliability and robustness of Android applications, making a valuable contribution to the field of software development.

Related papers

Framework-Aware Code Generation with API Knowledge Graph-Constructed Data: A Study on HarmonyOS [52.483888557864326]
APIKG4SYN is a framework designed to exploit API knowledge graphs for the construction of API-oriented question-code pairs. We build the first benchmark for HarmonyOS code generation using APIKG4SYN.
arXiv Detail & Related papers (2025-11-29T08:13:54Z)
Diagnosing and Resolving Android Applications Building Issues: An Empirical Study [4.9727667541752085]
This study conducts an empirical analysis of 200 open-source Android projects written in Java and Kotlin to diagnose and resolve build failures. We identified four primary types of build errors: environment issues, dependency and Gradle task errors, configuration problems, and syntax/API incompatibilities.
arXiv Detail & Related papers (2025-11-09T02:01:14Z)
Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools [11.19523991999335]
We introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions.
arXiv Detail & Related papers (2025-10-09T01:33:25Z)
Bridging Solidity Evolution Gaps: An LLM-Enhanced Approach for Smart Contract Compilation Error Resolution [2.967464333639626]
Solidity, the dominant smart contract language, has rapidly evolved with frequent version updates to enhance security, functionality, and developer experience. We conduct an empirical study to investigate the challenges in the Solidity version evolution and reveal that 81.68% of examined contracts encounter errors when compiled across different versions, with 86.92% of compilation errors. We introduce SMCFIXER, a novel framework that integrates expert knowledge retrieval with LLM-based repair mechanisms for Solidity compilation error resolution.
arXiv Detail & Related papers (2025-08-14T10:42:26Z)
A First Look at Bugs in LLM Inference Engines [20.52104110967081]
We present the first empirical study on bugs in LLM inference engines. We mine official repositories of 5 widely adopted LLM inference engines, constructing a comprehensive dataset of 929 real-world bugs. Our findings reveal six major bug symptoms and a taxonomy of 28 root causes, shedding light on the key challenges in bug detection and location.
arXiv Detail & Related papers (2025-06-11T13:25:36Z)
PATCH: Empowering Large Language Model with Programmer-Intent Guidance and Collaborative-Behavior Simulation for Automatic Bug Fixing [34.768989900184636]
Bug fixing holds significant importance in software development and maintenance. Recent research has made substantial strides in exploring the potential of large language models (LLMs) for automatically resolving software bugs.
arXiv Detail & Related papers (2025-01-27T15:43:04Z)
LLM2: Let Large Language Models Harness System 2 Reasoning [65.89293674479907]
Large language models (LLMs) have exhibited impressive capabilities across a myriad of tasks, yet they occasionally yield undesirable outputs. We introduce LLM2, a novel framework that combines an LLM with a process-based verifier. The LLM is responsible for generating plausible candidates, while the verifier provides timely process-based feedback to distinguish desirable and undesirable outputs.
arXiv Detail & Related papers (2024-12-29T06:32:36Z)
ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs [77.79172008184415]
TOOLSCAN is a new benchmark to identify error patterns in LLM output on tool-use tasks. We show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use these insights from TOOLSCAN to guide their error mitigation strategies.
arXiv Detail & Related papers (2024-11-20T18:56:22Z)
Robust Planning with Compound LLM Architectures: An LLM-Modulo Approach [17.888087571630933]
Previous work has attempted to boost Large Language Model (LLM) performance on planning and scheduling tasks. We present a technical evaluation of a compound LLM architecture--the LLM-Modulo framework.
arXiv Detail & Related papers (2024-11-20T02:04:09Z)
CONGRA: Benchmarking Automatic Conflict Resolution [3.9910625211670485]
ConGra is a benchmarking scheme designed to evaluate the performance of software merging tools. We build a large-scale evaluation dataset based on 44,948 conflicts from 34 real-world projects.
arXiv Detail & Related papers (2024-09-21T12:21:41Z)
CodeEditorBench: Evaluating Code Editing Capability of Large Language Models [49.387195629660994]
Large Language Models (LLMs) for code are rapidly evolving, with code editing emerging as a critical capability. We introduce CodeEditorBench, an evaluation framework designed to rigorously assess the performance of LLMs in code editing tasks. We curate diverse coding challenges and scenarios from five sources, covering various programming languages, complexity levels, and editing tasks.
arXiv Detail & Related papers (2024-04-04T15:49:49Z)
DevBench: A Comprehensive Benchmark for Software Development [72.24266814625685]
DevBench is a benchmark that evaluates large language models (LLMs) across various stages of the software development lifecycle. Empirical studies show that current LLMs, including GPT-4-Turbo, fail to solve the challenges presented within DevBench. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.
arXiv Detail & Related papers (2024-03-13T15:13:44Z)
PPTC-R benchmark: Towards Evaluating the Robustness of Large Language Models for PowerPoint Task Completion [96.47420221442397]
We construct adversarial user instructions by attacking user instructions at sentence, semantic, and multi-language levels. We test 3 closed-source and 4 open-source LLMs using a benchmark that incorporates robustness settings. We find that GPT-4 exhibits the highest performance and strong robustness in our benchmark.
arXiv Detail & Related papers (2024-03-06T15:33:32Z)
Learning to Edit: Aligning LLMs with Knowledge Editing [101.96620267293731]
We propose a Learning to Edit (LTE) framework, focusing on teaching large language models to apply updated knowledge into input questions. LTE features a two-phase process: (i) the Alignment Phase, which fine-tunes LLMs on a meticulously curated parallel dataset to make reliable, in-scope edits. We demonstrate LTE's superiority in knowledge editing performance, robustness in both batch and sequential editing, minimal interference on general tasks, and rapid editing speeds.
arXiv Detail & Related papers (2024-02-19T07:45:17Z)
Exploring Large Language Models in Resolving Environment-Related Crash Bugs: Localizing and Repairing [36.4673637256627]
Large language models (LLMs) have shown promise in software engineering tasks. We conduct the first comprehensive study to assess the capability of LLMs in resolving real-world environment crash bugs. Our findings reveal that localization is the primary challenge for resolving code crashes, while repair poses a greater challenge for environment crashes.
arXiv Detail & Related papers (2023-12-16T13:41:04Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
Evaluating Diverse Large Language Models for Automatic and General Bug Reproduction [12.851941377433285]
Large language models (LLMs) have been demonstrated to be adept at natural language processing and code generation. Our proposed technique LIBRO could successfully reproduce about one-third of all bugs in the widely used Defects4J benchmark.
arXiv Detail & Related papers (2023-11-08T08:42:30Z)
ChatGPT vs SBST: A Comparative Assessment of Unit Test Suite Generation [25.200080365022153]
We present a systematic comparison of test suites generated by the ChatGPT LLM and the state-of-the-art SBST tool EvoSuite. Our comparison is based on several critical factors, including correctness, readability, code coverage, and bug detection capability.
arXiv Detail & Related papers (2023-07-02T15:09:40Z)
Towards Generating Functionally Correct Code Edits from Natural Language Issue Descriptions [11.327913840111378]
We introduce Defects4J-NL2Fix, a dataset of 283 Java programs from the popular Defects4J dataset augmented with high-level descriptions of bug fixes. We empirically evaluate the performance of several state-of-the-art LLMs for this task.
arXiv Detail & Related papers (2023-04-07T18:58:33Z)
