Related papers: Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools

Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools

URL: http://arxiv.org/abs/2510.08640v1
Date: Thu, 09 Oct 2025 01:33:25 GMT
Title: Automating Android Build Repair: Bridging the Reasoning-Execution Gap in LLM Agents with Domain-Specific Tools
Authors: Ha Min Son, Huan Ren, Xin Liu, Zhe Zhao,
Abstract summary: We introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects.<n>Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible.<n>We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions.
Score: 11.19523991999335
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Android is the largest mobile platform, yet automatically building applications remains a practical challenge. While Large Language Models (LLMs) show promise for code repair, their use for fixing Android build errors remains underexplored. To address this gap, we first introduce AndroidBuildBench, a benchmark of 1,019 build failures curated from the commit histories of 43 open-source Android projects. Each problem is paired with a verified solution from a subsequent commit, ensuring that fixes are feasible. Second, we propose GradleFixer, an LLM agent with domain-specific tools for inspecting and manipulating the Gradle build environment. GradleFixer achieves a resolve rate of 81.4% (pass@1), significantly outperforming a state-of-the-art coding agent that relies on a general-purpose shell. GradleFixer's success suggests that while LLMs possess the high-level knowledge to solve these failures, they struggle to translate this knowledge into effective low-level actions using a general-purpose shell. We demonstrate the effectiveness of a strategy we term Tool Bridging, which replaces general-purpose shell commands with domain-aware abstractions. We hypothesize this approach works through two mechanisms: 1) it provides tools in an API-like format that LLMs use more reliably, and 2) it constrains the action space to relevant operations. This approach bridges the gap between the model's high-level reasoning and effective low-level execution.

Related papers

Gecko: A Simulation Environment with Stateful Feedback for Refining Agent Tool Calls [56.407063247662336]
We introduce Gecko, a comprehensive environment that simulates tool responses using a combination of rules and LLMs.<n>GATS consistently improves the tool calling performance of various LLMs including GPT-4o, GPT-5, and Gemini-3.0-pro.
arXiv Detail & Related papers (2026-02-22T15:02:00Z)
Verification-Guided Context Optimization for Tool Calling via Hierarchical LLMs-as-Editors [6.990045323115733]
We propose VGCO, a framework that uses large language models as editors to automatically refine tool-related documentation and knowledge base context.<n>First, they use a hierarchical structure that naturally integrates into the tool-calling workflow.<n>Second, they are state-aware, action-specific, and verification-guided, which constrains the search space and enables efficient, targeted improvements.
arXiv Detail & Related papers (2025-12-15T19:48:21Z)
Diagnosing and Resolving Android Applications Building Issues: An Empirical Study [4.9727667541752085]
This study conducts an empirical analysis of 200 open-source Android projects written in Java and Kotlin to diagnose and resolve build failures.<n>We identified four primary types of build errors: environment issues, dependency and Gradle task errors, configuration problems, and syntax/API incompatibilities.
arXiv Detail & Related papers (2025-11-09T02:01:14Z)
A Systematic Study of Time Limit Exceeded Errors in Online Programming Assignments [3.5043598215781393]
This paper presents the first large-scale empirical study of TLE errors in online programming.<n>We analyze 1000 Codeforces submissions with TLE errors, classified their root causes, and traced how users attempted to fix them.<n>We introduce Nettle, the first automated repair tool specifically designed for TLE errors, and Nettle-Eval, the first framework for evaluating TLE repairs.
arXiv Detail & Related papers (2025-10-16T06:18:55Z)
Where LLM Agents Fail and How They can Learn From Failures [62.196870049524364]
Large Language Model (LLM) agents have shown promise in solving complex, multi-step tasks.<n>They amplify vulnerability to cascading failures, where a single root-cause error propagates through subsequent decisions.<n>Current systems lack a framework that can comprehensively understand agent error in a modular and systemic way.<n>We introduce the AgentErrorTaxonomy, a modular classification of failure modes spanning memory, reflection, planning, action, and system-level operations.
arXiv Detail & Related papers (2025-09-29T18:20:27Z)
Learning to Use Tools via Cooperative and Interactive Agents [58.77710337157665]
Tool learning empowers large language models (LLMs) as agents to use external tools and extend their utility. We propose ConAgents, a Cooperative and interactive Agents framework, which coordinates three specialized agents for tool selection, tool execution, and action calibration separately. Our experiments on three datasets show that the LLMs, when equipped with ConAgents, outperform baselines with substantial improvement.
arXiv Detail & Related papers (2024-03-05T15:08:16Z)
LLM-CompDroid: Repairing Configuration Compatibility Bugs in Android Apps with Pre-trained Large Language Models [34.23051590289707]
We introduce the LLM-CompDroid framework, which combines the strengths of LLMs and traditional tools for bug resolution. Our experimental results demonstrate a significant enhancement in bug resolution performance by LLM-CompDroid. This innovative approach holds promise for advancing the reliability and robustness of Android applications.
arXiv Detail & Related papers (2024-02-23T03:51:16Z)
Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub [79.31134731122462]
We introduce OpenAct benchmark to evaluate the open-domain task-solving capability, built on human expert consultation and repositories in GitHub.<n>We present OpenAgent, a novel LLM-based agent system that can tackle evolving queries in open domains through autonomously integrating specialized tools from GitHub.
arXiv Detail & Related papers (2023-12-28T15:47:30Z)
ML-Bench: Evaluating Large Language Models and Agents for Machine Learning Tasks on Repository-Level Code [76.84199699772903]
ML-Bench is a benchmark rooted in real-world programming applications that leverage existing code repositories to perform tasks. To evaluate both Large Language Models (LLMs) and AI agents, two setups are employed: ML-LLM-Bench for assessing LLMs' text-to-code conversion within a predefined deployment environment, and ML-Agent-Bench for testing autonomous agents in an end-to-end task execution within a Linux sandbox environment.
arXiv Detail & Related papers (2023-11-16T12:03:21Z)
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs) It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
AutoDroid: LLM-powered Task Automation in Android [32.241570727243534]
We introduce AutoDroid, a mobile task automation system capable of handling arbitrary tasks on any Android application without manual efforts. The main components include a functionality-aware UI representation method that bridges the UI with the LLM. We evaluate its performance on a new benchmark for memory-augmented Android task automation with 158 common tasks.
arXiv Detail & Related papers (2023-08-29T13:02:30Z)
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs [104.37772295581088]
Open-source large language models (LLMs), e.g., LLaMA, remain significantly limited in tool-use capabilities. We introduce ToolLLM, a general tool-usetuning encompassing data construction, model training, and evaluation. We first present ToolBench, an instruction-tuning framework for tool use, which is constructed automatically using ChatGPT.
arXiv Detail & Related papers (2023-07-31T15:56:53Z)
DexBERT: Effective, Task-Agnostic and Fine-grained Representation Learning of Android Bytecode [0.40571357119162643]
We propose a BERT-like Language Model dedicated to representing chunks of DEX bytecode, the main binary format used in Android applications. We empirically assess whether DexBERT is able to model the DEX language and evaluate the suitability of our model in three distinct class-level software engineering tasks.
arXiv Detail & Related papers (2022-12-12T15:32:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.