Related papers: Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

URL: http://arxiv.org/abs/2511.00780v1
Date: Sun, 02 Nov 2025 03:23:07 GMT
Title: Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
Authors: Chenyu Zhao, Shenglin Zhang, Zeshun Huang, Weilin Jin, Yongqian Sun, Dan Pei, Chaoyun Zhang, Qingwei Lin, Chetan Bansal, Saravan Rajmohan, Minghua Ma
Abstract summary: Large language models (LLMs) have shown growing potential in software engineering. Few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs).
Score: 44.748487030119
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models (LLMs) have shown growing potential in software engineering, yet few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs). Cross-ISA migration, such as between x86_64 and aarch64, requires handling complex dependencies, heterogeneous toolchains, and long build logs while ensuring executable verification. To address this challenge, we present Build-bench, an end-to-end benchmark that systematically evaluates the capability of LLMs to repair build failures in cross-ISA settings. Build-bench collects 268 real-world failed packages and integrates auxiliary tools including Structure Extraction, File Content Extraction, Content Modification, and Build Verification to support autonomous, tool-augmented reasoning. The repair process operates in an iterative loop where, upon failure, the model receives updated build logs and previous repair outcomes to refine subsequent attempts. Through a comparative evaluation of six representative LLMs, Build-bench reveals that current models achieve a maximum build success rate of 63% and tool usage patterns differ significantly across models. By coupling real build environments with verifiable outcomes, Build-bench establishes the first architecture-aware benchmark for studying LLM-based software build and repair.
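The iterative repair loop described in the abstract can be illustrated with a minimal Python sketch. This is not the Build-bench implementation: `repair_loop`, `Attempt`, `fake_model`, and `fake_build` are hypothetical names introduced here, and the stubs stand in for a real LLM call and a real build environment. The sketch only shows the control flow, in which each failed build feeds its log and the record of earlier attempts back into the next proposal.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Attempt:
    """One repair attempt: the proposed patch and the build outcome."""
    patch: str
    success: bool
    log: str


def repair_loop(
    propose_patch: Callable[[str, List[Attempt]], str],
    run_build: Callable[[str], Tuple[bool, str]],
    initial_log: str,
    max_iterations: int = 3,
) -> List[Attempt]:
    """Retry until the build succeeds or the iteration budget is spent.

    Both callables are assumptions for illustration: `propose_patch`
    stands in for the model, `run_build` for build verification.
    """
    history: List[Attempt] = []
    log = initial_log
    for _ in range(max_iterations):
        # The model sees the latest build log plus all earlier outcomes.
        patch = propose_patch(log, history)
        success, log = run_build(patch)
        history.append(Attempt(patch, success, log))
        if success:
            break
    return history


def fake_model(log: str, history: List[Attempt]) -> str:
    # Hypothetical stand-in for an LLM: changes strategy after a failure.
    return "add -march flag" if not history else "patch spec file"


def fake_build(patch: str) -> Tuple[bool, str]:
    # Hypothetical stand-in for a build: only one patch succeeds.
    ok = patch == "patch spec file"
    return ok, ("build succeeded" if ok else "error: unknown target aarch64")


history = repair_loop(fake_model, fake_build, "error: initial failure")
for attempt in history:
    print(attempt.patch, "->", "ok" if attempt.success else "failed")
```

Passing the full attempt history, rather than only the latest log, is one plausible way to let the model avoid repeating a patch that has already failed.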

Related papers

TimeMachine-bench: A Benchmark for Evaluating Model Capabilities in Repository-Level Migration Tasks [12.573674060643787]
TimeMachine-bench is a benchmark designed to evaluate software migration in real-world Python projects. Our benchmark consists of GitHub repositories whose tests begin to fail in response to dependency updates.
arXiv Detail & Related papers (2026-01-30T05:42:45Z)
ArchAgent: Scalable Legacy Software Architecture Recovery with LLMs [44.137226823695066]
ArchAgent is a scalable agent-based framework that combines static analysis, adaptive code segmentation, and LLM-powered synthesis. It reconstructs multiview, business-aligned architectures across repositories. ArchAgent introduces scalable diagram generation with contextual pruning and integrates cross-repository data to identify business-critical modules.
arXiv Detail & Related papers (2026-01-19T12:39:05Z)
A Benchmark for Language Models in Real-World System Building [56.549267258789904]
Cross-ISA software package repair is a critical task for ensuring the reliability of software deployment and the stability of modern operating systems. We introduce a new benchmark designed for software package build repair across diverse architectures and languages. We evaluate six state-of-the-art LLMs on the benchmark, and the results show that cross-ISA software package repair remains difficult and requires further advances.
arXiv Detail & Related papers (2026-01-19T10:30:46Z)
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software [39.43177863341685]
Existing methods rely on manually curated rules and cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts with Large Language Models (LLMs) evaluated only a selected subset of highly rated OSS. We propose a more challenging and realistic benchmark, BUILD-BENCH, comprising OSS that are more diverse in quality, scale, and characteristics.
arXiv Detail & Related papers (2025-09-27T03:02:46Z)
Evaluating the Limitations of Local LLMs in Solving Complex Programming Challenges [0.31498833540989407]
This study examines the performance of open-source, locally hosted large language models (LLMs) in handling complex programming tasks. Building on the original Framework for AI-driven Code Generation Evaluation (FACE), the authors retrofit the pipeline to work entirely offline. Results show that the overall pass@1 accuracy is modest for the local models, with the best models performing at approximately half the acceptance rate of the proprietary models.
arXiv Detail & Related papers (2025-09-18T14:13:30Z)
VerlTool: Towards Holistic Agentic Reinforcement Learning with Tool Use [78.29315418819074]
We introduce VerlTool, a unified and modular framework that addresses limitations through systematic design principles. Our framework formalizes ARLT as multi-turn trajectories with multi-modal observation tokens (text/image/video), extending beyond single-turn RLVR paradigms. The modular plugin architecture enables rapid tool integration requiring only lightweight Python definitions.
arXiv Detail & Related papers (2025-09-01T01:45:18Z)
MLE-Dojo: Interactive Environments for Empowering LLM Agents in Machine Learning Engineering [57.156093929365255]
MLE-Dojo is a Gym-style framework for systematically training, evaluating, and improving autonomous large language model (LLM) agents. MLE-Dojo covers diverse, open-ended MLE tasks carefully curated to reflect realistic engineering scenarios. Its fully executable environment supports comprehensive agent training via both supervised fine-tuning and reinforcement learning.
arXiv Detail & Related papers (2025-05-12T17:35:43Z)
APE-Bench I: Towards File-level Automated Proof Engineering of Formal Math Libraries [5.227446378450704]
APE-Bench I is the first realistic benchmark built from real-world commit histories of Mathlib4. Eleanstic is a scalable parallel verification infrastructure optimized for proof checking across multiple versions of Mathlib.
arXiv Detail & Related papers (2025-04-27T05:04:02Z)
Large Language Model Critics for Execution-Free Evaluation of Code Changes [5.1973075342632535]
Large language models (LLMs) offer a promising way to automate software engineering tasks. Existing metrics for evaluating such automation, mainly build status and occasionally log analysis, are too sparse and limited to provide the information needed to assess the quality of the changes made. In this work, we designed LLM-based critics to derive well-structured and rigorous intermediate/step-level, execution-free evaluation proxies for the executability of code changes.
arXiv Detail & Related papers (2025-01-28T02:38:56Z)
CRAFT: Customizing LLMs by Creating and Retrieving from Specialized Toolsets [75.64181719386497]
We present CRAFT, a tool creation and retrieval framework for large language models (LLMs). It creates toolsets specifically curated for the tasks and equips LLMs with a component that retrieves tools from these sets to enhance their capability to solve complex tasks. Our method is designed to be flexible and offers a plug-and-play approach to adapt off-the-shelf LLMs to unseen domains and modalities, without any finetuning.
arXiv Detail & Related papers (2023-09-29T17:40:26Z)
CREATOR: Tool Creation for Disentangling Abstract and Concrete Reasoning of Large Language Models [74.22729793816451]
Large Language Models (LLMs) have made significant progress in utilizing tools, but their ability is limited by API availability. We propose CREATOR, a novel framework that enables LLMs to create their own tools using documentation and code realization. We evaluate CREATOR on the MATH and TabMWP benchmarks, the former consisting of challenging math competition problems.
arXiv Detail & Related papers (2023-05-23T17:51:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.