A Benchmark for Language Models in Real-World System Building
- URL: http://arxiv.org/abs/2601.12927v1
- Date: Mon, 19 Jan 2026 10:30:46 GMT
- Title: A Benchmark for Language Models in Real-World System Building
- Authors: Weilin Jin, Chenyu Zhao, Zeshun Huang, Chaoyun Zhang, Qingwei Lin, Chetan Bansal, Saravan Rajmohan, Shenglin Zhang, Yongqian Sun, Dan Pei, Yifan Wu, Tong Jia, Ying Li, Zhonghai Wu, Minghua Ma
- Abstract summary: Cross-ISA software package repair is a critical task for ensuring the reliability of software deployment and the stability of modern operating systems. We introduce a new benchmark designed for software package build repair across diverse architectures and languages. We evaluate six state-of-the-art LLMs on the benchmark, and the results show that cross-ISA software package repair remains difficult and requires further advances.
- Score: 56.549267258789904
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: During migration across instruction set architectures (ISAs), software package build repair is a critical task for ensuring the reliability of software deployment and the stability of modern operating systems. While Large Language Models (LLMs) have shown promise in tackling this challenge, prior work has primarily focused on a single ISA and homogeneous programming languages. To address this limitation, we introduce a new benchmark designed for software package build repair across diverse architectures and languages. Comprising 268 real-world software package build failures, the benchmark provides a standardized evaluation pipeline. We evaluate six state-of-the-art LLMs on the benchmark, and the results show that cross-ISA software package repair remains difficult and requires further advances. By systematically exposing this challenge, the benchmark establishes a foundation for advancing future methods aimed at improving software portability and bridging architectural gaps.
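The abstract does not detail the evaluation pipeline, so the following is only a minimal sketch of how an LLM-driven build-repair benchmark of this kind could be scored: the model is shown a failing build log, proposes a patch, and the package is rebuilt until the build succeeds or a round budget is exhausted. All names here (`BuildFailure`, `propose_patch`, `attempt_build`) are hypothetical, not the benchmark's actual API.

```python
# Minimal sketch of a cross-ISA build-repair evaluation loop.
# All names (BuildFailure, propose_patch, apply_patch, attempt_build) are
# assumptions for illustration, not the benchmark's actual interface.
import subprocess
from dataclasses import dataclass
from typing import Callable

@dataclass
class BuildFailure:
    package: str           # e.g. "zlib"
    target_isa: str        # e.g. "riscv64"
    workdir: str           # checkout of the failing package source
    build_cmd: list[str]   # command that reproduces the failure
    error_log: str         # captured compiler/linker output

def attempt_build(f: BuildFailure) -> tuple[bool, str]:
    """Re-run the package build and return (success, combined log)."""
    proc = subprocess.run(f.build_cmd, cwd=f.workdir,
                          capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr

def apply_patch(f: BuildFailure, patch: str) -> None:
    """Apply a unified diff proposed by the model to the package source."""
    subprocess.run(["patch", "-p1"], cwd=f.workdir,
                   input=patch, text=True, check=True)

def evaluate(propose_patch: Callable[[BuildFailure, str], str],
             failures: list[BuildFailure], max_rounds: int = 3) -> float:
    """Fraction of build failures repaired within max_rounds of feedback."""
    repaired = 0
    for f in failures:
        log = f.error_log
        for _ in range(max_rounds):
            patch = propose_patch(f, log)   # LLM call happens here
            apply_patch(f, patch)
            ok, log = attempt_build(f)
            if ok:
                repaired += 1
                break
    return repaired / len(failures)
```

Under this reading, the model sees the fresh build log after each failed attempt, mirroring how a developer iterates on cross-ISA build errors, and the reported difficulty shows up as a low repaired fraction even with several rounds of feedback.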
Related papers
- Towards Comprehensive Benchmarking Infrastructure for LLMs In Software Engineering [19.584762693453893]
BEHELM is a holistic benchmarking infrastructure that unifies software-scenario specification with multi-metric evaluation. Our goal is to reduce the overhead currently required to construct benchmarks while enabling a fair, realistic, and future-proof assessment of LLMs in software engineering.
arXiv Detail & Related papers (2026-01-28T21:55:10Z)
- Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation [4.45354703148321]
Assembly-to-source code translation is a critical task in reverse engineering, cybersecurity, and software maintenance. We present the first comprehensive evaluation of five state-of-the-art large language models on assembly-to-source translation.
arXiv Detail & Related papers (2025-11-28T12:40:30Z)
- SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models [59.90381306452982]
Evaluating large language models (LLMs) for software engineering has been limited by narrow task coverage, language bias, and insufficient alignment with real-world developer workflows. We introduce SWE-Compass, a comprehensive benchmark that unifies heterogeneous code-related evaluations into a structured and production-aligned framework. SWE-Compass spans 8 task types, 8 programming scenarios, and 10 programming languages, with 2,000 high-quality instances curated from authentic GitHub pull requests.
arXiv Detail & Related papers (2025-11-07T18:01:32Z)
- Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems [44.748487030119]
Large language models (LLMs) have shown growing potential in software engineering. Few benchmarks evaluate their ability to repair software during migration across instruction set architectures (ISAs).
arXiv Detail & Related papers (2025-11-02T03:23:07Z)
- BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software [39.43177863341685]
Existing methods rely on manually curated rules and cannot adapt to OSS that requires customized configuration or environment setup. Recent attempts using Large Language Models (LLMs) performed selective evaluation on a subset of highly rated OSS. We propose a more challenging and realistic benchmark, BUILD-BENCH, comprising OSS that are more diverse in quality, scale, and characteristics.
arXiv Detail & Related papers (2025-09-27T03:02:46Z)
- MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks [56.34018316319873]
We propose MERA Code, a benchmark for evaluating the latest code-generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages.
arXiv Detail & Related papers (2025-07-16T14:31:33Z)
- Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees [0.03994567502796063]
We introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models with the rigor of established software-testing constructs. Our method uses an LLM to generate candidate translations from one ISA to another, and embeds these translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs.
arXiv Detail & Related papers (2025-06-17T15:06:54Z)
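To make the testing-guarantees idea above concrete, here is a minimal sketch of how a candidate translation could be validated against a C unit-test harness. It assumes an aarch64 cross toolchain and qemu-aarch64 are available; the prompt, helper names, and toolchain choices are illustrative assumptions, not GG's actual implementation.

```python
# Illustrative sketch of testing-backed CISC-to-RISC transpilation in the
# spirit of GG; prompt, helper names, and toolchain are assumptions, not the
# paper's implementation.
import os
import subprocess
import tempfile

def transpile(x86_asm: str, llm) -> str:
    """Ask a model for an AArch64 translation of an x86-64 assembly routine."""
    prompt = ("Translate the following x86-64 assembly to AArch64 assembly, "
              "preserving the function's calling convention:\n" + x86_asm)
    return llm.complete(prompt)    # hypothetical LLM client

def passes_unit_tests(arm_asm: str, harness_c: str) -> bool:
    """Assemble the candidate, link it against a C unit-test harness, and run
    it under emulation (assumes aarch64-linux-gnu-gcc and qemu-aarch64)."""
    with tempfile.TemporaryDirectory() as d:
        asm_path = os.path.join(d, "candidate.s")
        harness_path = os.path.join(d, "harness.c")
        bin_path = os.path.join(d, "test_bin")
        with open(asm_path, "w") as fh:
            fh.write(arm_asm)
        with open(harness_path, "w") as fh:
            fh.write(harness_c)
        build = subprocess.run(
            ["aarch64-linux-gnu-gcc", "-static", asm_path, harness_path,
             "-o", bin_path],
            capture_output=True)
        if build.returncode != 0:
            return False               # candidate does not even assemble
        run = subprocess.run(["qemu-aarch64", bin_path], capture_output=True)
        return run.returncode == 0     # harness exits 0 iff all tests pass
```

Measuring coverage of such a harness (the >98% figure above) would additionally bound how much of the translated routine's behavior the tests actually exercise.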
- BinMetric: A Comprehensive Binary Analysis Benchmark for Large Language Models [50.17907898478795]
We introduce BinMetric, a benchmark designed to evaluate the performance of large language models on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks. Our empirical study on this benchmark investigates the binary analysis capabilities of various state-of-the-art LLMs, revealing their strengths and limitations in this field.
arXiv Detail & Related papers (2025-05-12T08:54:07Z)
- LangProBe: a Language Programs Benchmark [53.81811700561928]
We introduce LangProBe, the first large-scale benchmark for evaluating the architectures and optimization strategies for language programs. We find that optimized language programs offer strong cost-quality improvements over raw calls to models, but simultaneously demonstrate that human judgment is still necessary for best performance.
arXiv Detail & Related papers (2025-02-27T17:41:49Z)
- Towards a Probabilistic Framework for Analyzing and Improving LLM-Enabled Software [0.0]
Ensuring the reliability of large language model (LLM)-enabled systems is a significant challenge in software engineering. We propose a probabilistic framework for systematically analyzing and improving these systems.
arXiv Detail & Related papers (2025-01-10T22:42:06Z)