Variability-Aware Detection and Repair of Compilation Errors Using Foundation Models in Configurable Systems
- URL: http://arxiv.org/abs/2601.16755v1
- Date: Fri, 23 Jan 2026 13:59:34 GMT
- Title: Variability-Aware Detection and Repair of Compilation Errors Using Foundation Models in Configurable Systems
- Authors: Rohit Gheyi, Lucas Albuquerque, Márcio Ribeiro, Eduardo Almeida, Danyllo Albuquerque, Mirko Perkusich,
- Abstract summary: We show that foundation models can effectively identify variability-induced compilation errors. For compilation error repair, GPT-OSS-20B produced compilable fixes in over 70% of the cases. Our findings indicate that current state-of-the-art foundation models provide a practical and low-effort complement to traditional variability-aware analyses.
- Score: 1.2560438996036287
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern software systems often rely on conditional compilation to support optional features and multiple deployment scenarios. In configurable systems, compilation errors may arise only under specific combinations of features, remaining hidden during development and testing. Such variability-induced errors are difficult to detect in practice, as traditional compilers analyze only a single configuration at a time, while existing variability-aware tools typically require complex setup and incur high analysis costs. In this article, we present an empirical study on the use of foundation models to detect and fix compilation errors caused by feature variability in configurable C systems. We evaluate GPT-OSS-20B and GEMINI 3 PRO, and compare them with TYPECHEF, a state-of-the-art variability-aware parser. Our evaluation considers two complementary settings: 5,000 small configurable systems designed to systematically exercise variability-induced compilation behavior, comprising both systems with and without compilation errors, and 14 real-world GitHub commits, as well as an additional set of mutation testing scenarios (42). Our results show that foundation models can effectively identify variability-induced compilation errors. On small configurable systems, GPT-OSS-20B achieved a precision of 0.97, recall of 0.90, and accuracy of 0.94, substantially increasing detection coverage compared to TYPECHEF, and exhibiting performance comparable to GEMINI 3. For compilation error repair, GPT-OSS-20B produced compilable fixes in over 70% of the cases. In the analysis of real commits, CHATGPT-5.2 detected all injected faults except for two cases and identified a potential real compilation bug in a Linux commit with more than 1,000 modified lines. Our findings indicate that current state-of-the-art foundation models provide a practical and low-effort complement to traditional variability-aware analyses.
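To make the notion of a variability-induced compilation error concrete, here is a minimal sketch in C. It is not taken from the paper's dataset: the feature macros `FEATURE_LOG` and `FEATURE_STATS`, and the functions `handle_event`, `handle_event_fixed`, and `self_check`, are hypothetical names chosen for illustration. The file compiles when neither macro is defined, or when both are, but `handle_event` fails to compile with `-DFEATURE_LOG` alone because it uses a variable that is declared only under `FEATURE_STATS`.

```c
#include <assert.h>
#include <stdio.h>

/* FEATURE_LOG and FEATURE_STATS are hypothetical feature macros used
 * only for this illustration; they are not from the paper. */

#ifdef FEATURE_STATS
static int event_count = 0; /* declared only when FEATURE_STATS is enabled */
#endif

/* Returns the id it was given so a caller can check the call happened. */
int handle_event(int id) {
#ifdef FEATURE_LOG
    /* BUG: event_count exists only under FEATURE_STATS.
     * This block compiles with both macros defined, and is skipped when
     * FEATURE_LOG is off, but it is a compilation error with
     * -DFEATURE_LOG alone (undeclared identifier). */
    event_count++;
    printf("event %d (total %d)\n", id, event_count);
#endif
    return id;
}

/* One possible repair: guard the use of event_count with the same
 * condition as its declaration, and fall back to plain logging. */
int handle_event_fixed(int id) {
#ifdef FEATURE_LOG
#ifdef FEATURE_STATS
    event_count++;
    printf("event %d (total %d)\n", id, event_count);
#else
    printf("event %d\n", id);
#endif
#endif
    return id;
}

/* Sanity check for whichever configuration is currently being compiled. */
void self_check(void) {
    assert(handle_event(7) == 7);
    assert(handle_event_fixed(7) == 7);
}
```

A conventional compiler checks only the configuration selected on the command line, so the bug stays hidden unless someone happens to build with `FEATURE_LOG` on and `FEATURE_STATS` off. With two features there are four configurations to try; the count doubles with each additional independent feature, which is why the abstract describes such errors as hard to detect in practice.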
Related papers
- Outrunning LLM Cutoffs: A Live Kernel Crash Resolution Benchmark for All [57.23434868678603]
Live-kBench is an evaluation framework for self-evolving benchmarks that scrapes and evaluates agents on freshly discovered kernel bugs. kEnv is an agent-agnostic crash-resolution environment for kernel compilation, execution, and feedback. Using kEnv, we benchmark three state-of-the-art agents, showing that they resolve 74% of crashes on the first attempt.
arXiv Detail & Related papers (2026-02-02T19:06:15Z)
- Isolating Compiler Faults via Multiple Pairs of Adversarial Compilation Configurations [13.835199384689645]
MultiConf is a novel approach that automatically isolates compiler faults by constructing multiple pairs of adversarial compilation configurations. We evaluate MultiConf on a benchmark of 60 real-world GCC compiler bugs. In particular, MultiConf successfully localizes 27 out of 60 bugs at the Top-1 file level, representing improvements of 35.0% and 28.6% over the two state-of-the-art approaches.
arXiv Detail & Related papers (2025-12-27T09:40:35Z)
- Metacognitive Self-Correction for Multi-Agent System via Prototype-Guided Next-Execution Reconstruction [58.51530390018909]
Large Language Model based multi-agent systems excel at collaborative problem solving but remain brittle to cascading errors. We present MASC, a metacognitive framework that endows MAS with real-time, unsupervised, step-level error detection and self-correction.
arXiv Detail & Related papers (2025-10-16T05:35:37Z)
- EdiVal-Agent: An Object-Centric Framework for Automated, Fine-Grained Evaluation of Multi-Turn Editing [170.71134330650796]
EdiVal-Agent is an object-centric evaluation framework for instruction-based image editing. It is designed to assess not only standard single-turn but also multi-turn instruction-based editing with precision. We build EdiVal-Bench, a benchmark covering 9 instruction types and 13 state-of-the-art editing models spanning in-context, flow-matching, and diffusion paradigms.
arXiv Detail & Related papers (2025-09-16T17:45:39Z)
- From Embeddings to Equations: Genetic-Programming Surrogates for Interpretable Transformer Classification [9.17282078449475]
We study symbolic surrogate modeling of frozen Transformer embeddings to obtain compact, auditable classifiers with calibrated probabilities. For five benchmarks (SST2G, 20NG, MNIST, CIFAR10, MSC17), embeddings from ModernBERT, DINOv2, and SigLIP are partitioned on the training set into disjoint, information-preserving views. A cooperative multi-population genetic program (MEGP) then learns additive, closed-form logit programs over these views.
arXiv Detail & Related papers (2025-09-16T02:17:04Z)
- Improving Compiler Bug Isolation by Leveraging Large Language Models [14.679589768900621]
We propose an innovative compiler bug isolation approach named AutoCBI. We evaluate AutoCBI against state-of-the-art approaches (DiWi, RecBi and FuseFL) on 120 real-world bugs from the widely-used GCC and LLVM compilers. Specifically, AutoCBI isolates 66.67%/69.23%, 300%/340%, and 100%/57.14% more bugs than RecBi, DiWi, and FuseFL, respectively, in the Top-1 ranked results for GCC/LLVM.
arXiv Detail & Related papers (2025-06-21T09:09:30Z)
- Fine-Grained 1-Day Vulnerability Detection in Binaries via Patch Code Localization [12.73365645156957]
1-day vulnerabilities in binaries have become a major threat to software security. Patch presence testing is one of the effective ways to detect such vulnerabilities. We propose a novel approach named PLocator, which leverages stable values from both the patch code and its context.
arXiv Detail & Related papers (2025-01-29T04:35:37Z)
- Evaluating the Capability of LLMs in Identifying Compilation Errors in Configurable Systems [1.2928804566606342]
This study evaluates the efficacy of Large Language Models (LLMs), specifically ChatGPT4, Le Chat Mistral and Gemini Advanced 1.5.
ChatGPT4 successfully identified most compilation errors in individual products.
Le Chat Mistral and Gemini Advanced 1.5 detected some of them.
arXiv Detail & Related papers (2024-07-26T21:07:21Z)
- Diagnosing and Rectifying Fake OOD Invariance: A Restructured Causal Approach [51.012396632595554]
Invariant representation learning (IRL) encourages the prediction from invariant causal features to labels de-confounded from the environments.
Recent theoretical results verified that some causal features recovered by IRLs merely pretend domain-invariantly in the training environments but fail in unseen domains.
We develop an approach based on conditional mutual information with respect to RS-SCM, then rigorously rectify the spurious and fake invariant effects.
arXiv Detail & Related papers (2023-12-15T12:58:05Z)
- RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose a novel Retrieval-Augmented Patch Generation framework (RAP-Gen).
RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z)
- Fast and Accurate Error Simulation for CNNs against Soft Errors [64.54260986994163]
We present a framework for the reliability analysis of Convolutional Neural Networks (CNNs) via an error simulation engine.
These error models are defined based on the corruption patterns of the output of the CNN operators induced by faults.
We show that our methodology achieves about 99% accuracy of the fault effects w.r.t. SASSIFI, and a speedup ranging from 44x up to 63x w.r.t. FI, which implements only a limited set of error models.
arXiv Detail & Related papers (2022-06-04T19:45:02Z)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors [105.12462629663757]
In this work, we aggregate factuality error annotations from nine existing datasets and stratify them according to the underlying summarization model.
We compare performance of state-of-the-art factuality metrics, including recent ChatGPT-based metrics, on this stratified benchmark and show that their performance varies significantly across different types of summarization models.
arXiv Detail & Related papers (2022-05-25T15:26:48Z)