Historian: Reducing Manual Validation in APR Benchmarking via Evidence-Based Assessment
- URL: http://arxiv.org/abs/2603.00649v1
- Date: Sat, 28 Feb 2026 13:41:29 GMT
- Title: Historian: Reducing Manual Validation in APR Benchmarking via Evidence-Based Assessment
- Authors: Sahand Moslemi, Mayasah Lami, Anil Koyuncu
- Abstract summary: We present Historian, a framework that leverages Large Language Models to perform multi-reference comparisons against a knowledge base of historically validated patches. In leave-one-tool-out evaluation, Historian achieves 95.0% coverage with 88.4% accuracy, reducing manual validation to 5% of patches.
- Score: 0.19853810231896352
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Assessing the correctness of patches generated by Automated Program Repair (APR) is a major bottleneck. Manual validation is labor-intensive and limited: exact matching overlooks valid variants, while semantic inspection is subjective and hard to reproduce. Existing Automated Patch Correctness Assessment (APCA) approaches often rely on opaque predictive models that treat each patch as novel, repeatedly re-assessing semantically redundant patches. Our analysis of a large corpus of tool-generated patches reveals a duality: about 39% of unique correct patches are syntactic clones, suggesting opportunities for automation, yet about 65% of bugs have multiple distinct correct fixes, making single-reference assessment insufficient. We present Historian, a framework that leverages Large Language Models to perform multi-reference comparisons against a knowledge base of historically validated patches, producing traceable, evidence-based verdicts while conservatively isolating novel cases as Unknown. In leave-one-tool-out evaluation, Historian achieves 95.0% coverage with 88.4% accuracy, reducing manual validation to 5% of patches. As an evidence-based pre-filter, Historian improves the accuracy of standalone APCA tools by up to 21.8% and enables a hybrid pipeline with 86.2% overall accuracy and 100% coverage. A longitudinal analysis of tool-generated patches (2020-2024) shows that redundancy in repair attempts is common, indicating that many patches repeatedly rediscover established fixes and strengthening the sustainability of evidence-based APR assessment.
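The abstract's core idea, that many tool-generated patches are syntactic clones of already-validated ones, so a knowledge-base lookup can resolve them and leave only novel patches for deeper assessment, can be sketched as follows. This is a minimal illustration with hypothetical names, not the authors' implementation: Historian uses LLM-based multi-reference comparison, whereas this sketch only shows the clone-matching and conservative "Unknown" fallback described in the abstract.

```python
# Hedged sketch of evidence-based patch assessment: match a candidate patch
# against a knowledge base of historically validated patches, and fall back
# to an "Unknown" verdict for novel cases. All names here are illustrative.
import re

def normalize(patch: str) -> str:
    """Crude syntactic normalization: strip comments, then all whitespace."""
    no_block = re.sub(r"/\*.*?\*/", "", patch, flags=re.DOTALL)  # /* ... */
    no_comments = re.sub(r"//[^\n]*", "", no_block)              # // ...
    return "".join(no_comments.split())

def assess(patch: str, knowledge_base: dict[str, str]) -> str:
    """Return the historical verdict if the patch is a syntactic clone of a
    validated one; otherwise conservatively report 'Unknown'."""
    return knowledge_base.get(normalize(patch), "Unknown")

# Knowledge base keyed by normalized patch text, mapping to past verdicts.
kb = {normalize("if (x == null) { return; } // guard"): "Correct"}

print(assess("if (x==null)   { return; }", kb))  # clone modulo formatting
print(assess("return defaultValue;", kb))        # novel -> manual review
```

In a full pipeline, the "Unknown" bucket is what would be routed to an LLM-based multi-reference comparison or, failing that, to manual validation; the abstract's 5% figure refers to the patches left over after the automated stages.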
Related papers
- HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam [63.84155758655084]
Humanity's Last Exam (HLE) has become a widely used benchmark for evaluating frontier large language models. We introduce HLE-Verified, a verified and revised version of HLE with a transparent verification protocol and a fine-grained error taxonomy. We evaluate seven state-of-the-art language models on HLE and HLE-Verified, observing an average absolute accuracy gain of 7-10 percentage points.
arXiv Detail & Related papers (2026-02-15T02:50:15Z) - PRIME: A Process-Outcome Alignment Benchmark for Verifiable Reasoning in Mathematics and Engineering [71.15346406323827]
We introduce PRIME, a benchmark for evaluating verifiers on Process-Outcome Alignment verification. We find that current verifiers frequently fail to detect derivation flaws. We propose a process-aware RLVR training paradigm utilizing verifiers selected via PRIME.
arXiv Detail & Related papers (2026-02-12T04:45:01Z) - See Less, Drive Better: Generalizable End-to-End Autonomous Driving via Foundation Models Stochastic Patch Selection [51.59559387222532]
Recent advances in end-to-end autonomous driving show that policies trained on patch-aligned features generalize better to Out-of-Distribution (OOD) scenarios. We present Stochastic Patch Selection (SPS), a simple yet effective approach for learning policies that are more robust, generalizable, and efficient.
arXiv Detail & Related papers (2026-01-15T18:58:33Z) - VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation. It implements a semantics-sensitive, multi-view detection pipeline, with each view aligned to a specific analysis perspective. On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable-fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z) - RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment [0.0]
We introduce RePaCA, a novel static APCA technique that leverages Large Language Models (LLMs) specialized in thinking tasks. Our approach achieves state-of-the-art performance, with 83.1% accuracy and an 84.8% F1-score.
arXiv Detail & Related papers (2025-07-30T11:21:09Z) - Parameter-Efficient Fine-Tuning with Attributed Patch Semantic Graph for Automated Patch Correctness Assessment [8.028183762381474]
Automated program repair (APR) aims to automatically repair program errors without human intervention. Many research efforts have been devoted to Automated Patch Correctness Assessment (APCA).
arXiv Detail & Related papers (2025-05-05T13:15:53Z) - All Patches Matter, More Patches Better: Enhance AI-Generated Image Detection via Panoptic Patch Learning [45.37237171823581]
The exponential growth of AI-generated images (AIGIs) underscores the urgent need for robust and generalizable detection methods. In this paper, we establish two key principles for AIGI detection through systematic analysis.
arXiv Detail & Related papers (2025-04-02T06:32:09Z) - Are "Solved Issues" in SWE-bench Really Solved Correctly? An Empirical Study [18.117047833029073]
The most popular benchmarks for automated issue solving are SWE-bench and its human-filtered subset SWE-bench Verified. This paper presents an in-depth empirical study of the correctness of plausible patches generated by three state-of-the-art issue-solving tools evaluated on SWE-bench Verified.
arXiv Detail & Related papers (2025-03-19T14:02:21Z) - Improving Bias Correction Standards by Quantifying its Effects on Treatment Outcomes [54.18828236350544]
Propensity score matching (PSM) addresses selection biases by selecting comparable populations for analysis.
Different matching methods can produce significantly different Average Treatment Effects (ATE) for the same task, even when meeting all validation criteria.
To address this issue, we introduce a novel metric, A2A, to reduce the number of valid matches.
arXiv Detail & Related papers (2024-07-20T12:42:24Z) - RAP-Gen: Retrieval-Augmented Patch Generation with CodeT5 for Automatic Program Repair [75.40584530380589]
We propose RAP-Gen, a novel Retrieval-Augmented Patch Generation framework. RAP-Gen explicitly leverages relevant fix patterns retrieved from a list of previous bug-fix pairs.
We evaluate RAP-Gen on three benchmarks in two programming languages, including the TFix benchmark in JavaScript, and Code Refinement and Defects4J benchmarks in Java.
arXiv Detail & Related papers (2023-09-12T08:52:56Z) - PatchCensor: Patch Robustness Certification for Transformers via Exhaustive Testing [7.88628640954152]
Vision Transformer (ViT) is known to be highly nonlinear, like other classical neural networks, and can be easily fooled by both natural and adversarial patch perturbations.
This limitation could pose a threat to the deployment of ViT in the real industrial environment, especially in safety-critical scenarios.
We propose PatchCensor, aiming to certify the patch robustness of ViT by applying exhaustive testing.
arXiv Detail & Related papers (2021-11-19T23:45:23Z) - Checking Patch Behaviour against Test Specification [4.723400023753107]
We propose a hypothesis on how patch behaviour can be linked to failing test specifications.
We then propose BATS, an unsupervised learning-based system to predict patch correctness.
arXiv Detail & Related papers (2021-07-28T11:39:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.