Rethinking the Capability of Fine-Tuned Language Models for Automated Vulnerability Repair
- URL: http://arxiv.org/abs/2512.22633v1
- Date: Sat, 27 Dec 2025 16:12:43 GMT
- Title: Rethinking the Capability of Fine-Tuned Language Models for Automated Vulnerability Repair
- Authors: Woorim Han, Yeongjun Kwak, Miseon Yu, Kyeongmin Kim, Younghan Lee, Hyungon Moon, Yunheung Paek
- Abstract summary: Learning-based automated vulnerability repair (AVR) techniques that utilize fine-tuned language models have shown promise in generating vulnerability patches. Our empirical study reveals that state-of-the-art models often overfit to the training set and are evaluated using training, validation, and test sets that are not mutually exclusive. We introduce L-AVRBench, a test-based benchmark tailored for learning-based AVR, to overcome the limitations of match-based metrics and examine the models' true repair capabilities.
- Score: 5.847724760751716
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning-based automated vulnerability repair (AVR) techniques that utilize fine-tuned language models have shown promise in generating vulnerability patches. However, questions remain about their ability to repair unseen vulnerabilities. Our empirical study reveals that state-of-the-art models often overfit to the training set and are evaluated using training, validation, and test sets that are not mutually exclusive. Furthermore, relying on match-based metrics that compare generated patches to reference fixes at the token level has some limitations, failing to account for the possibility of various valid ways to patch the vulnerability. In this paper, we examine the capabilities of state-of-the-art fine-tuned AVR models and the adequacy of match-based evaluation metrics in three ways. First, we apply semantic-preserving transformations to test sets in order to determine whether models truly learn robust vulnerability-repair patterns or simply rely on spurious features. Second, we re-split the training, validation, and test sets to be mutually exclusive and evaluate the models on the revised test set to assess their generalization capabilities. Third, we introduce L-AVRBench, a test-based benchmark tailored for learning-based AVR, to overcome the limitations of match-based metrics and examine the AVR models' true repair capabilities.
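As an illustration of the abstract's first evaluation step, a semantic-preserving transformation can be as simple as alpha-renaming bound identifiers: the program's behavior is unchanged, but surface tokens a model may have memorized are disturbed. The sketch below is a hypothetical minimal example for Python snippets (the paper's actual transformation suite and target language are not specified here); it renames only locally bound names so that calls to builtins like `range` and `print` are left intact:

```python
import ast

def collect_locals(tree: ast.AST) -> set:
    """Names bound by assignment or loop targets are safe to rename."""
    return {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store)
    }

class AlphaRename(ast.NodeTransformer):
    """Rewrite every occurrence of a bound name to a fresh identifier."""
    def __init__(self, bound: set):
        self.mapping = {name: f"v{i}" for i, name in enumerate(sorted(bound))}

    def visit_Name(self, node: ast.Name) -> ast.Name:
        # Unbound names (builtins, imports) fall through unchanged.
        node.id = self.mapping.get(node.id, node.id)
        return node

def transform(source: str) -> str:
    """Return a semantically equivalent program with locals renamed."""
    tree = ast.parse(source)
    renamed = AlphaRename(collect_locals(tree)).visit(tree)
    return ast.unparse(renamed)

snippet = "total = 0\nfor item in range(5):\n    total = total + item\nprint(total)"
# Prints the same program with locals renamed (e.g. total -> v1, item -> v0).
print(transform(snippet))
```

If a model's match-based score drops sharply on such trivially transformed test inputs, that is evidence it relies on spurious surface features rather than robust repair patterns, which is the kind of probe the first step above describes.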
Related papers
- Inference-time Unlearning Using Conformal Prediction [13.479885316485209]
Unlearning is the process of efficiently removing specific information from a trained machine learning model without retraining from scratch. This paper introduces a framework that iteratively refines the quality of the generated responses using feedback from the verifier without updating the model parameters. The proposed approach significantly outperforms existing state-of-the-art methods, reducing unlearning error by up to 93% across challenging unlearning benchmarks.
arXiv Detail & Related papers (2026-02-03T17:46:50Z) - Learning to Repair Lean Proofs from Compiler Feedback [4.55626337217127]
We study Lean proof repair as a supervised learning problem. We introduce APRIL (Automated Proof Repair in Lean), a dataset of 260,000 supervised theorems. We view diagnostic-conditioned supervision as a complementary training signal for feedback-using provers.
arXiv Detail & Related papers (2026-02-03T01:53:56Z) - Reference-Specific Unlearning Metrics Can Hide the Truth: A Reality Check [60.77691669644931]
We propose Functional Alignment for Distributional Equivalence (FADE), a novel metric that measures distributional similarity between unlearned and reference models. We show that FADE captures functional alignment across the entire output distribution, providing a principled assessment of genuine unlearning. These findings expose fundamental gaps in current evaluation practices and demonstrate that FADE provides a more robust foundation for developing and assessing truly effective unlearning methods.
arXiv Detail & Related papers (2025-10-14T20:50:30Z) - Semantics-Aligned, Curriculum-Driven, and Reasoning-Enhanced Vulnerability Repair Framework [15.17681731375364]
SeCuRepair is a semantics-aligned, curriculum-driven, and reasoning-enhanced framework for vulnerability repair. At its core, SeCuRepair adopts a reason-then-edit paradigm, requiring the model to articulate why and how a vulnerability should be fixed. SeCuRepair also moves beyond traditional supervised fine-tuning and employs semantics-aware reinforcement learning.
arXiv Detail & Related papers (2025-10-01T15:09:27Z) - Scalable and Robust LLM Unlearning by Correcting Responses with Retrieved Exclusions [49.55618517046225]
Language models trained on web-scale corpora risk memorizing and exposing sensitive information. We propose Corrective Unlearning with Retrieved Exclusions (CURE), a novel unlearning framework. CURE verifies model outputs for leakage and revises them into safe responses.
arXiv Detail & Related papers (2025-09-30T09:07:45Z) - RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment [0.0]
We introduce RePaCA, a novel static APCA technique that leverages Large Language Models (LLMs) specialized in thinking tasks. Our approach achieves state-of-the-art performance, with 83.1% accuracy and an 84.8% F1-score.
arXiv Detail & Related papers (2025-07-30T11:21:09Z) - It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective [14.271145160443462]
VulTegra compares scratch-trained and pre-trained DL models for vulnerability detection. State-of-the-art (SOTA) detectors still suffer from low consistency, limited real-world capabilities, and scalability challenges.
arXiv Detail & Related papers (2025-07-13T08:02:56Z) - ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
arXiv Detail & Related papers (2025-02-16T16:31:00Z) - Self-Improvement in Language Models: The Sharpening Mechanism [70.9248553790022]
We offer a new perspective on the capabilities of self-improvement through a lens we refer to as sharpening. Motivated by the observation that language models are often better at verifying response quality than they are at generating correct responses, we formalize self-improvement as using the model itself as a verifier during post-training. We analyze two natural families of self-improvement algorithms based on SFT and RLHF.
arXiv Detail & Related papers (2024-12-02T20:24:17Z) - A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that reasonable use of phonetic and graphic information is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which exposes their shortcomings.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z) - Uncovering the Limits of Machine Learning for Automatic Vulnerability Detection [12.529028629599349]
We propose a novel benchmarking methodology to help researchers better evaluate the true capabilities and limits of ML4VD techniques.
Using six ML4VD techniques and two datasets, we find (a) that state-of-the-art models severely overfit to unrelated features for predicting the vulnerabilities in the testing data, and (b) that the performance gained by data augmentation does not generalize beyond the specific augmentations applied during training.
arXiv Detail & Related papers (2023-06-28T08:41:39Z) - Position: AI Evaluation Should Learn from How We Test Humans [65.36614996495983]
We argue that psychometrics, a theory originating in the 20th century for human assessment, could be a powerful solution to the challenges in today's AI evaluations.
arXiv Detail & Related papers (2023-06-18T09:54:33Z) - A Unified Evaluation of Textual Backdoor Learning: Frameworks and Benchmarks [72.7373468905418]
We develop an open-source toolkit OpenBackdoor to foster the implementations and evaluations of textual backdoor learning.
We also propose CUBE, a simple yet strong clustering-based defense baseline.
arXiv Detail & Related papers (2022-06-17T02:29:23Z) - A Principled Approach to Failure Analysis and Model Repairment: Demonstration in Medical Imaging [12.732665048388041]
Machine learning models commonly exhibit unexpected failures post-deployment.
We aim to standardise this process and ground it in principles by answering two critical questions.
We suggest that the quality of the identified failure types can be validated through measuring the intra- and inter-type generalisation.
We argue that a model can be considered repaired if it achieves high accuracy on the failure types while retaining performance on the previously correct data.
arXiv Detail & Related papers (2021-09-25T12:04:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.