From LLMs to Agents: A Comparative Evaluation of LLMs and LLM-based Agents in Security Patch Detection
- URL: http://arxiv.org/abs/2511.08060v1
- Date: Wed, 12 Nov 2025 01:37:14 GMT
- Title: From LLMs to Agents: A Comparative Evaluation of LLMs and LLM-based Agents in Security Patch Detection
- Authors: Junxiao Han, Zheng Yu, Lingfeng Bao, Jiakun Liu, Yao Wan, Jianwei Yin, Shuiguang Deng, Song Han
- Abstract summary: Large language models (LLMs) and LLM-based agents have demonstrated remarkable capabilities in various software engineering tasks. We conduct a comprehensive evaluation of the performance of LLMs and LLM-based agents for security patch detection. Our findings reveal that the Data-Aug LLM achieves the best overall performance, whereas the ReAct Agent demonstrates the lowest false positive rate (FPR).
- Score: 42.089851895083804
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The widespread adoption of open-source software (OSS) has accelerated software innovation but also increased security risks due to the rapid propagation of vulnerabilities and silent patch releases. In recent years, large language models (LLMs) and LLM-based agents have demonstrated remarkable capabilities in various software engineering (SE) tasks, enabling them to effectively address software security challenges such as vulnerability detection. However, systematic evaluation of the capabilities of LLMs and LLM-based agents in security patch detection remains limited. To bridge this gap, we conduct a comprehensive evaluation of the performance of LLMs and LLM-based agents for security patch detection. Specifically, we investigate three methods: Plain LLM (a single LLM with a system prompt), Data-Aug LLM (data augmentation based on the Plain LLM), and the ReAct Agent (leveraging the thought-action-observation mechanism). We also evaluate the performance of both commercial and open-source LLMs under these methods and compare these results with those of existing baselines. Furthermore, we analyze the detection performance of these methods across various vulnerability types, and examine the impact of different prompting strategies and context window sizes on the results. Our findings reveal that the Data-Aug LLM achieves the best overall performance, whereas the ReAct Agent demonstrates the lowest false positive rate (FPR). Although baseline methods exhibit strong accuracy, their false positive rates are significantly higher. In contrast, our evaluated methods achieve comparable accuracy while substantially reducing the FPR. These findings provide valuable insights into the practical applications of LLMs and LLM-based agents in security patch detection, highlighting their advantage in maintaining robust performance while minimizing false positive rates.
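To make the three evaluated setups concrete, here is a minimal sketch of how they could be wired for classifying a commit as a security patch, assuming a generic chat-completion callable `call_llm`. The prompts, tool names, and augmentation inputs below are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch of the three evaluated setups: Plain LLM, Data-Aug LLM, ReAct Agent.
# `call_llm` is a placeholder for any chat-completion API (commercial or open-source).

def call_llm(messages):
    """Placeholder: send chat messages to an LLM and return its text reply."""
    raise NotImplementedError

SYSTEM_PROMPT = (
    "You are a security analyst. Given a commit message and diff, answer "
    "'security patch' or 'non-security patch'."
)

def plain_llm(commit_msg, diff):
    # Plain LLM: a single call with a system prompt.
    return call_llm([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Commit message:\n{commit_msg}\n\nDiff:\n{diff}"},
    ])

def data_aug_llm(commit_msg, diff, extra_context):
    # Data-Aug LLM: same prompt augmented with extra context (e.g., related
    # vulnerability descriptions); the exact augmentation is an assumption here.
    user = (f"Commit message:\n{commit_msg}\n\nDiff:\n{diff}\n\n"
            f"Additional context:\n{extra_context}")
    return call_llm([
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user},
    ])

def react_agent(commit_msg, diff, tools, max_steps=5):
    # ReAct Agent: iterate thought -> action -> observation until the model
    # emits a final answer. `tools` maps hypothetical action names
    # (e.g., "view_file", "search_cve") to callables.
    transcript = f"Commit message:\n{commit_msg}\n\nDiff:\n{diff}\n"
    for _ in range(max_steps):
        reply = call_llm([
            {"role": "system",
             "content": SYSTEM_PROMPT + " Reason in Thought/Action/Observation steps."},
            {"role": "user", "content": transcript},
        ])
        if "Final Answer:" in reply:
            return reply.split("Final Answer:", 1)[1].strip()
        transcript += reply + "\n"
        if "Action:" in reply:
            # Very loose parsing of 'Action: tool_name(argument)' for illustration only.
            action_line = reply.split("Action:", 1)[1].splitlines()[0].strip()
            name, _, arg = action_line.partition("(")
            observation = tools.get(name.strip(), lambda a: "unknown tool")(arg.rstrip(")"))
            transcript += f"Observation: {observation}\n"
    return "undetermined"
```

The same classification prompt is reused across all three setups so that differences in accuracy and FPR come from the added context (Data-Aug) or the iterative tool use (ReAct) rather than from prompt wording.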
Related papers
- On the Effectiveness of Instruction-Tuning Local LLMs for Identifying Software Vulnerabilities [0.7136933021609079]
Large Language Models (LLMs) show significant promise in automating software vulnerability analysis. Current approaches to using LLMs for automated vulnerability analysis mostly rely on online API-based LLM services. This paper addresses these limitations by reformulating the problem as Software Vulnerability Identification (SVI). We show that instruction-tuned local models represent a more effective, secure, and practical approach for leveraging LLMs in real-world vulnerability management.
arXiv Detail & Related papers (2025-12-23T05:30:53Z) - Evaluating LLMs for One-Shot Patching of Real and Artificial Vulnerabilities [2.5190317156807924]
We empirically evaluate the patching effectiveness and complementarity of several prominent Large Language Models (LLMs). Our results reveal that LLMs patch real vulnerabilities more effectively than artificial ones. Our analysis reveals significant variability across LLMs in terms of overlap (multiple LLMs patching the same vulnerabilities) and complementarity.
arXiv Detail & Related papers (2025-11-28T18:03:47Z) - Towards Effective Complementary Security Analysis using Large Language Models [3.203446435054805]
A key challenge in security analysis is the manual evaluation of potential security weaknesses generated by static application security testing (SAST) tools. We propose using Large Language Models (LLMs) to improve the assessment of SAST findings.
arXiv Detail & Related papers (2025-06-20T10:46:35Z) - LLM Embedding-based Attribution (LEA): Quantifying Source Contributions to Generative Model's Response for Vulnerability Analysis [1.3543506826034255]
Large Language Models (LLMs) are increasingly used for cybersecurity threat analysis, but their deployment in security-sensitive environments raises trust and safety concerns. This work proposes LLM Embedding-based Attribution (LEA) to analyze generated responses for vulnerability exploitation analysis. Our results demonstrate LEA's ability to detect clear distinctions between non-retrieval, generic-retrieval, and valid-retrieval scenarios with over 95% accuracy on larger models.
arXiv Detail & Related papers (2025-06-12T21:20:10Z) - Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Uncovering Safety Risks of Large Language Models through Concept Activation Vector [13.804245297233454]
We introduce a Safety Concept Activation Vector (SCAV) framework to guide attacks on large language models (LLMs). We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks. Our attack method significantly improves the attack success rate and response quality while requiring less training data.
arXiv Detail & Related papers (2024-04-18T09:46:25Z) - Multitask-based Evaluation of Open-Source LLM on Software Vulnerability [2.7692028382314815]
This paper proposes a pipeline for quantitatively evaluating interactive Large Language Models (LLMs) using publicly available datasets.
We carry out an extensive technical evaluation of LLMs using Big-Vul covering four different common software vulnerability tasks.
We find that the existing state-of-the-art approaches and pre-trained Language Models (LMs) are generally superior to LLMs in software vulnerability detection.
arXiv Detail & Related papers (2024-04-02T15:52:05Z) - Comprehensive Reassessment of Large-Scale Evaluation Outcomes in LLMs: A Multifaceted Statistical Approach [64.42462708687921]
Evaluations have revealed that factors such as scaling, training type, and architecture profoundly impact the performance of LLMs.
Our study embarks on a thorough re-examination of these LLMs, targeting the inadequacies in current evaluation methods.
This includes the application of ANOVA, Tukey HSD tests, GAMM, and clustering techniques.
arXiv Detail & Related papers (2024-03-22T14:47:35Z) - Is LLM-as-a-Judge Robust? Investigating Universal Adversarial Attacks on Zero-shot LLM Assessment [8.948475969696075]
Large Language Models (LLMs) are powerful zero-shot assessors used in real-world situations such as assessing written exams and benchmarking systems.
We show that judge-LLMs can be deceived by short universal adversarial phrases into predicting inflated scores.
It is found that judge-LLMs are significantly more susceptible to these adversarial attacks when used for absolute scoring.
arXiv Detail & Related papers (2024-02-21T18:55:20Z) - Benchmarking LLMs via Uncertainty Quantification [91.72588235407379]
The proliferation of open-source Large Language Models (LLMs) has highlighted the urgent need for comprehensive evaluation methods.
We introduce a new benchmarking approach for LLMs that integrates uncertainty quantification.
Our findings reveal that: I) LLMs with higher accuracy may exhibit lower certainty; II) Larger-scale LLMs may display greater uncertainty compared to their smaller counterparts; and III) Instruction-finetuning tends to increase the uncertainty of LLMs.
arXiv Detail & Related papers (2024-01-23T14:29:17Z) - Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets.
Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets.
We find that advanced prompting strategies that involve step-by-step analysis significantly improve the performance of LLMs on real-world datasets in terms of F1 score (by up to 0.18 on average).
arXiv Detail & Related papers (2023-11-16T13:17:20Z) - Assessing the Reliability of Large Language Model Knowledge [78.38870272050106]
Large language models (LLMs) have been treated as knowledge bases due to their strong performance in knowledge probing tasks.
How do we evaluate the capabilities of LLMs to consistently produce factually correct answers?
We propose MOdel kNowledge relIabiliTy scORe (MONITOR), a novel metric designed to directly measure LLMs' factual reliability.
arXiv Detail & Related papers (2023-10-15T12:40:30Z) - Survey on Factuality in Large Language Models: Knowledge, Retrieval and Domain-Specificity [61.54815512469125]
This survey addresses the crucial issue of factuality in Large Language Models (LLMs)
As LLMs find applications across diverse domains, the reliability and accuracy of their outputs become vital.
arXiv Detail & Related papers (2023-10-11T14:18:03Z)