ReVul-CoT: Towards Effective Software Vulnerability Assessment with Retrieval-Augmented Generation and Chain-of-Thought Prompting
- URL: http://arxiv.org/abs/2511.17027v1
- Date: Fri, 21 Nov 2025 08:01:49 GMT
- Title: ReVul-CoT: Towards Effective Software Vulnerability Assessment with Retrieval-Augmented Generation and Chain-of-Thought Prompting
- Authors: Zhijie Chen, Xiang Chen, Ziming Li, Jiacheng Xue, Chaoyang Gao,
- Abstract summary: We propose a novel framework that integrates Retrieval-Augmented Generation (RAG) with Chain-of-Thought (COT) prompting.<n>In ReVul-CoT, the RAG module dynamically retrieves contextually relevant information from a constructed local knowledge base.<n>Building on DeepSeek-V3.1, CoT prompting guides the LLM to perform step-by-step reasoning over exploitability, impact scope, and related factors.
- Score: 9.735224996021591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Context: Software Vulnerability Assessment (SVA) plays a vital role in evaluating and ranking vulnerabilities in software systems to ensure their security and reliability. Objective: Although Large Language Models (LLMs) have recently shown remarkable potential in SVA, they still face two major limitations. First, most LLMs are trained on general-purpose corpora and thus lack domain-specific knowledge essential for effective SVA. Second, they tend to rely on shallow pattern matching instead of deep contextual reasoning, making it challenging to fully comprehend complex code semantics and their security implications. Method: To alleviate these limitations, we propose a novel framework ReVul-CoT that integrates Retrieval-Augmented Generation (RAG) with Chain-of-Thought (COT) prompting. In ReVul-CoT, the RAG module dynamically retrieves contextually relevant information from a constructed local knowledge base that consolidates vulnerability data from authoritative sources (such as NVD and CWE), along with corresponding code snippets and descriptive information. Building on DeepSeek-V3.1, CoT prompting guides the LLM to perform step-by-step reasoning over exploitability, impact scope, and related factors Results: We evaluate ReVul-CoT on a dataset of 12,070 vulnerabilities. Experimental results show that ReVul-CoT outperforms state-of-the-art SVA baselines by 16.50%-42.26% in terms of MCC, and outperforms the best baseline by 10.43%, 15.86%, and 16.50% in Accuracy, F1-score, and MCC, respectively. Our ablation studies further validate the contributions of considering dynamic retrieval, knowledge integration, and CoT-based reasoning. Conclusion: Our results demonstrate that combining RAG with CoT prompting significantly enhances LLM-based SVA and points out promising directions for future research.
Related papers
- Multi-hop Reasoning via Early Knowledge Alignment [68.28168992785896]
Early Knowledge Alignment (EKA) aims to align Large Language Models with contextually relevant retrieved knowledge.<n>EKA significantly improves retrieval precision, reduces cascading errors, and enhances both performance and efficiency.<n>EKA proves effective as a versatile, training-free inference strategy that scales seamlessly to large models.
arXiv Detail & Related papers (2025-12-23T08:14:44Z) - VULPO: Context-Aware Vulnerability Detection via On-Policy LLM Optimization [2.6678231901651723]
This paper introduces Vulnerability-Adaptive Policy Optimization (VULPO), an on-policy LLM reinforcement learning framework for context-aware vulnerability detection.<n>To support training and evaluation, we first construct ContextVul, a new dataset that augments high-quality function-level samples with lightweight method to extract repository-level context information.<n>To address the asymmetric difficulty of different vulnerability cases and mitigate reward hacking, VULPO incorporates label-level and sample-level difficulty-adaptive reward scaling.
arXiv Detail & Related papers (2025-11-14T21:57:48Z) - RAG-Zeval: Towards Robust and Interpretable Evaluation on RAG Responses through End-to-End Rule-Guided Reasoning [64.46921169261852]
RAG-Zeval is a novel end-to-end framework that formulates faithfulness and correctness evaluation as a rule-guided reasoning task.<n>Our approach trains evaluators with reinforcement learning, facilitating compact models to generate comprehensive and sound assessments.<n>Experiments demonstrate RAG-Zeval's superior performance, achieving the strongest correlation with human judgments.
arXiv Detail & Related papers (2025-05-28T14:55:33Z) - SVA-ICL: Improving LLM-based Software Vulnerability Assessment via In-Context Learning and Information Fusion [5.06185582943982]
We introduce a novel approach SVA-ICL, which leverages in-context learning (ICL) to enhance large language models (LLMs) performance.<n>We evaluate the effectiveness of SVA-ICL using a large-scale dataset comprising 12,071 C/C++ vulnerabilities.
arXiv Detail & Related papers (2025-05-15T06:43:32Z) - Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection.<n>Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities?<n>This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z) - R2Vul: Learning to Reason about Software Vulnerabilities with Reinforcement Learning and Structured Reasoning Distillation [11.15622721122057]
Large language models (LLMs) have shown promising performance in software vulnerability detection, yet their reasoning capabilities remain unreliable.<n>We propose R2Vul, a method that combines reinforcement learning from AI feedback (RLAIF) and structured reasoning distillation.<n>We evaluate R2Vul across five programming languages and against four static analysis tools.
arXiv Detail & Related papers (2025-04-07T03:04:16Z) - Reasoning with LLMs for Zero-Shot Vulnerability Detection [0.9208007322096533]
We present textbfVulnSage, a comprehensive evaluation framework and a curated dataset from diverse, large-scale open-source system software projects.<n>The framework supports multi-granular analysis across function, file, and inter-function levels.<n>It employs four diverse zero-shot prompt strategies: Baseline, Chain-of-context, Think, and Think & verify.
arXiv Detail & Related papers (2025-03-22T23:59:17Z) - REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the textbfREliability and textbfVALue of Large Vision-Language Models.<n>REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values.<n>We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv Detail & Related papers (2025-03-20T07:54:35Z) - The Dual-use Dilemma in LLMs: Do Empowering Ethical Capacities Make a Degraded Utility? [54.18519360412294]
Large Language Models (LLMs) must balance between rejecting harmful requests for safety and accommodating legitimate ones for utility.<n>This paper presents a Direct Preference Optimization (DPO) based alignment framework that achieves better overall performance.<n>We analyze experimental results obtained from testing DeepSeek-R1 on our benchmark and reveal the critical ethical concerns raised by this highly acclaimed model.
arXiv Detail & Related papers (2025-01-20T06:35:01Z) - SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations.<n>We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations.<n>Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z) - SAFE: Advancing Large Language Models in Leveraging Semantic and Syntactic Relationships for Software Vulnerability Detection [23.7268575752712]
Software vulnerabilities (SVs) have emerged as a prevalent and critical concern for safety-critical security systems.
We propose a novel framework that enhances the capability of large language models to learn and utilize semantic and syntactic relationships from source code data for SVD.
arXiv Detail & Related papers (2024-09-02T00:49:02Z) - RAGEval: Scenario Specific RAG Evaluation Dataset Generation Framework [66.93260816493553]
This paper introduces RAGEval, a framework designed to assess RAG systems across diverse scenarios.<n>With a focus on factual accuracy, we propose three novel metrics: Completeness, Hallucination, and Irrelevance.<n> Experimental results show that RAGEval outperforms zero-shot and one-shot methods in terms of clarity, safety, conformity, and richness of generated samples.
arXiv Detail & Related papers (2024-08-02T13:35:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.