Related papers: Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models

Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models

URL: http://arxiv.org/abs/2602.06687v1
Date: Fri, 06 Feb 2026 13:19:45 GMT
Title: Evaluating and Enhancing the Vulnerability Reasoning Capabilities of Large Language Models
Authors: Li Lu, Yanjie Zhao, Hongzhou Rao, Kechi Zhang, Haoyu Wang,
Abstract summary: We propose DAGVul, a novel framework that models vulnerability reasoning as a Directed Acyclic Graph (DAG) generation task.<n>By further introducing Reinforcement Learning with Verifiable Rewards (RLVR), we align model reasoning trace with program-intrinsic logic.<n>Our framework improves the reasoning F1-score by an average of 18.9% over all the baselines.
Score: 15.849480549367684
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency in vulnerability detection. However, a critical reliability gap persists: models frequently yield correct detection verdicts based on hallucinated logic or superficial patterns that deviate from the actual root cause. This misalignment remains largely obscured because contemporary benchmarks predominantly prioritize coarse-grained classification metrics, lacking the granular ground truth required to evaluate the underlying reasoning process. To bridge this gap, we first construct a benchmark consisting of two datasets: (1) real-world vulnerabilities with expert-curated causal reasoning as ground truth, and (2) semantically equivalent code perturbations for assessing reasoning robustness. Our large-scale empirical study reveals that even state-of-the-art models struggle to maintain logical consistency during semantic code comprehension, exhibiting 12 systematic failure patterns. Addressing these limitations, we propose DAGVul, a novel framework that models vulnerability reasoning as a Directed Acyclic Graph (DAG) generation task. Unlike linear chain-of-thought (CoT), our approach explicitly maps causal dependencies to enforce structural consistency. By further introducing Reinforcement Learning with Verifiable Rewards (RLVR), we align model reasoning trace with program-intrinsic logic. Experimental results demonstrate that our framework improves the reasoning F1-score by an average of 18.9% over all the baselines. Remarkably, our 8B-parameter implementation not only outperforms existing models of comparable scale but also surpasses specialized large-scale reasoning models, including Qwen3-30B-Reasoning and GPT-OSS-20B-High. It is even competitive with state-of-the-art models like Claude-Sonnet-4.5 (75.47% vs. 76.11%), establishing new efficiency in vulnerability reasoning across model scales.

Related papers

Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models [2.5170433424424874]
Reinforcement Learning with Verifiable Rewards has established itself as the dominant paradigm for instilling rigorous reasoning capabilities in Large Language Models.<n>We identify a critical pathology in this alignment process: the systematic suppression of valid but rare (low-likelihood under the base model distribution) reasoning paths.<n>We propose Amortized Reasoning Tree Search (ARTS) to counteract this collapse without discarding the base model's latent diversity.
arXiv Detail & Related papers (2026-02-13T11:52:50Z)
STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction [78.0692157478247]
We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning.<n>We show that STAR consistently outperforms all baselines on both score-based and rank-based metrics.
arXiv Detail & Related papers (2026-02-12T16:30:07Z)
Pushing the Boundaries of Natural Reasoning: Interleaved Bonus from Formal-Logic Verification [49.506412445511934]
Large Language Models (LLMs) show remarkable capabilities, yet their next-token prediction creates logical inconsistencies and reward hacking.<n>We introduce a formal logic verification-guided framework that dynamically interleaves formal symbolic verification with the natural language generation process.<n>We operationalize this framework via a novel two-stage training pipeline that synergizes formal logic verification-guided supervised fine-tuning and policy optimization.
arXiv Detail & Related papers (2026-01-30T07:01:25Z)
Why Self-Rewarding Works: Theoretical Guarantees for Iterative Alignment of Language Models [50.248686344277246]
Self-Rewarding Language Models (SRLMs) achieve notable success in iteratively improving alignment without external feedback.<n>This paper provides the first rigorous theoretical guarantees for SRLMs.
arXiv Detail & Related papers (2026-01-30T03:45:43Z)
EpiCaR: Knowing What You Don't Know Matters for Better Reasoning in LLMs [9.412828452977553]
Existing approaches reinforce successful reasoning paths, incurring a substantial calibration cost.<n>This failure has been characterized as a form of model collapse in alignment.<n>We proposeEpiCaR as a training objective that jointly optimize reasoning performance and calibration.
arXiv Detail & Related papers (2026-01-11T06:21:13Z)
The Drill-Down and Fabricate Test (DDFT): A Protocol for Measuring Epistemic Robustness in Language Models [0.0]
Current language model evaluations measure what models know under ideal conditions but not how robustly they know it under realistic stress.<n>We introduce the Drill-Down Fabricate Test (DDFT), a protocol that measures robustness.<n>We find flagship models exhibit brittleness despite their scale, while smaller models can achieve robust performance.
arXiv Detail & Related papers (2025-12-29T20:29:09Z)
Catch Me If You Can: How Smaller Reasoning Models Pretend to Reason with Mathematical Fidelity [15.774418410083515]
We introduce a diagnostic framework that distinguishes genuine mathematical reasoning from superficial pattern matching.<n>We reveal a striking disconnect between surface performance and reasoning fidelity.<n>Our diagnostics expose reasoning failures invisible to traditional accuracy metrics.
arXiv Detail & Related papers (2025-11-29T16:47:01Z)
Causal Reasoning in Pieces: Modular In-Context Learning for Causal Discovery [6.72184534513047]
Causal inference remains a fundamental challenge for large language models.<n>Recent advances in internal reasoning with large language models have sparked interest.<n>We study causal discovery on the Corr2Cause benchmark using the OpenAI's o-series and DeepSeek-R model families.
arXiv Detail & Related papers (2025-07-31T12:10:27Z)
Lost at the Beginning of Reasoning [85.17612793300238]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction.<n>We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps.
arXiv Detail & Related papers (2025-06-27T09:53:57Z)
Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
CodeCrash: Exposing LLM Fragility to Misleading Natural Language in Code Reasoning [40.88253756147561]
We introduce CodeCrash, a stress-testing framework with 1,279 questions from CruxEval and LiveCodeBench.<n>We find that models often shortcut reasoning by over-relying on NL cues, leading to an average performance degradation of 23.2% in output prediction tasks.<n>Even with Chain-of-Thought reasoning, models on average still have a 13.8% drop due to distractibility and rationalization.
arXiv Detail & Related papers (2025-04-19T00:40:28Z)
Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage.<n>Models may behave unreliably due to poorly explored failure modes.<n> causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z)
Advancing Counterfactual Inference through Nonlinear Quantile Regression [77.28323341329461]
We propose a framework for efficient and effective counterfactual inference implemented with neural networks. The proposed approach enhances the capacity to generalize estimated counterfactual outcomes to unseen data. Empirical results conducted on multiple datasets offer compelling support for our theoretical assertions.
arXiv Detail & Related papers (2023-06-09T08:30:51Z)

This list is automatically generated from the titles and abstracts of the papers in this site.