The Semantic Trap: Do Fine-tuned LLMs Learn Vulnerability Root Cause or Just Functional Pattern?
- URL: http://arxiv.org/abs/2601.22655v2
- Date: Mon, 02 Feb 2026 09:48:57 GMT
- Title: The Semantic Trap: Do Fine-tuned LLMs Learn Vulnerability Root Cause or Just Functional Pattern?
- Authors: Feiyang Huang, Yuqiang Sun, Fan Zhang, Ziqi Yang, Han Liu, Yang Liu,
- Abstract summary: We propose TrapEval, a comprehensive evaluation framework designed to disentangle vulnerability root cause from functional pattern.<n>We fine-tune five representative state-of-the-art LLMs across three model families and evaluate them under cross-dataset testing, semantic-preservings, and varying degrees of semantic gap measured by CodeBLEU.<n>Our findings serve as a wake-up call: high benchmark scores on traditional datasets may be illusory, masking the model's inability to understand the true causal logic of vulnerabilities.
- Score: 14.472036099680961
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: LLMs demonstrate promising performance in software vulnerability detection after fine-tuning. However, it remains unclear whether these gains reflect a genuine understanding of vulnerability root causes or merely an exploitation of functional patterns. In this paper, we identify a critical failure mode termed the "semantic trap," where fine-tuned LLMs achieve high detection scores by associating certain functional domains with vulnerability likelihood rather than reasoning about the underlying security semantics. To systematically evaluate this phenomenon, we propose TrapEval, a comprehensive evaluation framework designed to disentangle vulnerability root cause from functional pattern. TrapEval introduces two complementary datasets derived from real-world open-source projects: V2N, which pairs vulnerable code with unrelated benign code, and V2P, which pairs vulnerable code with its corresponding patched version, forcing models to distinguish near-identical code that differs only in subtle security-critical logic. Using TrapEval, we fine-tune five representative state-of-the-art LLMs across three model families and evaluate them under cross-dataset testing, semantic-preserving perturbations, and varying degrees of semantic gap measured by CodeBLEU. Our empirical results reveal that, despite improvements in metrics, fine-tuned LLMs consistently struggle to distinguish vulnerable code from its patched counterpart, exhibit severe robustness degradation under minor semantic-preserving transformations, and rely heavily on functional-context shortcuts when the semantic gap is small. These findings provide strong evidence that current fine-tuning practices often fail to impart true vulnerability reasoning. Our findings serve as a wake-up call: high benchmark scores on traditional datasets may be illusory, masking the model's inability to understand the true causal logic of vulnerabilities.
Related papers
- TraceGuard: Process-Guided Firewall against Reasoning Backdoors in Large Language Models [19.148124494194317]
We propose TraceGuard, a process-guided security framework that transforms small-scale models into robust reasoning firewalls.<n>Our approach treats the reasoning trace as an untrusted payload and establishes a defense-in-depth strategy.<n>We demonstrate robustness against adaptive adversaries in a grey-box setting, establishing TraceGuard as a viable, low-latency security primitive.
arXiv Detail & Related papers (2026-03-02T22:19:13Z) - Semantics-Preserving Evasion of LLM Vulnerability Detectors [14.476903104601154]
LLM-based vulnerability detectors are increasingly deployed in security-critical code review.<n>We evaluate detection-time integrity under a semantics-preserving threat model.<n>We introduce a metric of joint robustness across different attack methods/carriers.
arXiv Detail & Related papers (2026-01-30T20:54:27Z) - RealSec-bench: A Benchmark for Evaluating Secure Code Generation in Real-World Repositories [58.32028251925354]
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, but their proficiency in producing secure code remains a critical, under-explored area.<n>We introduce RealSec-bench, a new benchmark for secure code generation meticulously constructed from real-world, high-risk Java repositories.
arXiv Detail & Related papers (2026-01-30T08:29:01Z) - Why Does the LLM Stop Computing: An Empirical Study of User-Reported Failures in Open-Source LLMs [50.075587392477935]
We conduct the first large-scale empirical study of 705 real-world failures from the open-source DeepSeek, Llama, and Qwen ecosystems.<n>Our analysis reveals a paradigm shift: white-box orchestration relocates the reliability bottleneck from model algorithmic defects to the systemic fragility of the deployment stack.
arXiv Detail & Related papers (2026-01-20T06:42:56Z) - SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking [58.475471437150674]
We propose sequential watermarking for soft prompts (SWAP)<n>SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes.<n>Experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential adaptive attacks.
arXiv Detail & Related papers (2025-11-05T13:48:48Z) - DeLeaker: Dynamic Inference-Time Reweighting For Semantic Leakage Mitigation in Text-to-Image Models [55.30555646945055]
Text-to-Image (T2I) models are vulnerable to semantic leakage.<n>We introduce DeLeaker, a lightweight approach that mitigates leakage by directly intervening on the model's attention maps.<n>SLIM is the first dataset dedicated to semantic leakage.
arXiv Detail & Related papers (2025-10-16T17:39:21Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics.<n>We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation.<n>It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective.<n>On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z) - Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks [0.5218155982819203]
Large Language Models (LLMs) are increasingly used as code assistants.<n>This study examines a more direct threat: open-source LLMs generating vulnerable code when prompted.
arXiv Detail & Related papers (2025-07-14T08:36:26Z) - Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data [22.557961978833386]
We propose a novel framework for large language models (LLMs) that excels at mining vulnerability patterns.<n>Specifically, we construct forward and backward reasoning processes for vulnerability and corresponding fixed code, ensuring the synthesis of high-quality reasoning data.<n>We show that ReVD sets new state-of-the-art for LLM-based software vulnerability detection, e.g., 12.24%-22.77% improvement in the accuracy.
arXiv Detail & Related papers (2025-06-09T03:25:23Z) - Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask [30.819697001992154]
Large Language Models are a promising tool for automated vulnerability detection.<n>Despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities?<n>This paper challenges three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales.
arXiv Detail & Related papers (2025-04-18T05:32:47Z) - Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories [8.583591493627276]
We introduce JitVul, a vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits.<n>We show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code.
arXiv Detail & Related papers (2025-03-05T15:22:24Z) - FactCHD: Benchmarking Fact-Conflicting Hallucination Detection [64.4610684475899]
FactCHD is a benchmark designed for the detection of fact-conflicting hallucinations from LLMs.
FactCHD features a diverse dataset that spans various factuality patterns, including vanilla, multi-hop, comparison, and set operation.
We introduce Truth-Triangulator that synthesizes reflective considerations by tool-enhanced ChatGPT and LoRA-tuning based on Llama2.
arXiv Detail & Related papers (2023-10-18T16:27:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.