Using LLMs for Security Advisory Investigations: How Far Are We?
- URL: http://arxiv.org/abs/2506.13161v1
- Date: Mon, 16 Jun 2025 07:17:34 GMT
- Title: Using LLMs for Security Advisory Investigations: How Far Are We?
- Authors: Bayu Fedra Abdullah, Yusuf Sulistyo Nugroho, Brittany Reid, Raula Gaikovina Kula, Kazumasa Shimari, Kenichi Matsumoto
- Abstract summary: Large Language Models (LLMs) are increasingly used in software security, but their trustworthiness in generating accurate vulnerability advisories remains uncertain. This study investigates the ability of ChatGPT to (1) generate plausible security advisories from CVE-IDs, (2) differentiate real from fake CVE-IDs, and (3) extract CVE-IDs from advisory descriptions.
- Score: 2.916588882952662
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are increasingly used in software security, but their trustworthiness in generating accurate vulnerability advisories remains uncertain. This study investigates the ability of ChatGPT to (1) generate plausible security advisories from CVE-IDs, (2) differentiate real from fake CVE-IDs, and (3) extract CVE-IDs from advisory descriptions. Using a curated dataset of 100 real and 100 fake CVE-IDs, we manually analyzed the credibility and consistency of the model's outputs. The results show that ChatGPT generated plausible security advisories for 96% of the real and 97% of the fake CVE-IDs given as input, demonstrating a limitation in differentiating between real and fake IDs. Furthermore, when these generated advisories were reintroduced to ChatGPT to identify their original CVE-IDs, the model produced a fake CVE-ID in 6% of cases involving advisories generated from real CVE-IDs. These findings highlight both the strengths and limitations of ChatGPT in cybersecurity applications. While the model shows potential for automating advisory generation, its inability to reliably authenticate CVE-IDs or maintain consistency upon re-evaluation underscores the risks of deploying it in critical security tasks. Our study emphasizes the importance of using LLMs with caution in cybersecurity workflows and points to the need for design improvements that increase their reliability and applicability in security advisory generation.
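To make the evaluated workflow concrete, here is a minimal sketch of the three tasks as a round-trip check, assuming an OpenAI-style chat API. The model name, prompts, and sample CVE-ID are illustrative assumptions; the paper's exact setup may differ.

```python
import re
from openai import OpenAI  # assumes the OpenAI Python SDK; the study used ChatGPT

client = OpenAI()  # reads OPENAI_API_KEY from the environment
CVE_PATTERN = re.compile(r"CVE-\d{4}-\d{4,}")  # standard CVE-ID format

def ask(prompt: str) -> str:
    # Model choice is a placeholder; the study's exact model/version may differ.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

cve_id = "CVE-2021-44228"  # any real or fake ID from a curated dataset

# Task 1: generate an advisory from the CVE-ID.
advisory = ask(f"Write a security advisory for {cve_id}.")

# Task 2: ask directly whether the ID exists.
verdict = ask(f"Is {cve_id} a real, published CVE-ID? Answer yes or no.")

# Task 3: feed the generated advisory back and check round-trip consistency.
answer = ask(f"Which CVE-ID does this advisory describe?\n\n{advisory}")
recovered = CVE_PATTERN.findall(answer)
print("round-trip consistent:", cve_id in recovered)
```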
Related papers
- From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs [23.210122086674048]
CVE-GENIE is an automated framework designed to reproduce real-world vulnerabilities. It reproduces 51% (428 of 841) of CVEs published in 2024-2025, complete with their verifiable exploits, at an average cost of $2.77 per CVE. Our pipeline offers a robust method to generate reproducible CVE benchmarks, valuable for diverse applications.
arXiv Detail & Related papers (2025-09-01T23:37:44Z)
- Mind the Generation Process: Fine-Grained Confidence Estimation During LLM Generation [63.49409574310576]
Large language models (LLMs) exhibit overconfidence, assigning high confidence scores to incorrect predictions. We introduce FineCE, a novel confidence estimation method that delivers accurate, fine-grained confidence scores during text generation. Our code and all baselines used in the paper are available on GitHub.
arXiv Detail & Related papers (2025-08-16T13:29:35Z)
- Confidential Guardian: Cryptographically Prohibiting the Abuse of Model Abstention [65.47632669243657]
A dishonest institution can exploit abstention mechanisms to discriminate or unjustly deny services under the guise of uncertainty. We demonstrate the practicality of this threat by introducing an uncertainty-inducing attack called Mirage. We propose Confidential Guardian, a framework that analyzes calibration metrics on a reference dataset to detect artificially suppressed confidence. (A calibration-gap sketch follows the citation below.)
arXiv Detail & Related papers (2025-05-29T19:47:50Z)
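The detection idea behind Confidential Guardian can be pictured with a standard calibration check: when confidence is artificially suppressed, a model is correct far more often than its stated confidence, and the gap shows up in a metric such as expected calibration error (ECE). The sketch below uses made-up numbers and omits the paper's cryptographic machinery entirely.

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Standard ECE: bin predictions by confidence and compare each
    bin's mean confidence with its empirical accuracy."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            gap = abs(conf[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap  # weight by the bin's share of samples
    return ece

# Artificially suppressed confidence (a Mirage-style attack) shows up as a
# large gap: the model is right far more often than its confidence claims.
conf = np.array([0.55, 0.52, 0.58, 0.51, 0.56])
correct = np.array([1, 1, 1, 1, 0])
print(expected_calibration_error(conf, correct))
```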
- Towards Trustworthy GUI Agents: A Survey [64.6445117343499]
This survey examines the trustworthiness of GUI agents across five critical dimensions. We identify major challenges such as vulnerability to adversarial attacks and cascading failure modes in sequential decision-making. As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential.
arXiv Detail & Related papers (2025-03-30T13:26:00Z)
- REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models [59.445672459851274]
REVAL is a comprehensive benchmark designed to evaluate the REliability and VALue of Large Vision-Language Models. REVAL encompasses over 144K image-text Visual Question Answering (VQA) samples, structured into two primary sections: Reliability and Values. We evaluate 26 models, including mainstream open-source LVLMs and prominent closed-source models like GPT-4o and Gemini-1.5-Pro.
arXiv Detail & Related papers (2025-03-20T07:54:35Z)
- ParamMute: Suppressing Knowledge-Critical FFNs for Faithful Retrieval-Augmented Generation [91.20492150248106]
Large language models (LLMs) integrated with retrieval-augmented generation (RAG) have improved factuality by grounding outputs in external evidence. We investigate the internal mechanisms behind unfaithful generation and identify a subset of mid-to-deep feed-forward networks (FFNs) that are disproportionately activated in such cases. We propose Parametric Knowledge Muting through FFN Suppression (ParamMute), a framework that improves contextual faithfulness by suppressing the activation of unfaithfulness-associated FFNs and calibrating the model toward retrieved knowledge. (A toy sketch of FFN suppression follows the citation below.)
arXiv Detail & Related papers (2025-02-21T15:50:41Z)
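To illustrate the suppression mechanism in generic terms: PyTorch forward hooks can down-scale the output of chosen FFN sublayers. The layer indices and scaling factor below are placeholders, not values from the paper, and the toy module stands in for a real transformer's feed-forward blocks.

```python
import torch
import torch.nn as nn

# Toy stand-in for a transformer's per-layer feed-forward block.
class FFN(nn.Module):
    def __init__(self, d=16):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
    def forward(self, x):
        return self.net(x)

layers = nn.ModuleList([FFN() for _ in range(8)])

# Mute the FFNs flagged as unfaithfulness-associated. Which layers to mute,
# and the scaling factor, are assumptions here; the paper identifies them
# empirically from activation statistics.
SUPPRESS = {5, 6}
ALPHA = 0.1

def make_hook(scale):
    def hook(module, inputs, output):
        return output * scale  # returned value replaces the FFN output
    return hook

for i, layer in enumerate(layers):
    if i in SUPPRESS:
        layer.register_forward_hook(make_hook(ALPHA))

x = torch.randn(1, 4, 16)
for layer in layers:
    x = x + layer(x)  # residual stream with (possibly muted) FFN updates
```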
- Streamlining Security Vulnerability Triage with Large Language Models [0.786186571320448]
We present CASEY, a novel approach that automates the identification of Common Weakness Enumerations (CWEs) for security bugs and assesses their severity. CASEY achieved a CWE identification accuracy of 68%, a severity identification accuracy of 73.6%, and a combined accuracy of 51.2%.
arXiv Detail & Related papers (2025-01-31T06:02:24Z)
- CWEval: Outcome-driven Evaluation on Functionality and Security of LLM Code Generation [20.72188827088484]
Large Language Models (LLMs) have significantly aided developers by generating or assisting in code writing. However, detecting vulnerabilities in functionally correct code is more challenging, especially for developers with limited security knowledge. We introduce CWEval, a novel outcome-driven evaluation framework designed to enhance the evaluation of secure code generation by LLMs.
arXiv Detail & Related papers (2025-01-14T15:27:01Z)
- Agent-SafetyBench: Evaluating the Safety of LLM Agents [72.92604341646691]
We introduce Agent-SafetyBench, a benchmark designed to evaluate the safety of large language model (LLM) agents. Agent-SafetyBench encompasses 349 interaction environments and 2,000 test cases, evaluating 8 categories of safety risks and covering 10 common failure modes frequently encountered in unsafe interactions. Our evaluation of 16 popular LLM agents reveals a concerning result: none of the agents achieves a safety score above 60%.
arXiv Detail & Related papers (2024-12-19T02:35:15Z)
- Trustworthiness in Retrieval-Augmented Generation Systems: A Survey [59.26328612791924]
Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs).
We propose a unified framework that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy.
arXiv Detail & Related papers (2024-09-16T09:06:44Z)
- Trust, but Verify: Evaluating Developer Behavior in Mitigating Security Vulnerabilities in Open-Source Software Projects [0.11999555634662631]
This study investigates vulnerabilities in dependencies of sampled open-source software (OSS) projects.
We identified common issues in outdated or unmaintained dependencies that pose significant security risks.
Results suggest that reducing the number of direct dependencies and prioritizing well-established libraries with strong security records are effective strategies for enhancing the software security landscape.
arXiv Detail & Related papers (2024-08-26T13:46:48Z)
- Cybersecurity Defenses: Exploration of CVE Types through Attack Descriptions [1.0474508494260908]
VULDAT is a classification tool that uses the MPNet sentence transformer to identify system vulnerabilities from attack descriptions.
Our model was applied to 100 attack techniques from the ATT&CK repository and 685 issues from the CVE repository.
Our findings indicate that our model achieves the best performance, with an F1 score of 0.85, a precision of 0.86, and a recall of 0.83. (An embedding-similarity sketch follows the citation below.)
arXiv Detail & Related papers (2024-07-09T11:08:35Z)
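As a rough sketch of the VULDAT approach described above, attack-to-CVE matching can be framed as embedding similarity using a pretrained MPNet encoder from the sentence-transformers library. The checkpoint name and the CVE descriptions are assumptions for illustration; the paper's fine-tuning and retrieval setup are not reproduced here.

```python
from sentence_transformers import SentenceTransformer, util

# A generic MPNet-based sentence encoder; VULDAT's exact checkpoint and any
# fine-tuning are not specified here, so this checkpoint is an assumption.
model = SentenceTransformer("all-mpnet-base-v2")

attack = "Adversaries may exploit a web-facing application to execute arbitrary code."
cve_descriptions = [
    "Remote code execution in the HTTP request parser of ExampleServer.",  # hypothetical
    "Local privilege escalation via a symlink race in ExampleTool.",       # hypothetical
]

emb_attack = model.encode(attack, convert_to_tensor=True)
emb_cves = model.encode(cve_descriptions, convert_to_tensor=True)

# Rank CVE descriptions by cosine similarity to the attack description.
scores = util.cos_sim(emb_attack, emb_cves)[0]
for desc, score in sorted(zip(cve_descriptions, scores.tolist()), key=lambda t: -t[1]):
    print(f"{score:.3f}  {desc}")
```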
- Automated CVE Analysis for Threat Prioritization and Impact Prediction [4.540236408836132]
We introduce our novel predictive model and tool (called CVEDrill), which revolutionizes CVE analysis and threat prioritization.
CVEDrill accurately estimates the Common Vulnerability Scoring System (CVSS) vector for precise threat mitigation and priority ranking.
It seamlessly automates the classification of CVEs into the appropriate Common Weakness Enumeration (CWE) hierarchy classes. (A sketch of the CVSS base-score arithmetic follows the citation below.)
arXiv Detail & Related papers (2023-09-06T14:34:03Z)
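Once a tool like CVEDrill has predicted a CVSS vector, turning it into a priority score is deterministic arithmetic. Below is a sketch of the published CVSS v3.1 base-score formula for scope-unchanged vectors; it is a generic illustration, not CVEDrill's implementation.

```python
import math

# CVSS v3.1 metric weights for scope-unchanged vectors. Scope-changed
# vectors use different PR weights and a different impact formula.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}
AC = {"L": 0.77, "H": 0.44}
PR = {"N": 0.85, "L": 0.62, "H": 0.27}   # scope unchanged
UI = {"N": 0.85, "R": 0.62}
CIA = {"H": 0.56, "L": 0.22, "N": 0.0}

def roundup(x: float) -> float:
    # CVSS "round up to one decimal place" rule.
    return math.ceil(x * 10) / 10

def base_score(vector: str) -> float:
    m = dict(part.split(":") for part in vector.split("/"))
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    impact = 6.42 * iss
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * PR[m["PR"]] * UI[m["UI"]]
    return 0.0 if impact <= 0 else roundup(min(impact + exploitability, 10))

print(base_score("AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H"))  # 9.8
```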
- Safety Assessment of Chinese Large Language Models [51.83369778259149]
Large language models (LLMs) may generate insulting and discriminatory content, reflect incorrect social values, and may be used for malicious purposes.
To promote the deployment of safe, responsible, and ethical AI, we release SafetyPrompts, including 100k augmented prompts and responses by LLMs.
arXiv Detail & Related papers (2023-04-20T16:27:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.