Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
- URL: http://arxiv.org/abs/2502.11191v1
- Date: Sun, 16 Feb 2025 16:34:49 GMT
- Title: Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training
- Authors: Yao-Ching Yu, Tsun-Han Chiang, Cheng-Wei Tsai, Chien-Ming Huang, Wen-Kwang Tsao
- Abstract summary: Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. We present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation. Continual pre-training on our dataset yields a 15.88% improvement in the aggregate score, while reasoning distillation leads to a 10% gain in security certification (CISSP).
- Score: 1.5029560229270191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. However, in cybersecurity, we have noticed a lack of open-source datasets, with a particular lack of high-quality cybersecurity pretraining corpora, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pre-training on our dataset yields a 15.88% improvement in the aggregate score, while reasoning distillation leads to a 10% gain in security certification (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to https://huggingface.co/collections/trendmicro-ailab/primus-67b1fd27052b802b4af9d243.
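As a rough illustration of the continual-pretraining stage the abstract describes, the sketch below runs a causal-LM pass over a cybersecurity corpus with the Hugging Face stack. The dataset repo id and the base model are assumptions drawn from the linked collection, not confirmed names; substitute the actual ids before use.

```python
# Minimal sketch of continual pretraining on a cybersecurity corpus.
# "trendmicro-ailab/Primus-FineWeb" and the Llama base model are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-3.1-8B"  # assumed base model
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base)

ds = load_dataset("trendmicro-ailab/Primus-FineWeb", split="train")  # assumed repo id

def tokenize(batch):
    # Assumes a plain "text" column, as is typical for pretraining corpora.
    return tok(batch["text"], truncation=True, max_length=2048)

ds = ds.map(tokenize, batched=True, remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="primus-cpt", per_device_train_batch_size=1,
                           gradient_accumulation_steps=16, num_train_epochs=1, bf16=True),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),  # causal-LM objective
)
trainer.train()
```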
Related papers
- Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report [50.268821168513654]
We present Foundation-Sec-8B, a cybersecurity-focused large language model (LLM) built on the Llama 3.1 architecture.
We evaluate it across both established and new cybersecurity benchmarks, showing that it matches Llama 3.1-70B and GPT-4o-mini on certain cybersecurity-specific tasks.
By releasing our model to the public, we aim to accelerate progress and adoption of AI-driven tools in both public and private cybersecurity contexts.
arXiv Detail & Related papers (2025-04-28T08:41:12Z)
- Stealing Training Data from Large Language Models in Decentralized Training through Activation Inversion Attack [53.823990570014494]
Decentralized training has become a resource-efficient framework to democratize the training of large language models (LLMs).
This paper identifies a novel and realistic attack surface: privacy leakage of training data in decentralized training.
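To make the attack surface concrete, here is a hedged toy sketch (not the paper's actual method): a linear probe trained to map a model's intermediate activations back to input token ids, which is what an eavesdropper in a decentralized run who sees only activations would attempt.

```python
# Toy activation-inversion probe: recover input tokens from hidden states.
# Illustrative only; the paper's attack is more sophisticated.
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2").eval()

texts = ["the password is hunter2", "ssh root to the bastion host", "rotate the api key"]
acts, ids = [], []
with torch.no_grad():
    for t in texts:
        enc = tok(t, return_tensors="pt")
        out = model(**enc, output_hidden_states=True)
        acts.append(out.hidden_states[6][0])  # layer-6 activations, shape (seq, hidden)
        ids.append(enc["input_ids"][0])
X, y = torch.cat(acts), torch.cat(ids)

# Linear decoder from activation space back to the vocabulary.
probe = torch.nn.Linear(model.config.hidden_size, tok.vocab_size)
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
for _ in range(300):
    loss = torch.nn.functional.cross_entropy(probe(X), y)
    opt.zero_grad(); loss.backward(); opt.step()

print(tok.decode(probe(X).argmax(-1)))  # approximately reconstructs the inputs
```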
arXiv Detail & Related papers (2025-02-22T05:19:20Z)
- LLM-PBE: Assessing Data Privacy in Large Language Models [111.58198436835036]
Large Language Models (LLMs) have become integral to numerous domains, significantly advancing applications in data management, mining, and analysis.
Despite the critical nature of this issue, no existing literature offers a comprehensive assessment of data privacy risks in LLMs.
Our paper introduces LLM-PBE, a toolkit crafted specifically for the systematic evaluation of data privacy risks in LLMs.
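A minimal sketch of one probe such a toolkit might run, a verbatim-extraction test that checks whether a model regurgitates a suspected training string when given its prefix. The function name and prompt are illustrative, not LLM-PBE's actual API.

```python
# Toy training-data extraction probe: does the model complete a suspected
# memorized string verbatim? Not LLM-PBE's actual API.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def extraction_probe(prefix: str, true_suffix: str, n_tokens: int = 20) -> bool:
    ids = tok(prefix, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=n_tokens, do_sample=False,
                             pad_token_id=tok.eos_token_id)
    completion = tok.decode(out[0, ids.shape[1]:])
    return true_suffix.strip() in completion

# GPT-2 tends to complete famous passages verbatim, a sign of memorization.
print(extraction_probe("To be, or not to be, that is", "the question"))
```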
arXiv Detail & Related papers (2024-08-23T01:37:29Z)
- CyberPal.AI: Empowering LLMs with Expert-Driven Cybersecurity Instructions [0.2999888908665658]
Large Language Models (LLMs) have significantly advanced natural language processing (NLP), providing versatile capabilities across a wide range of applications.
However, their application to complex, domain-specific tasks, such as cybersecurity, often faces substantial challenges.
In this study, we introduce SecKnowledge and CyberPal.AI to address these challenges and train security-expert LLMs.
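To make the training stage concrete, here is a hedged sketch of how an expert-written security instruction might be serialized for supervised fine-tuning, masking the loss to the response tokens. The prompt template is illustrative, not SecKnowledge's actual schema.

```python
# Illustrative instruction-tuning example; the template is an assumption.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

example = {
    "instruction": "Explain what CVE-2021-44228 (Log4Shell) allows an attacker to do.",
    "response": "It allows unauthenticated remote code execution via crafted JNDI lookups.",
}

prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
full = prompt + example["response"] + tok.eos_token

ids = tok(full, return_tensors="pt").input_ids[0]
labels = ids.clone()
labels[: len(tok(prompt).input_ids)] = -100  # compute loss only on the response
```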
arXiv Detail & Related papers (2024-08-17T22:37:39Z)
- Large Language Models for Cyber Security: A Systematic Literature Review [14.924782327303765]
We conduct a comprehensive review of the literature on the application of Large Language Models in cybersecurity (LLM4Security).
We observe that LLMs are being applied to a wide range of cybersecurity tasks, including vulnerability detection, malware analysis, network intrusion detection, and phishing detection.
We also identify several promising techniques for adapting LLMs to specific cybersecurity domains, such as fine-tuning, transfer learning, and domain-specific pre-training.
arXiv Detail & Related papers (2024-05-08T02:09:17Z)
- SafetyPrompts: a Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety [27.843894102000608]
We conduct a first systematic review of open datasets for evaluating and improving large language model (LLM) safety.
We highlight trends, such as a shift towards fully synthetic datasets, as well as gaps in coverage, such as a clear lack of non-English and naturalistic datasets.
Our contributions are based on SafetyPrompts.com, a living catalogue of open datasets for LLM safety.
arXiv Detail & Related papers (2024-04-08T10:57:25Z)
- The WMDP Benchmark: Measuring and Reducing Malicious Use With Unlearning [87.1610740406279]
The White House Executive Order on Artificial Intelligence highlights the risks of large language models (LLMs) empowering malicious actors in developing biological, cyber, and chemical weapons.
Current evaluations are private, preventing further research into mitigating risk.
We publicly release the Weapons of Mass Destruction Proxy (WMDP) benchmark, a dataset of 3,668 multiple-choice questions.
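A minimal sketch of how such multiple-choice questions are typically scored with a causal LM, by comparing the log-likelihood of each answer option. The question and choices below are stand-ins, not actual WMDP items.

```python
# Score a WMDP-style multiple-choice question by answer log-likelihood.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

question = "Which protocol does an ARP spoofing attack abuse?"
choices = [" ARP", " DNS", " BGP", " SMTP"]  # leading spaces keep BPE prefixes aligned

def choice_logprob(q: str, c: str) -> float:
    prompt = f"Question: {q}\nAnswer:"
    p_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full = tok(prompt + c, return_tensors="pt").input_ids
    with torch.no_grad():
        logp = model(full).logits.log_softmax(-1)
    # Sum log-probs of the answer tokens, each predicted from the previous position.
    return sum(logp[0, pos - 1, full[0, pos]].item()
               for pos in range(p_len, full.shape[1]))

print(max(choices, key=lambda c: choice_logprob(question, c)).strip())
```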
arXiv Detail & Related papers (2024-03-05T18:59:35Z)
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN).
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling LLMs to reach human-level performance without the need for expert opponents.
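A hedged sketch of the self-play objective at the heart of this idea: the current model is trained to assign higher likelihood to human-written data than to responses generated by its own previous iteration, measured against that frozen previous model via a DPO-style logistic loss. Tensor names are illustrative, not the authors' code.

```python
# Core of a SPIN-style self-play loss; per-sequence log-probs are assumed
# to be computed elsewhere. Illustrative, not the authors' exact code.
import torch
import torch.nn.functional as F

def spin_loss(logp_real_cur, logp_synth_cur, logp_real_ref, logp_synth_ref, beta=0.1):
    # Prefer human data over the previous iteration's own generations,
    # measured relative to the frozen previous-iteration model.
    margin = beta * ((logp_real_cur - logp_real_ref) - (logp_synth_cur - logp_synth_ref))
    return -F.logsigmoid(margin).mean()

# Toy usage with random per-sequence log-probs for a batch of 4.
lp = torch.randn(4)
print(spin_loss(lp + 1.0, lp - 1.0, lp, lp))  # small loss: real already preferred
```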
arXiv Detail & Related papers (2024-01-02T18:53:13Z)
- A Survey of Data Security: Practices from Cybersecurity and Challenges of Machine Learning [6.086388464254366]
Machine learning (ML) is increasingly being deployed in critical systems.
Because ML depends on data, securing the data used to train and test ML-enabled systems is of utmost importance.
The data science and cybersecurity domains each adhere to their own sets of skills and terminologies.
arXiv Detail & Related papers (2023-10-06T18:15:35Z)
- Privacy Side Channels in Machine Learning Systems [87.53240071195168]
We introduce privacy side channels: attacks that exploit system-level components to extract private information.
For example, we show that deduplicating training data before applying differentially-private training creates a side-channel that completely invalidates any provable privacy guarantees.
We further show that systems which block language models from regenerating training data can be exploited to exfiltrate private keys contained in the training set.
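The second exploit is easy to see in miniature: if a deployment filter refuses any output that appears verbatim in the training set, the refusal itself becomes a membership oracle. A toy sketch, with all names hypothetical:

```python
# Toy model of the "regeneration blocking" side channel described above.
TRAINING_SET = {"alice@example.com", "sk-live-12345"}  # hidden from the attacker

def guarded_output(candidate: str) -> str:
    # Deployment defense: block verbatim training-data regurgitation.
    return "[blocked]" if candidate in TRAINING_SET else candidate

def membership_oracle(candidate: str) -> bool:
    # The block signal leaks exactly what the defense tries to protect.
    return guarded_output(candidate) == "[blocked]"

for guess in ["alice@example.com", "bob@example.com"]:
    print(guess, "-> in training set:", membership_oracle(guess))
```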
arXiv Detail & Related papers (2023-09-11T16:49:05Z)
- Privately Fine-Tuning Large Language Models with Differential Privacy [10.485556506301549]
Pre-trained Large Language Models (LLMs) are an integral part of modern AI and have led to breakthrough performance on complex AI tasks.
Differential privacy (DP) provides a rigorous framework for adding noise during the training or fine-tuning of LLMs.
We present EW-Tune, a DP framework for fine-tuning LLMs based on the Edgeworth accountant, with finite-sample privacy guarantees.
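For context, the mechanism that any DP fine-tuning framework builds on looks roughly like the generic DP-SGD step below: clip each example's gradient, average, and add calibrated Gaussian noise. This is a sketch of the standard mechanism, not EW-Tune's implementation; the Edgeworth accountant, which tracks the resulting privacy budget, is not reproduced here.

```python
# Generic DP-SGD update: per-example clipping plus Gaussian noise.
import torch

def dp_sgd_step(model, loss_fn, batch, lr=1e-3, clip=1.0, noise_mult=1.0):
    summed = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in batch:  # per-example gradients
        model.zero_grad()
        loss_fn(model(x), y).backward()
        norm = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))
        scale = min(1.0, (clip / (norm + 1e-12)).item())  # clip to norm <= clip
        for s, p in zip(summed, model.parameters()):
            s += p.grad * scale
    with torch.no_grad():
        for s, p in zip(summed, model.parameters()):
            noisy = (s + torch.randn_like(s) * noise_mult * clip) / len(batch)
            p -= lr * noisy  # privacy cost tracked by an accountant (e.g. Edgeworth)
```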
arXiv Detail & Related papers (2022-10-26T21:18:31Z)
- IELM: An Open Information Extraction Benchmark for Pre-Trained Language Models [75.48081086368606]
We introduce a new open information extraction (OIE) benchmark for pre-trained language models (LMs).
We create an OIE benchmark aiming to fully examine the open relational information present in pre-trained LMs.
Surprisingly, pre-trained LMs are able to obtain competitive performance on standard OIE datasets.
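A minimal sketch of the kind of probe involved, querying a pre-trained masked LM for the tail of an open relational triple. This is a generic cloze probe, not the paper's exact extraction procedure.

```python
# Cloze-style probe for relational knowledge in a pre-trained LM.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
# Probe the triple (Barack Obama, born-in, ?) as a masked sentence.
for pred in fill("Barack Obama was born in [MASK].", top_k=3):
    print(pred["token_str"], round(pred["score"], 3))
```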
arXiv Detail & Related papers (2022-10-25T16:25:00Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.