CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain
- URL: http://arxiv.org/abs/2212.02974v1
- Date: Tue, 6 Dec 2022 13:49:12 GMT
- Title: CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain
- Authors: Markus Bayer, Philipp Kuehn, Ramin Shanehsaz, Christian Reuter
- Abstract summary: We present a language model specifically tailored to the cybersecurity domain.
The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks.
We show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The field of cybersecurity is evolving fast. Experts need to be informed
about past, current and - in the best case - upcoming threats, because attacks
are becoming more advanced, targets bigger and systems more complex. As this
cannot be addressed manually, cybersecurity experts need to rely on machine
learning techniques. In the textual domain, pre-trained language models like
BERT have been shown to be helpful by providing a good baseline for further
fine-tuning. However, due to the domain-specific knowledge and the many technical
terms in cybersecurity, general language models might miss the gist of textual
information, hence doing more harm than good. For this reason, we create a
high-quality dataset and present a language model specifically tailored to the
cybersecurity domain, which can serve as a basic building block for
cybersecurity systems that deal with natural language. The model is compared
with other models based on 15 different domain-dependent extrinsic and
intrinsic tasks as well as general tasks from the SuperGLUE benchmark. On the
one hand, the results of the intrinsic tasks show that our model improves the
internal representation space of words compared to the other models. On the
other hand, the extrinsic, domain-dependent tasks, consisting of sequence
tagging and classification, show that the model is best in specific application
scenarios, in contrast to the others. Furthermore, we show that our approach
against catastrophic forgetting works, as the model is able to retrieve the
previously trained domain-independent knowledge. The dataset used and the trained
model are made publicly available.
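As a rough illustration of the kind of domain adaptation described in the abstract, the sketch below continues masked-language-model pretraining of a general BERT checkpoint on a cybersecurity text corpus using the Hugging Face transformers library. The corpus file, starting checkpoint, and hyperparameters are illustrative placeholders, not the settings reported in the paper.

```python
# Minimal sketch of domain-adaptive (continued) masked-language-model pretraining.
# All names below (corpus file, output directory, hyperparameters) are
# illustrative placeholders, not the paper's actual settings.
import torch
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)


class CyberCorpus(torch.utils.data.Dataset):
    """One cybersecurity document (or paragraph) per line of a plain-text file."""

    def __init__(self, path, tokenizer, max_length=128):
        with open(path, encoding="utf-8") as f:
            lines = [line.strip() for line in f if line.strip()]
        self.encodings = tokenizer(lines, truncation=True, max_length=max_length)

    def __len__(self):
        return len(self.encodings["input_ids"])

    def __getitem__(self, idx):
        return {key: values[idx] for key, values in self.encodings.items()}


tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

train_set = CyberCorpus("cybersecurity_corpus.txt", tokenizer)  # placeholder path
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(output_dir="cysecbert-like",
                         num_train_epochs=3,
                         per_device_train_batch_size=16,
                         learning_rate=5e-5)

Trainer(model=model, args=args, train_dataset=train_set,
        data_collator=collator).train()
```

The resulting checkpoint could then be fine-tuned for the kinds of extrinsic tasks mentioned above, such as sequence tagging and classification.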
Related papers
- SentinelLMs: Encrypted Input Adaptation and Fine-tuning of Language Models for Private and Secure Inference [6.0189674528771]
This paper addresses the privacy and security concerns associated with deep neural language models, which serve as crucial components in various modern AI-based applications.
We propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text.
arXiv Detail & Related papers (2023-12-28T19:55:11Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Grounded Decoding: Guiding Text Generation with Grounded Models for Embodied Agents [111.15288256221764]
The Grounded Decoding project aims to solve complex, long-horizon tasks in a robotic setting by leveraging the knowledge of both language models and grounded models.
We frame this as a problem similar to probabilistic filtering: decode a sequence that has high probability both under the language model and under a set of grounded model objectives.
We demonstrate how such grounded models can be obtained across three simulation and real-world domains, and that the proposed decoding strategy is able to solve such tasks by combining the knowledge of both models.
arXiv Detail & Related papers (2023-03-01T22:58:50Z)
- Exploring the Limits of Transfer Learning with Unified Model in the Cybersecurity Domain [17.225973170682604]
We introduce a generative multi-task model, Unified Text-to-Text Cybersecurity (UTS).
UTS is trained on malware reports, phishing site URLs, programming code constructs, social media data, blogs, news articles, and public forum posts.
We show that UTS improves performance on some cybersecurity datasets.
arXiv Detail & Related papers (2023-02-20T22:21:26Z)
- Language Models are General-Purpose Interfaces [109.45478241369655]
We propose to use language models as a general-purpose interface to various foundation models.
A collection of pretrained encoders perceives diverse modalities (such as vision and language).
We propose a semi-causal language modeling objective to jointly pretrain the interface and the modular encoders.
arXiv Detail & Related papers (2022-06-13T17:34:22Z)
- What Language Model Architecture and Pretraining Objective Work Best for Zero-Shot Generalization? [50.84738303888189]
We present a large-scale evaluation of modeling choices and their impact on zero-shot generalization.
We train models with over 5 billion parameters for more than 170 billion tokens.
We find that pretrained causal decoder models can be efficiently adapted into non-causal decoder models.
arXiv Detail & Related papers (2022-04-12T14:19:49Z)
- Language Model for Text Analytic in Cybersecurity [6.93939291118954]
Language models are crucial in text analytics and NLP.
In this paper, we propose a cybersecurity language model called SecureBERT.
SecureBERT is able to capture the text connotations in the cybersecurity domain.
arXiv Detail & Related papers (2022-04-06T09:17:21Z)
- Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains [45.07506437436464]
We present a general approach to developing small, fast and effective pre-trained models for specific domains.
This is achieved by adapting off-the-shelf general pre-trained models and performing task-agnostic knowledge distillation in target domains (a schematic distillation sketch follows after this list).
arXiv Detail & Related papers (2021-06-25T07:37:05Z)
- Unsupervised Paraphrasing with Pretrained Language Models [85.03373221588707]
We propose a training pipeline that enables pre-trained language models to generate high-quality paraphrases in an unsupervised setting.
Our recipe consists of task-adaptation, self-supervision, and a novel decoding algorithm named Dynamic Blocking.
We show with automatic and human evaluations that our approach achieves state-of-the-art performance on both the Quora Question Pair and the ParaNMT datasets.
arXiv Detail & Related papers (2020-10-24T11:55:28Z)
- InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)
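As referenced in the Adapt-and-Distill entry above, the following sketch shows a generic task-agnostic distillation step in which a small student is trained to match a (domain-adapted) teacher's masked-language-model distribution on in-domain text. The teacher checkpoint, student configuration, temperature, and example sentence are assumptions for illustration, not values from that paper.

```python
# Schematic task-agnostic distillation step: the student mimics the teacher's
# token distribution on in-domain text. Checkpoints, student size, and the
# temperature are illustrative assumptions, not the Adapt-and-Distill settings.
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer, BertConfig, BertForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
teacher = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")  # stand-in for a domain-adapted teacher
teacher.eval()

# Small, randomly initialised student sharing the teacher's vocabulary.
student = BertForMaskedLM(BertConfig(vocab_size=teacher.config.vocab_size,
                                     num_hidden_layers=4,
                                     hidden_size=312,
                                     num_attention_heads=12,
                                     intermediate_size=1200))


def distillation_loss(batch, temperature=2.0):
    """KL divergence between softened teacher and student token distributions."""
    with torch.no_grad():
        teacher_logits = teacher(**batch).logits
    student_logits = student(**batch).logits
    return F.kl_div(F.log_softmax(student_logits / temperature, dim=-1),
                    F.softmax(teacher_logits / temperature, dim=-1),
                    reduction="batchmean") * temperature ** 2


batch = tokenizer(["A phishing campaign delivered a malicious macro via e-mail."],
                  return_tensors="pt")
loss = distillation_loss(batch)
loss.backward()  # in a real loop, an optimizer step over the student follows
```

In a full pipeline, this step would iterate over the whole domain corpus, with only the student's parameters being updated.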