SecEncoder: Logs are All You Need in Security
- URL: http://arxiv.org/abs/2411.07528v1
- Date: Tue, 12 Nov 2024 03:56:07 GMT
- Title: SecEncoder: Logs are All You Need in Security
- Authors: Muhammed Fatih Bulut, Yingqi Liu, Naveed Ahmad, Maximilian Turner, Sami Ait Ouahmane, Cameron Andrews, Lloyd Greenwald,
- Abstract summary: This paper introduces SecEncoder, a specialized small language model that is pretrained using security logs.
Experimental results indicate that SecEncoder outperforms other LMs, such as BERT-large, DeBERTa-v3-large, and OpenAI's Embedding (text-embedding-ada-002) models.
- Score: 8.591459170396698
- Abstract: Large and Small Language Models (LMs) are typically pretrained using extensive volumes of text, which are sourced from publicly accessible platforms such as Wikipedia, Book Corpus, or through web scraping. These models, due to their exposure to a wide range of language data, exhibit impressive generalization capabilities and can perform a multitude of tasks simultaneously. However, they often fall short when it comes to domain-specific tasks due to their broad training data. This paper introduces SecEncoder, a specialized small language model that is pretrained using security logs. SecEncoder is designed to address the domain-specific limitations of general LMs by focusing on the unique language and patterns found in security logs. Experimental results indicate that SecEncoder outperforms other LMs, such as BERT-large, DeBERTa-v3-large, and OpenAI's Embedding (text-embedding-ada-002) models, which are pretrained mainly on natural language, across various tasks. Furthermore, although SecEncoder is primarily pretrained on log data, it outperforms models pretrained on natural language for a range of tasks beyond log analysis, such as incident prioritization and threat intelligence document retrieval. This suggests that domain-specific pretraining with logs can significantly enhance the performance of LMs in security. These findings pave the way for future research into security-specific LMs and their potential applications.
Related papers
- Studying and Benchmarking Large Language Models For Log Level Suggestion [49.176736212364496]
Large Language Models (LLMs) have become a focal point of research across various domains.
This paper investigates the impact of characteristics and learning paradigms on the performance of 12 open-source LLMs in log level suggestion.
arXiv Detail & Related papers (2024-10-11T03:52:17Z)
- Inside the Black Box: Detecting Data Leakage in Pre-trained Language Encoders [68.00224057755773]
We focus on the membership leakage of pre-training data exposed through downstream models adapted from pre-trained language encoders.
Our evaluations reveal, for the first time, the existence of membership leakage even when only the black-box output of the downstream model is exposed.
arXiv Detail & Related papers (2024-08-20T17:55:15Z)
- Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts.
Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models.
The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
- Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models [36.58320580210008]
We show that certain special characters or their combinations with English letters are stronger memory triggers, leading to more severe data leakage.
We propose a simple but effective Special Characters Attack (SCA) to induce training data leakage.
arXiv Detail & Related papers (2024-05-09T02:35:32Z)
- Traces of Memorisation in Large Language Models for Code [16.125924759649106]
Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet.
We compare the rate of memorisation with large language models trained on natural language.
We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts.
arXiv Detail & Related papers (2023-12-18T19:12:58Z)
- PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind).
Our work offers a theoretical analysis for model design and benchmarks various techniques.
In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
- Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types.
Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
- Language Model for Text Analytic in Cybersecurity [6.93939291118954]
Language models are crucial in text analytics and NLP.
In this paper, we propose a cybersecurity language model called SecureBERT.
SecureBERT is able to capture the text connotations in the cybersecurity domain.
arXiv Detail & Related papers (2022-04-06T09:17:21Z)
- Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
- Style Attuned Pre-training and Parameter Efficient Fine-tuning for Spoken Language Understanding [19.105304214638075]
We introduce a novel framework for learning spoken language understanding.
The framework consists of a conversational language modeling (CLM) pre-training task and a light encoder architecture.
With the framework, we match the performance of state-of-the-art SLU results on Alexa internal datasets and on two public ones, adding only 4.4% parameters per task.
arXiv Detail & Related papers (2020-10-09T03:53:37Z)