Related papers: SecEncoder: Logs are All You Need in Security

URL: http://arxiv.org/abs/2411.07528v1
Date: Tue, 12 Nov 2024 03:56:07 GMT
Title: SecEncoder: Logs are All You Need in Security
Authors: Muhammed Fatih Bulut, Yingqi Liu, Naveed Ahmad, Maximilian Turner, Sami Ait Ouahmane, Cameron Andrews, Lloyd Greenwald,
Abstract summary: This paper introduces SecEncoder, a specialized small language model that is pretrained using security logs. Experimental results indicate that SecEncoder outperforms other LMs, such as BERTlarge, DeBERTa-v3-large and OpenAI's Embedding (textembedding-ada-002) models.
Score: 8.591459170396698
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large and Small Language Models (LMs) are typically pretrained using extensive volumes of text, which are sourced from publicly accessible platforms such as Wikipedia, Book Corpus, or through web scraping. These models, due to their exposure to a wide range of language data, exhibit impressive generalization capabilities and can perform a multitude of tasks simultaneously. However, they often fall short when it comes to domain-specific tasks due to their broad training data. This paper introduces SecEncoder, a specialized small language model that is pretrained using security logs. SecEncoder is designed to address the domain-specific limitations of general LMs by focusing on the unique language and patterns found in security logs. Experimental results indicate that SecEncoder outperforms other LMs, such as BERTlarge, DeBERTa-v3-large and OpenAI's Embedding (textembedding-ada-002) models, which are pretrained mainly on natural language, across various tasks. Furthermore, although SecEncoder is primarily pretrained on log data, it outperforms models pretrained on natural language for a range of tasks beyond log analysis, such as incident prioritization and threat intelligence document retrieval. This suggests that domain-specific pretraining with logs can significantly enhance the performance of LMs in security. These findings pave the way for future research into security-specific LMs and their potential applications.

Related papers

Detecting Hard-Coded Credentials in Software Repositories via LLMs [0.0]
Software developers frequently hard-code credentials such as passwords, generic secrets, private keys, and generic tokens in software repositories. These credentials create attack surfaces exploitable by a potential adversary to conduct malicious exploits such as backdoor attacks. Recent detection efforts utilize embedding models to vectorize textual credentials before passing them to classifiers for predictions. Our model outperforms the current state-of-the-art by 13% in F1 measure on the benchmark dataset.
arXiv Detail & Related papers (2025-06-16T04:33:48Z)
Enhancing Code Generation for Low-Resource Languages: No Silver Bullet [55.39571645315926]
Large Language Models (LLMs) rely on large and diverse datasets to learn syntax, semantics, and usage patterns of programming languages. For low-resource languages, the limited availability of such data hampers the models' ability to generalize effectively. We present an empirical study investigating the effectiveness of several approaches for boosting LLMs' performance on low-resource languages.
arXiv Detail & Related papers (2025-01-31T12:23:28Z)
Studying and Benchmarking Large Language Models For Log Level Suggestion [49.176736212364496]
Large Language Models (LLMs) have become a focal point of research across various domains. This paper investigates the impact of characteristics and learning paradigms on the performance of 12 open-source LLMs in log level suggestion.
arXiv Detail & Related papers (2024-10-11T03:52:17Z)
Inside the Black Box: Detecting Data Leakage in Pre-trained Language Encoders [68.00224057755773]
We focus on the membership leakage of pre-training data exposed through downstream models adapted from pre-trained language encoders. Our evaluations reveal, for the first time, the existence of membership leakage even when only the black-box output of the downstream model is exposed.
arXiv Detail & Related papers (2024-08-20T17:55:15Z)
Prompting Encoder Models for Zero-Shot Classification: A Cross-Domain Study in Italian [75.94354349994576]
This paper explores the feasibility of employing smaller, domain-specific encoder LMs alongside prompting techniques to enhance performance in specialized contexts. Our study concentrates on the Italian bureaucratic and legal language, experimenting with both general-purpose and further pre-trained encoder-only models. The results indicate that while further pre-trained models may show diminished robustness in general knowledge, they exhibit superior adaptability for domain-specific tasks, even in a zero-shot setting.
arXiv Detail & Related papers (2024-07-30T08:50:16Z)
Special Characters Attack: Toward Scalable Training Data Extraction From Large Language Models [36.58320580210008]
We show that certain special characters or their combinations with English letters are stronger memory triggers, leading to more severe data leakage. We propose a simple but effective Special Characters Attack (SCA) to induce training data leakage.
arXiv Detail & Related papers (2024-05-09T02:35:32Z)
Traces of Memorisation in Large Language Models for Code [16.125924759649106]
Large language models for code are commonly trained on large unsanitised corpora of source code scraped from the internet. We compare the rate of memorisation with large language models trained on natural language. We find that large language models for code are vulnerable to data extraction attacks, like their natural language counterparts.
arXiv Detail & Related papers (2023-12-18T19:12:58Z)
PrivacyMind: Large Language Models Can Be Contextual Privacy Protection Learners [81.571305826793]
We introduce Contextual Privacy Protection Language Models (PrivacyMind). Our work offers a theoretical analysis for model design and benchmarks various techniques. In particular, instruction tuning with both positive and negative examples stands out as a promising method.
arXiv Detail & Related papers (2023-10-03T22:37:01Z)
Language Model Pre-Training with Sparse Latent Typing [66.75786739499604]
We propose a new pre-training objective, Sparse Latent Typing, which enables the model to sparsely extract sentence-level keywords with diverse latent types. Experimental results show that our model is able to learn interpretable latent type categories in a self-supervised manner without using any external knowledge.
arXiv Detail & Related papers (2022-10-23T00:37:08Z)
Language Model for Text Analytic in Cybersecurity [6.93939291118954]
Language models are crucial in text analytics and NLP. In this paper, we propose a cybersecurity language model called SecureBERT. SecureBERT is able to capture the text connotations in the cybersecurity domain.
arXiv Detail & Related papers (2022-04-06T09:17:21Z)
Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings. We demonstrate that this framework enables effective generalization across different environments. For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)
Style Attuned Pre-training and Parameter Efficient Fine-tuning for Spoken Language Understanding [19.105304214638075]
We introduce a novel framework for learning spoken language understanding. The framework consists of a conversational language modeling (CLM) pre-training task and a light encoder architecture. With the framework, we match the performance of state-of-the-art SLU results on Alexa internal datasets and on two public ones, adding only 4.4% parameters per task.
arXiv Detail & Related papers (2020-10-09T03:53:37Z)

This list is automatically generated from the titles and abstracts of the papers on this site.