Language Model for Text Analytic in Cybersecurity
- URL: http://arxiv.org/abs/2204.02685v1
- Date: Wed, 6 Apr 2022 09:17:21 GMT
- Title: Language Model for Text Analytic in Cybersecurity
- Authors: Ehsan Aghaei, Xi Niu, Waseem Shadid, Ehab Al-Shaer
- Abstract summary: Language models are crucial in text analytics and NLP.
In this paper, we propose a cybersecurity language model called SecureBERT.
SecureBERT is able to capture the text connotations in the cybersecurity domain.
- Score: 6.93939291118954
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: NLP is a form of artificial intelligence and machine learning concerned with
a computer or machine's ability to understand and interpret human language.
Language models are crucial in text analytics and NLP since they allow
computers to interpret qualitative input and convert it to quantitative data
that they can use in other tasks. In essence, in the context of transfer
learning, language models are typically trained on a large generic corpus,
referred to as the pre-training stage, and then fine-tuned to a specific
underlying task. As a result, pre-trained language models are mostly used as a
baseline model that incorporates a broad grasp of the context and may be
further customized to be used in a new NLP task.
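As an illustrative sketch of this pre-training and fine-tuning workflow (not code from the paper), the snippet below loads a generically pre-trained encoder and runs one fine-tuning step on a downstream classification task; the checkpoint name, labels, and example sentences are assumptions chosen only for demonstration.
```python
# Illustrative sketch of the generic pre-train -> fine-tune workflow described
# above (not the authors' code). Checkpoint, labels, and sentences are
# hypothetical choices for demonstration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1) Start from a model pre-trained on a large generic corpus.
checkpoint = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# 2) Fine-tune on a specific downstream task; one toy gradient step is shown,
#    a real setup would iterate over a labeled dataset.
texts = ["The attacker exploited a buffer overflow.", "The weather is nice today."]
labels = torch.tensor([1, 0])  # hypothetical labels: 1 = security-related
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
model.train()
loss = model(**batch, labels=labels).loss  # task loss on top of the pre-trained encoder
loss.backward()
optimizer.step()
```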
The majority of pre-trained models are trained on corpora from general
domains, such as Twitter, newswire, Wikipedia, and the Web. Such off-the-shelf NLP
models trained on general text may be inefficient and inaccurate in specialized
fields. In this paper, we propose a cybersecurity language model called
SecureBERT, which is able to capture the text connotations in the cybersecurity
domain, and therefore could further be used in automation for many important
cybersecurity tasks that would otherwise rely on human expertise and tedious
manual efforts. SecureBERT is trained on a large corpus of cybersecurity text
collected and preprocessed by us from a variety of sources in cybersecurity and
the general computing domain. Using our proposed methods for tokenization and
model weight adjustment, SecureBERT is not only able to preserve the
understanding of general English as most pre-trained language models can do,
but also effective when applied to text that has cybersecurity implications.
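For illustration only (not from the paper), the sketch below shows how such a domain-adapted masked language model could be probed for cybersecurity connotations with a fill-mask query; the checkpoint identifier "ehsanaghaei/SecureBERT" is an assumption about where the released model is hosted, so substitute the actual identifier if it differs.
```python
# Minimal fill-mask probe of a cybersecurity-adapted masked language model.
# The model identifier below is an assumption for illustration.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="ehsanaghaei/SecureBERT")

# RoBERTa-style models use the <mask> token.
for prediction in fill_mask("The attacker gained remote code <mask> on the server."):
    print(f"{prediction['token_str']!r}  score={prediction['score']:.3f}")
```
On security-flavored prompts like this one, a general-domain model would be expected to rank generic completions higher, which is the gap that domain-specific pre-training is meant to close.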
Related papers
- SecEncoder: Logs are All You Need in Security [8.591459170396698]
This paper introduces SecEncoder, a specialized small language model that is pretrained using security logs.
Experimental results indicate that SecEncoder outperforms other LMs, such as DeBERTa-v3-large and OpenAI's text-embedding-ada models.
arXiv Detail & Related papers (2024-11-12T03:56:07Z) - SentinelLMs: Encrypted Input Adaptation and Fine-tuning of Language Models for Private and Secure Inference [6.0189674528771]
This paper addresses the privacy and security concerns associated with deep neural language models.
Deep neural language models serve as crucial components in various modern AI-based applications.
We propose a novel method to adapt and fine-tune transformer-based language models on passkey-encrypted user-specific text.
arXiv Detail & Related papers (2023-12-28T19:55:11Z) - Combatting Human Trafficking in the Cyberspace: A Natural Language Processing-Based Methodology to Analyze the Language in Online Advertisements [55.2480439325792]
This project tackles the pressing issue of human trafficking in online C2C marketplaces through advanced Natural Language Processing (NLP) techniques.
We introduce a novel methodology for generating pseudo-labeled datasets with minimal supervision, serving as a rich resource for training state-of-the-art NLP models.
A key contribution is the implementation of an interpretability framework using Integrated Gradients, providing explainable insights crucial for law enforcement.
arXiv Detail & Related papers (2023-11-22T02:45:01Z) - Adapting Large Language Models to Domains via Reading Comprehension [86.24451681746676]
We explore how continued pre-training on domain-specific corpora influences large language models.
We show that training on the raw corpora endows the model with domain knowledge, but drastically hurts its ability to answer questions.
We propose a simple method for transforming raw corpora into reading comprehension texts.
arXiv Detail & Related papers (2023-09-18T07:17:52Z) - Unsupervised Sentiment Analysis of Plastic Surgery Social Media Posts [91.3755431537592]
The massive collection of user posts across social media platforms is primarily untapped for artificial intelligence (AI) use cases.
Natural language processing (NLP) is a subfield of AI that leverages bodies of documents, known as corpora, to train computers in human-like language understanding.
This study demonstrates that unsupervised analysis allows a computer to predict negative, positive, or neutral user sentiment towards plastic surgery.
arXiv Detail & Related papers (2023-07-05T20:16:20Z) - Pre-Training to Learn in Context [138.0745138788142]
The ability of in-context learning is not fully exploited because language models are not explicitly trained to learn in context.
We propose PICL (Pre-training for In-Context Learning), a framework to enhance the language models' in-context learning ability.
Our experiments show that PICL is more effective and task-generalizable than a range of baselines, outperforming larger language models with nearly 4x as many parameters.
arXiv Detail & Related papers (2023-05-16T03:38:06Z) - Foundation Models for Natural Language Processing -- Pre-trained Language Models Integrating Media [0.0]
Foundation Models are pre-trained language models for Natural Language Processing.
They can be applied to a wide range of different media and problem domains, ranging from image and video processing to robot control learning.
This book provides a comprehensive overview of the state of the art in research and applications of Foundation Models.
arXiv Detail & Related papers (2023-02-16T20:42:04Z) - CySecBERT: A Domain-Adapted Language Model for the Cybersecurity Domain [0.0]
We present a language model specifically tailored to the cybersecurity domain.
The model is compared with other models based on 15 different domain-dependent extrinsic and intrinsic tasks.
We show that our approach against catastrophic forgetting works, as the model is able to retrieve the previously trained domain-independent knowledge.
arXiv Detail & Related papers (2022-12-06T13:49:12Z) - What do Large Language Models Learn beyond Language? [10.9650651784511]
We find that pretrained models significantly outperform comparable non-pretrained neural models.
Experiments surprisingly reveal that the positive effects of pre-training persist even when pretraining on multi-lingual text or computer code.
Our findings suggest a hitherto unexplored deep connection between pre-training and inductive learning abilities of language models.
arXiv Detail & Related papers (2022-10-21T23:43:13Z) - Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings.
We demonstrate that this framework enables effective generalization across different environments.
For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z) - InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective [84.78604733927887]
Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks.
Recent studies show that such BERT-based models are vulnerable to textual adversarial attacks.
We propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models.
arXiv Detail & Related papers (2020-10-05T20:49:26Z)