CPE-Identifier: Automated CPE identification and CVE summaries annotation with Deep Learning and NLP
- URL: http://arxiv.org/abs/2405.13568v1
- Date: Wed, 22 May 2024 12:05:17 GMT
- Title: CPE-Identifier: Automated CPE identification and CVE summaries annotation with Deep Learning and NLP
- Authors: Wanyu Hu, Vrizlynn L. L. Thing,
- Abstract summary: We propose the CPE-Identifier system, an automated CPE annotating and extracting system, from the CVE summaries.
The system can be used as a tool to identify CPE entities from new CVE text inputs.
We also apply Natural Language Processing (NLP) Named Entity Recognition (NER) to identify new technical jargons in the text.
- Score: 0.28281736775010774
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the drastic increase in the number of new vulnerabilities in the National Vulnerability Database (NVD) every year, the workload for NVD analysts to associate the Common Platform Enumeration (CPE) with the Common Vulnerabilities and Exposures (CVE) summaries becomes increasingly laborious and slow. The delay causes organisations, which depend on NVD for vulnerability management and security measurement, to be more vulnerable to zero-day attacks. Thus, it is essential to come out with a technique and tool to extract the CPEs in the CVE summaries accurately and quickly. In this work, we propose the CPE-Identifier system, an automated CPE annotating and extracting system, from the CVE summaries. The system can be used as a tool to identify CPE entities from new CVE text inputs. Moreover, we also automate the data generating and labeling processes using deep learning models. Due to the complexity of the CVE texts, new technical terminologies appear frequently. To identify novel words in future CVE texts, we apply Natural Language Processing (NLP) Named Entity Recognition (NER), to identify new technical jargons in the text. Our proposed model achieves an F1 score of 95.48%, an accuracy score of 99.13%, a precision of 94.83%, and a recall of 96.14%. We show that it outperforms prior works on automated CVE-CPE labeling by more than 9% on all metrics.
Related papers
- Failing Forward: Improving Generative Error Correction for ASR with Synthetic Data and Retrieval Augmentation [73.9145653659403]
We show that Generative Error Correction models struggle to generalize beyond the specific types of errors encountered during training.
We propose DARAG, a novel approach designed to improve GEC for ASR in in-domain (ID) and OOD scenarios.
Our approach is simple, scalable, and both domain- and language-agnostic.
arXiv Detail & Related papers (2024-10-17T04:00:29Z) - SCoPE: Evaluating LLMs for Software Vulnerability Detection [0.0]
This work explores and refines the CVEFixes dataset, which is commonly used to train models for code-related tasks.
The output generated by SCoPE was used to create a new version of CVEFixes.
The results show that SCoPE successfully helped to identify 905 duplicates within the evaluated subset.
arXiv Detail & Related papers (2024-07-19T15:02:00Z) - Unveiling Hidden Links Between Unseen Security Entities [3.7138962865789353]
VulnScopper is an innovative approach that utilizes multi-modal representation learning, combining Knowledge Graphs (KG) and Natural Processing (NLP)
We evaluate VulnScopper on two major security datasets, the National Vulnerability Database (NVD) and the Red Hat CVE database.
Our results show that VulnScopper outperforms existing methods, achieving up to 78% Hits@10 accuracy in linking CVEs to Common Vulnerabilities and Exposures (CWEs), and Common Platform Languageions (CPEs)
arXiv Detail & Related papers (2024-03-04T13:14:39Z) - CVE-driven Attack Technique Prediction with Semantic Information
Extraction and a Domain-specific Language Model [2.1756081703276]
The paper introduces the TTPpredictor tool, which uses innovative techniques to analyze CVE descriptions and infer plausible TTP attacks resulting from CVE exploitation.
TTPpredictor overcomes challenges posed by limited labeled data and semantic disparities between CVE and TTP descriptions.
The paper presents an empirical assessment, demonstrating TTPpredictor's effectiveness with accuracy rates of approximately 98% and F1-scores ranging from 95% to 98% in precise CVE classification to ATT&CK techniques.
arXiv Detail & Related papers (2023-09-06T06:53:45Z) - Does Synthetic Data Generation of LLMs Help Clinical Text Mining? [51.205078179427645]
We investigate the potential of OpenAI's ChatGPT to aid in clinical text mining.
We propose a new training paradigm that involves generating a vast quantity of high-quality synthetic data.
Our method has resulted in significant improvements in the performance of downstream tasks.
arXiv Detail & Related papers (2023-03-08T03:56:31Z) - Privacy Adhering Machine Un-learning in NLP [66.17039929803933]
In real world industry use Machine Learning to build models on user data.
Such mandates require effort both in terms of data as well as model retraining.
continuous removal of data and model retraining steps do not scale.
We propose textitMachine Unlearning to tackle this challenge.
arXiv Detail & Related papers (2022-12-19T16:06:45Z) - KnowDA: All-in-One Knowledge Mixture Model for Data Augmentation in
Few-Shot NLP [68.43279384561352]
Existing data augmentation algorithms leverage task-independent rules or fine-tune general-purpose pre-trained language models.
These methods have trivial task-specific knowledge and are limited to yielding low-quality synthetic data for weak baselines in simple tasks.
We propose the Knowledge Mixture Data Augmentation Model (KnowDA): an encoder-decoder LM pretrained on a mixture of diverse NLP tasks.
arXiv Detail & Related papers (2022-06-21T11:34:02Z) - Lexically Aware Semi-Supervised Learning for OCR Post-Correction [90.54336622024299]
Much of the existing linguistic data in many languages of the world is locked away in non-digitized books and documents.
Previous work has demonstrated the utility of neural post-correction methods on recognition of less-well-resourced languages.
We present a semi-supervised learning method that makes it possible to utilize raw images to improve performance.
arXiv Detail & Related papers (2021-11-04T04:39:02Z) - Anomaly Detection Based on Selection and Weighting in Latent Space [73.01328671569759]
We propose a novel selection-and-weighting-based anomaly detection framework called SWAD.
Experiments on both benchmark and real-world datasets have shown the effectiveness and superiority of SWAD.
arXiv Detail & Related papers (2021-03-08T10:56:38Z) - V2W-BERT: A Framework for Effective Hierarchical Multiclass
Classification of Software Vulnerabilities [7.906207218788341]
We present a novel Transformer-based learning framework (V2W-BERT) in this paper.
By using ideas from natural language processing, link prediction and transfer learning, our method outperforms previous approaches.
We achieve up to 97% prediction accuracy for randomly partitioned data and up to 94% prediction accuracy in temporally partitioned data.
arXiv Detail & Related papers (2021-02-23T05:16:57Z) - ThreatZoom: CVE2CWE using Hierarchical Neural Network [4.254099382808598]
One or more CVEs are grouped into the Common Weakness Exposureion (CWE) classes.
Thousands of critical and new CVEs remain unclassified, yet they are unpatchable.
This paper presents the first automatic tool to classify CVEs to CWEs.
arXiv Detail & Related papers (2020-09-24T06:04:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.