Few-Sample Named Entity Recognition for Security Vulnerability Reports
by Fine-Tuning Pre-Trained Language Models
- URL: http://arxiv.org/abs/2108.06590v1
- Date: Sat, 14 Aug 2021 17:08:03 GMT
- Authors: Guanqun Yang, Shay Dineen, Zhipeng Lin, Xueqing Liu
- Abstract summary: Public security vulnerability reports (e.g., CVE reports) play an important role in the maintenance of computer and network systems.
Since these reports are unstructured texts, automatic information extraction (IE) can help scale up the processing.
Existing works on automated IE for security vulnerability reports often rely on a large number of labeled training samples.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Public security vulnerability reports (e.g., CVE reports) play an important
role in the maintenance of computer and network systems. Security companies and
administrators rely on information from these reports to prioritize tasks on
developing and deploying patches to their customers. Since these reports are
unstructured texts, automatic information extraction (IE) can help scale up the
processing by converting the unstructured reports to structured forms, e.g.,
software names and versions and vulnerability types. Existing works on
automated IE for security vulnerability reports often rely on a large number of
labeled training samples. However, creating a massive labeled training set is
both expensive and time-consuming. In this work, for the first time, we propose
to investigate this problem where only a small number of labeled training
samples are available. In particular, we investigate the performance of
fine-tuning several state-of-the-art pre-trained language models on our small
training dataset. The results show that with pre-trained language models and
carefully tuned hyperparameters, we have reached or slightly outperformed the
state-of-the-art system on this task. Within the previous two-step process of
first fine-tuning on the main category and then transferring to the others as
in [7], our proposed approach substantially decreases the number of required
labeled samples in both stages: a 90% reduction in fine-tuning (from 5758 to
576) and an 88.8% reduction in transfer learning (64 labeled samples per
category). Our experiments thus demonstrate the effectiveness of few-sample
learning on NER for security vulnerability reports.
This result opens up multiple research opportunities for few-sample learning
for security vulnerability reports, which are discussed in the paper. Code:
https://github.com/guanqun-yang/FewVulnerability.
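As a minimal illustration of the NER task the abstract describes (not the authors' implementation), the sketch below converts token-level entity spans in a CVE-style sentence into BIO tags, the standard label encoding for fine-tuning token-classification models. The whitespace tokenization and the SOFTWARE/VERSION label names are simplifying assumptions.

```python
# Minimal BIO tagging sketch for vulnerability-report NER.
# Whitespace tokenization and the label names are illustrative
# assumptions, not the paper's actual pipeline.

def bio_tags(tokens, entity_spans):
    """entity_spans: list of (start_token, end_token_exclusive, label)."""
    tags = ["O"] * len(tokens)
    for start, end, label in entity_spans:
        tags[start] = f"B-{label}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # continuation tokens
    return tags

tokens = "Buffer overflow in Apache HTTP Server 2.4.49".split()
# "Apache HTTP Server" -> software name, "2.4.49" -> version (hypothetical labels)
spans = [(3, 6, "SOFTWARE"), (6, 7, "VERSION")]
print(list(zip(tokens, bio_tags(tokens, spans))))
```

A fine-tuned pre-trained language model would then predict one such tag per token, from which the structured fields (software names, versions, vulnerability types) are recovered.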
Related papers
- Few-shot learning for security bug report identification [0.5076419064097734]
We propose a few-shot learning-based technique to identify security bug reports using limited labeled data. We employ SetFit, a state-of-the-art few-shot learning framework that combines sentence transformers with contrastive learning and parameter-efficient fine-tuning. Our approach achieves an AUC of up to 0.865, outperforming traditional ML baselines on all of the evaluated datasets.
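SetFit's first stage trains a sentence transformer on contrastive pairs built from the few labeled examples: same-label pairs act as positives, cross-label pairs as negatives. A stdlib-only sketch of that pair construction (illustrative only; the real SetFit library handles this internally):

```python
# Sketch of SetFit-style contrastive pair generation from few labeled
# examples. Texts and labels below are made up for illustration.
from itertools import combinations

def contrastive_pairs(examples):
    """examples: list of (text, label). Returns (text_a, text_b, similarity),
    where similarity is 1 for same-label pairs and 0 otherwise."""
    pairs = []
    for (text_a, label_a), (text_b, label_b) in combinations(examples, 2):
        pairs.append((text_a, text_b, 1 if label_a == label_b else 0))
    return pairs

data = [
    ("SQL injection in login form", "security"),
    ("Heap overflow when parsing PNG", "security"),
    ("Button color is wrong on hover", "non-security"),
]
for a, b, y in contrastive_pairs(data):
    print(y, "|", a, "<->", b)
```

Because every pair of the N labeled examples yields a training signal, a handful of reports produces O(N^2) contrastive examples, which is what makes the approach viable with limited labels.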
arXiv Detail & Related papers (2026-01-06T12:29:20Z)
- Advancing Vulnerability Classification with BERT: A Multi-Objective Learning Model [0.0]
This paper presents a novel vulnerability report classification system that leverages the BERT (Bidirectional Encoder Representations from Transformers) model to perform multi-label classification.
The system is deployed via a REST API and a Streamlit UI, enabling real-time vulnerability analysis.
arXiv Detail & Related papers (2025-03-26T06:04:45Z)
- Zero-shot prompt-based classification: topic labeling in times of foundation models in German Tweets [1.734165485480267]
We propose a new tool for automatically annotating text using written guidelines without providing training samples.
Our results show that the prompt-based approach is comparable with the fine-tuned BERT but without any annotated training data.
Our findings emphasize the ongoing paradigm shift in the NLP landscape, i.e., the unification of downstream tasks and elimination of the need for pre-labeled training data.
arXiv Detail & Related papers (2024-06-26T10:44:02Z)
- To Err is Machine: Vulnerability Detection Challenges LLM Reasoning [8.602355712876815]
We present a challenging code reasoning task: vulnerability detection.
State-of-the-art (SOTA) models reported only 54.5% Balanced Accuracy in our vulnerability detection evaluation.
New models, new training methods, or more execution-specific pretraining data may be needed to conquer vulnerability detection.
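Balanced Accuracy, the metric quoted above, is the mean of per-class recall, so a classifier that always predicts the majority class scores only 0.5 on a binary task. A quick reference implementation:

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall; 0.5 is chance level for binary labels."""
    classes = set(y_true)
    recalls = []
    for c in classes:
        idx = [i for i, t in enumerate(y_true) if t == c]
        correct = sum(1 for i in idx if y_pred[i] == c)
        recalls.append(correct / len(idx))
    return sum(recalls) / len(recalls)

# Predicting "vulnerable" everywhere on an imbalanced set scores only 0.5:
y_true = [1, 1, 1, 0]
y_pred = [1, 1, 1, 1]
print(balanced_accuracy(y_true, y_pred))  # 0.5
```

Seen this way, the reported 54.5% is barely above chance, which is the point the paper makes about the difficulty of the task.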
arXiv Detail & Related papers (2024-03-25T21:47:36Z)
- Zero-shot Retrieval: Augmenting Pre-trained Models with Search Engines [83.65380507372483]
Large pre-trained models can dramatically reduce the amount of task-specific data required to solve a problem, but they often fail to capture domain-specific nuances out of the box.
This paper shows how to leverage recent advances in NLP and multi-modal learning to augment a pre-trained model with search engine retrieval.
arXiv Detail & Related papers (2023-11-29T05:33:28Z)
- Text generation for dataset augmentation in security classification tasks [55.70844429868403]
This study evaluates the application of natural language text generators to fill this data gap in multiple security-related text classification tasks.
We find substantial benefits for GPT-3 data augmentation strategies in situations with severe limitations on known positive-class samples.
arXiv Detail & Related papers (2023-10-22T22:25:14Z)
- Uncertainty-aware Parameter-Efficient Self-training for Semi-supervised Language Understanding [38.11411155621616]
We study self-training as one of the predominant semi-supervised learning approaches.
We present UPET, a novel Uncertainty-aware self-Training framework.
We show that UPET achieves a substantial improvement in terms of performance and efficiency.
arXiv Detail & Related papers (2023-10-19T02:18:29Z)
- LIVABLE: Exploring Long-Tailed Classification of Software Vulnerability Types [18.949810432641772]
We propose a Long-taIled software VulnerABiLity typE classification approach, called LIVABLE.
LIVABLE consists of two modules, including (1) a vulnerability representation learning module, which improves the propagation steps in the GNN.
A sequence-to-sequence model is also involved to enhance the vulnerability representations.
arXiv Detail & Related papers (2023-06-12T08:14:16Z)
- Cross Project Software Vulnerability Detection via Domain Adaptation and Max-Margin Principle [21.684043656053106]
Software vulnerabilities (SVs) have become a common, serious and crucial concern due to the ubiquity of computer software.
We propose a novel end-to-end approach to tackle these two crucial issues.
Our method improves performance on F1-measure, the most important measure in SVD, by 1.83% to 6.25% over the second-best method on the datasets used.
arXiv Detail & Related papers (2022-09-19T23:47:22Z)
- VulBERTa: Simplified Source Code Pre-Training for Vulnerability Detection [1.256413718364189]
VulBERTa is a deep learning approach to detect security vulnerabilities in source code.
Our approach pre-trains a RoBERTa model with a custom tokenisation pipeline on real-world code from open-source C/C++ projects.
We evaluate our approach on binary and multi-class vulnerability detection tasks across several datasets.
arXiv Detail & Related papers (2022-05-25T00:56:43Z)
- Revisiting Self-Training for Few-Shot Learning of Language Model [61.173976954360334]
Unlabeled data carry rich task-relevant information and have proven useful for few-shot learning of language models.
In this work, we revisit the self-training technique for language model fine-tuning and present a state-of-the-art prompt-based few-shot learner, SFLM.
arXiv Detail & Related papers (2021-10-04T08:51:36Z)
- TSDAE: Using Transformer-based Sequential Denoising Auto-Encoder for Unsupervised Sentence Embedding Learning [53.32740707197856]
We present a new state-of-the-art unsupervised method based on pre-trained Transformers and a Sequential Denoising Auto-Encoder (TSDAE).
It can achieve up to 93.1% of the performance of in-domain supervised approaches.
arXiv Detail & Related papers (2021-04-14T17:02:18Z)
- Adaptive Self-training for Few-shot Neural Sequence Labeling [55.43109437200101]
We develop techniques to address the label scarcity challenge for neural sequence labeling models.
Self-training serves as an effective mechanism to learn from large amounts of unlabeled data.
Meta-learning helps in adaptive sample re-weighting to mitigate error propagation from noisy pseudo-labels.
arXiv Detail & Related papers (2020-10-07T22:29:05Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, perform within 3% of fully supervised pre-trained language models.
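The uncertainty-aware self-training idea above can be sketched as filtering pseudo-labels by predictive entropy, keeping only examples the model is confident about. The entropy threshold and toy class probabilities below are illustrative assumptions, not values from the paper:

```python
# Sketch of uncertainty-aware pseudo-label selection for self-training.
# The 0.3-nat threshold and the probabilities are illustrative only.
import math

def entropy(probs):
    """Shannon entropy of a predicted class distribution (in nats)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_pseudo_labels(unlabeled, threshold=0.3):
    """Keep (text, argmax_label) only for low-entropy (confident)
    predictions. `unlabeled` holds (text, class-probability) pairs."""
    kept = []
    for text, probs in unlabeled:
        if entropy(probs) < threshold:
            label = max(range(len(probs)), key=lambda i: probs[i])
            kept.append((text, label))
    return kept

batch = [
    ("report A", [0.97, 0.03]),   # confident -> pseudo-labeled
    ("report B", [0.55, 0.45]),   # uncertain -> discarded
]
print(select_pseudo_labels(batch))  # [('report A', 0)]
```

The selected pseudo-labeled examples are then added to the small labeled set for the next fine-tuning round, which is how self-training stretches 20-30 labels per class.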
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.