SEDAC: A CVAE-Based Data Augmentation Method for Security Bug Report
Identification
- URL: http://arxiv.org/abs/2401.12060v1
- Date: Mon, 22 Jan 2024 15:53:52 GMT
- Title: SEDAC: A CVAE-Based Data Augmentation Method for Security Bug Report
Identification
- Authors: Y. Liao, T. Zhang
- Abstract summary: In the real world, the ratio of security bug reports is severely low.
SEDAC is a new SBR identification method that generates similar bug report vectors.
It outperforms all the baselines in g-measure with improvements of around 14.24%-50.10%.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Bug tracking systems store many bug reports, some of which are related to
security. Identifying those security bug reports (SBRs) may help us predict
some security-related bugs and solve security issues promptly so that the
project can avoid threats and attacks. However, in the real world, the ratio of
security bug reports is severely low; thus, directly training a prediction
model with raw data may result in inaccurate results. Faced with the massive
challenge of data imbalance, many researchers in the past have attempted to use
text filtering or clustering methods to minimize the proportion of non-security
bug reports (NSBRs) or apply oversampling methods to synthesize SBRs to make
the dataset as balanced as possible. Nevertheless, there are still two
challenges to those methods: 1) They ignore long-distance contextual
information. 2) They fail to generate an utterly balanced dataset. To tackle
these two challenges, we propose SEDAC, a new SBR identification method that
generates similar bug report vectors to solve data imbalance problems and
accurately detect security bug reports. Unlike previous studies, it first
converts bug reports into individual bug report vectors with distilBERT, which
are based on word2vec. Then, it trains a generative model through conditional
variational auto-encoder (CVAE) to generate similar vectors with security
labels, which makes the number of SBRs equal to NSBRs'. Finally, balanced data
are used to train a security bug report classifier. To evaluate the
effectiveness of our framework, we apply it to 45,940 bug reports from
Chromium and four Apache projects. The experimental results show that SEDAC
outperforms all the baselines in g-measure with improvements of around
14.24%-50.10%.
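The balancing target and the evaluation metric mentioned in the abstract can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. The helper names `synthetic_needed` and `g_measure` are hypothetical, labels are assumed to be 1 for SBR and 0 for NSBR, and g-measure is assumed to follow the definition common in SBR prediction studies (harmonic mean of recall and 1 minus the false-positive rate), which the abstract itself does not spell out.

```python
from collections import Counter


def synthetic_needed(labels):
    """Return how many synthetic SBR vectors a generator would have to
    produce so that the SBR count equals the NSBR count.
    Assumption: label 1 = SBR, label 0 = NSBR."""
    counts = Counter(labels)
    return max(counts[0] - counts[1], 0)


def g_measure(y_true, y_pred):
    """Harmonic mean of recall (pd) and 1 - false-positive rate (pf).
    Assumption: this is the g-measure definition used in the paper."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    pd = tp / (tp + fn) if (tp + fn) else 0.0  # recall on SBRs
    pf = fp / (fp + tn) if (fp + tn) else 0.0  # false alarms on NSBRs
    spec = 1.0 - pf
    return 2 * pd * spec / (pd + spec) if (pd + spec) else 0.0


if __name__ == "__main__":
    labels = [0] * 8 + [1] * 2           # toy imbalanced label set
    print(synthetic_needed(labels))      # synthetic SBRs required
    print(g_measure([1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1]))
```

Because g-measure combines recall with the false-alarm rate, a classifier that labels every report as NSBR scores 0, which is why it is preferred to plain accuracy on heavily imbalanced data.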
Related papers
- Data-Free Hard-Label Robustness Stealing Attack [67.41281050467889]
We introduce a novel Data-Free Hard-Label Robustness Stealing (DFHL-RS) attack in this paper.
It enables the stealing of both model accuracy and robustness by simply querying hard labels of the target model.
Our method achieves a clean accuracy of 77.86% and a robust accuracy of 39.51% against AutoAttack.
arXiv Detail & Related papers (2023-12-10T16:14:02Z)
- DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models [64.79319733514266]
Adversarial attacks can introduce subtle perturbations to input data.
Recent attack methods can achieve a relatively high attack success rate (ASR).
We propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method.
arXiv Detail & Related papers (2023-11-14T23:43:47Z)
- A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports [0.0]
Retrieving similar bug reports from an existing database can help reduce the time and effort required to resolve bugs.
We explored several embedding models such as TF-IDF (Baseline), FastText, Gensim, BERT, and ADA.
Our study provides insights into the effectiveness of different embedding methods for retrieving similar bug reports and highlights the impact of selecting the appropriate one for this task.
arXiv Detail & Related papers (2023-08-17T21:36:56Z)
- Auto-labelling of Bug Report using Natural Language Processing [0.0]
Rule and Query-based solutions recommend a long list of potential similar bug reports with no clear ranking.
In this paper, we have proposed a solution using a combination of NLP techniques.
It uses a custom data transformer, a deep neural network, and a non-generalizing machine learning method to retrieve existing identical bug reports.
arXiv Detail & Related papers (2022-12-13T02:32:42Z)
- Infrared: A Meta Bug Detector [10.541969253100815]
We propose a new approach, called meta bug detection, which offers three crucial advantages over existing learning-based bug detectors.
Our evaluation shows our meta bug detector (MBD) is effective in catching a variety of bugs including null pointer dereference, array index out-of-bound, file handle leak, and even data races in concurrent programs.
arXiv Detail & Related papers (2022-09-18T09:08:51Z)
- Automatic Classification of Bug Reports Based on Multiple Text Information and Reports' Intention [37.67372105858311]
This paper proposes a new automatic classification method for bug reports.
The innovation is that when categorizing bug reports, in addition to using the text information of the report, the intention of the report is also considered.
Our proposed method achieves better performance, with an F-measure ranging from 87.3% to 95.5%.
arXiv Detail & Related papers (2022-08-02T06:44:51Z)
- Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions.
In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data.
We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples for which model confidence exceeds that threshold.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
- Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We? [6.438136820117887]
In a typical maintenance scenario, security-relevant bug reports are prioritised by the development team when preparing corrective patches.
Open security-relevant bug reports can become a critical leak of sensitive information that attackers can leverage to perform zero-day attacks.
In recent years, approaches for the detection of security-relevant bug reports based on machine learning have been reported with promising performance.
arXiv Detail & Related papers (2021-12-19T11:30:29Z)
- Learning Stable Classifiers by Transferring Unstable Features [59.06169363181417]
We study transfer learning in the presence of spurious correlations.
We experimentally demonstrate that directly transferring the stable feature extractor learned on the source task may not eliminate these biases for the target task.
We hypothesize that the unstable features in the source task and those in the target task are directly related.
arXiv Detail & Related papers (2021-06-15T02:41:12Z)
- S3M: Siamese Stack (Trace) Similarity Measure [55.58269472099399]
We present S3M -- the first approach to computing stack trace similarity based on deep learning.
It is based on a biLSTM encoder and a fully-connected classifier to compute similarity.
Our experiments demonstrate the superiority of our approach over the state-of-the-art on both open-sourced data and a private JetBrains dataset.
arXiv Detail & Related papers (2021-03-18T21:10:41Z)
- D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools.
We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.