Related papers: SEDAC: A CVAE-Based Data Augmentation Method for Security Bug Report Identification

URL: http://arxiv.org/abs/2401.12060v1
Date: Mon, 22 Jan 2024 15:53:52 GMT
Title: SEDAC: A CVAE-Based Data Augmentation Method for Security Bug Report Identification
Authors: Y. Liao, T. Zhang
Abstract summary: In the real world, the ratio of security bug reports is severely low. SEDAC is a new SBR identification method that generates similar bug report vectors. It outperforms all the baselines in g-measure with improvements of around 14.24%-50.10%.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Bug tracking systems store many bug reports, some of which are related to security. Identifying those security bug reports (SBRs) may help us predict some security-related bugs and solve security issues promptly so that the project can avoid threats and attacks. However, in the real world, the ratio of security bug reports is severely low; thus, directly training a prediction model with raw data may result in inaccurate results. Faced with the massive challenge of data imbalance, many researchers in the past have attempted to use text filtering or clustering methods to minimize the proportion of non-security bug reports (NSBRs) or apply oversampling methods to synthesize SBRs to make the dataset as balanced as possible. Nevertheless, there are still two challenges to those methods: 1) They ignore long-distance contextual information. 2) They fail to generate an utterly balanced dataset. To tackle these two challenges, we propose SEDAC, a new SBR identification method that generates similar bug report vectors to solve data imbalance problems and accurately detect security bug reports. Unlike previous studies, it first converts bug reports into individual bug report vectors with distilBERT, which are based on word2vec. Then, it trains a generative model through conditional variational auto-encoder (CVAE) to generate similar vectors with security labels, which makes the number of SBRs equal to NSBRs'. Finally, balanced data are used to train a security bug report classifier. To evaluate the effectiveness of our framework, we conduct it on 45,940 bug reports from Chromium and four Apache projects. The experimental results show that SEDAC outperforms all the baselines in g-measure with improvements of around 14.24%-50.10%.
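The balancing step and the g-measure mentioned in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes the definition of g-measure commonly used in security bug report studies, the harmonic mean of recall and (1 - false-positive rate), which the abstract itself does not spell out. The names Confusion, synthetic_sbrs_needed, and g_measure are hypothetical, and the counts in the usage note are made-up illustrative numbers, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Confusion-matrix counts, treating SBR as the positive class."""
    tp: int  # SBRs predicted as SBR
    fn: int  # SBRs predicted as NSBR
    fp: int  # NSBRs predicted as SBR
    tn: int  # NSBRs predicted as NSBR


def synthetic_sbrs_needed(n_sbr: int, n_nsbr: int) -> int:
    """Number of synthetic SBR vectors a generator would have to
    produce so that the SBR count equals the NSBR count."""
    return max(n_nsbr - n_sbr, 0)


def g_measure(c: Confusion) -> float:
    """Harmonic mean of recall and (1 - false-positive rate).

    Assumed definition; returns 0.0 when either class is empty or
    when both components are zero.
    """
    positives = c.tp + c.fn
    negatives = c.fp + c.tn
    if positives == 0 or negatives == 0:
        return 0.0
    recall = c.tp / positives
    specificity = c.tn / negatives  # equals 1 - false-positive rate
    if recall + specificity == 0:
        return 0.0
    return 2 * recall * specificity / (recall + specificity)
```

With hypothetical counts of 50 SBRs and 950 NSBRs, synthetic_sbrs_needed(50, 950) asks for 900 generated vectors. A classifier that labels every report as an NSBR reaches 95% accuracy on such data but scores a g-measure of 0, which is why a metric of this kind suits imbalanced datasets better than accuracy.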

Related papers

Automated Duplicate Bug Report Detection in Large Open Bug Repositories [3.481985817302898]
Many users and contributors of large open-source projects report software defects or enhancement requests (known as bug reports) to the issue-tracking systems. We propose a novel approach based on machine learning methods that can automatically detect duplicate bug reports in an open bug repository.
arXiv Detail & Related papers (2025-04-21T01:55:54Z)
GitBugs: Bug Reports for Duplicate Detection, Retrieval Augmented Generation, Triage, and More [0.0]
We present GitBugs, a comprehensive and up-to-date dataset of over 150,000 bug reports from nine actively maintained open-source projects. GitBugs aggregates data from GitHub, Bugzilla, and Jira issue trackers, offering standardized categorical fields for classification tasks. It includes exploratory analysis notebooks and detailed project-level statistics, such as duplicate rates and resolution times.
arXiv Detail & Related papers (2025-04-13T16:55:28Z)
Data-Free Hard-Label Robustness Stealing Attack [67.41281050467889]
We introduce a novel Data-Free Hard-Label Robustness Stealing (DFHL-RS) attack in this paper. It enables the stealing of both model accuracy and robustness by simply querying hard labels of the target model. Our method achieves a clean accuracy of 77.86% and a robust accuracy of 39.51% against AutoAttack.
arXiv Detail & Related papers (2023-12-10T16:14:02Z)
DALA: A Distribution-Aware LoRA-Based Adversarial Attack against Language Models [64.79319733514266]
Adversarial attacks can introduce subtle perturbations to input data. Recent attack methods can achieve a relatively high attack success rate (ASR). We propose a Distribution-Aware LoRA-based Adversarial Attack (DALA) method.
arXiv Detail & Related papers (2023-11-14T23:43:47Z)
A Comparative Study of Text Embedding Models for Semantic Text Similarity in Bug Reports [0.0]
Retrieving similar bug reports from an existing database can help reduce the time and effort required to resolve bugs. We explored several embedding models such as TF-IDF (Baseline), FastText, Gensim, BERT, and ADA. Our study provides insights into the effectiveness of different embedding methods for retrieving similar bug reports and highlights the impact of selecting the appropriate one for this task.
arXiv Detail & Related papers (2023-08-17T21:36:56Z)
Auto-labelling of Bug Report using Natural Language Processing [0.0]
Rule and Query-based solutions recommend a long list of potential similar bug reports with no clear ranking. In this paper, we have proposed a solution using a combination of NLP techniques. It uses a custom data transformer, a deep neural network, and a non-generalizing machine learning method to retrieve existing identical bug reports.
arXiv Detail & Related papers (2022-12-13T02:32:42Z)
Infrared: A Meta Bug Detector [10.541969253100815]
We propose a new approach, called meta bug detection, which offers three crucial advantages over existing learning-based bug detectors. Our evaluation shows our meta bug detector (MBD) is effective in catching a variety of bugs including null pointer dereference, array index out-of-bound, file handle leak, and even data races in concurrent programs.
arXiv Detail & Related papers (2022-09-18T09:08:51Z)
Automatic Classification of Bug Reports Based on Multiple Text Information and Reports' Intention [37.67372105858311]
This paper proposes a new automatic classification method for bug reports. The innovation is that, when categorizing bug reports, the method considers the intention of the report in addition to its text. Our proposed method achieves better performance, with an F-measure ranging from 87.3% to 95.5%.
arXiv Detail & Related papers (2022-08-02T06:44:51Z)
Leveraging Unlabeled Data to Predict Out-of-Distribution Performance [63.740181251997306]
Real-world machine learning deployments are characterized by mismatches between the source (training) and target (test) distributions. In this work, we investigate methods for predicting the target domain accuracy using only labeled source data and unlabeled target data. We propose Average Thresholded Confidence (ATC), a practical method that learns a threshold on the model's confidence, predicting accuracy as the fraction of unlabeled examples.
arXiv Detail & Related papers (2022-01-11T23:01:12Z)
Early Detection of Security-Relevant Bug Reports using Machine Learning: How Far Are We? [6.438136820117887]
In a typical maintenance scenario, security-relevant bug reports are prioritised by the development team when preparing corrective patches. Open security-relevant bug reports can become a critical leak of sensitive information that attackers can leverage to perform zero-day attacks. In recent years, approaches for the detection of security-relevant bug reports based on machine learning have been reported with promising performance.
arXiv Detail & Related papers (2021-12-19T11:30:29Z)
Learning Stable Classifiers by Transferring Unstable Features [59.06169363181417]
We study transfer learning in the presence of spurious correlations. We experimentally demonstrate that directly transferring the stable feature extractor learned on the source task may not eliminate these biases for the target task. We hypothesize that the unstable features in the source task and those in the target task are directly related.
arXiv Detail & Related papers (2021-06-15T02:41:12Z)
S3M: Siamese Stack (Trace) Similarity Measure [55.58269472099399]
We present S3M -- the first approach to computing stack trace similarity based on deep learning. It is based on a biLSTM encoder and a fully-connected classifier to compute similarity. Our experiments demonstrate the superiority of our approach over the state-of-the-art on both open-sourced data and a private JetBrains dataset.
arXiv Detail & Related papers (2021-03-18T21:10:41Z)
D2A: A Dataset Built for AI-Based Vulnerability Detection Methods Using Differential Analysis [55.15995704119158]
We propose D2A, a differential analysis based approach to label issues reported by static analysis tools. We use D2A to generate a large labeled dataset to train models for vulnerability identification.
arXiv Detail & Related papers (2021-02-16T07:46:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.