Exploring Security Commits in Python
- URL: http://arxiv.org/abs/2307.11853v1
- Date: Fri, 21 Jul 2023 18:46:45 GMT
- Title: Exploring Security Commits in Python
- Authors: Shiyu Sun, Shu Wang, Xinda Wang, Yunlong Xing, Elisa Zhang, Kun Sun
- Abstract summary: Most security issues in Python have not been indexed by CVE and may only be fixed by 'silent' security commits.
It is critical to identify these hidden security commits, yet existing datasets and methods fall short due to limited data variety, non-comprehensive code semantics, and uninterpretable learned features.
We construct the first security commit dataset in Python, PySecDB, which consists of three subsets: a base dataset, a pilot dataset, and an augmented dataset.
- Score: 11.533638656389137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Python has become the most popular programming language, as it is
friendly for beginners to work with. However, a recent study has found that most security
issues in Python have not been indexed by CVE and may only be fixed by 'silent'
security commits, which threaten software security and hinder the propagation
of security fixes to downstream software. It is critical to identify the hidden
security commits; however, the existing datasets and methods are insufficient
for security commit detection in Python, due to the limited data variety,
non-comprehensive code semantics, and uninterpretable learned features. In this
paper, we construct the first security commit dataset in Python, namely
PySecDB, which consists of three subsets: a base dataset, a pilot
dataset, and an augmented dataset. The base dataset contains the security
commits associated with CVE records provided by MITRE. To increase the variety
of security commits, we build the pilot dataset from GitHub by filtering
keywords within the commit messages. Since not all commits provide commit
messages, we further construct the augmented dataset by understanding the
semantics of code changes. To build the augmented dataset, we propose a new
graph representation named CommitCPG and a multi-attributed graph learning
model named SCOPY to identify the security commit candidates through both
sequential and structural code semantics. The evaluation shows our proposed
algorithms can improve the data collection efficiency by up to 40 percentage
points. After manual verification by three security experts, PySecDB consists
of 1,258 security commits and 2,791 non-security commits. Furthermore, we
conduct an extensive case study on PySecDB and discover four common security
fix patterns that cover over 85% of security commits in Python, providing
insight into secure software maintenance, vulnerability detection, and
automated program repair.
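As a concrete illustration of the pilot-dataset construction described above, the sketch below filters commit messages for security-related keywords. The keyword list, repository path, and helper name are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: keyword-based filtering of commit messages, in the spirit
# of the pilot dataset. The keyword list below is a hypothetical example;
# the paper's curated list may differ.
import re
import subprocess

SECURITY_KEYWORDS = re.compile(
    r"\b(security|vulnerab\w*|exploit|CVE-\d{4}-\d+|XSS|injection|overflow)\b",
    re.IGNORECASE,
)

def security_commit_candidates(repo_path: str) -> list[tuple[str, str]]:
    """Return (hash, subject) pairs whose commit subjects match a keyword."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H\x1f%s\x1e"],
        capture_output=True, text=True, check=True,
    ).stdout
    candidates = []
    for record in log.split("\x1e"):
        if not record.strip():
            continue
        commit_hash, _, subject = record.strip().partition("\x1f")
        if SECURITY_KEYWORDS.search(subject):
            candidates.append((commit_hash, subject))
    return candidates
```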
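The abstract describes CommitCPG only at a high level, so the following is a loose sketch, under assumptions, of a multi-attributed commit graph that combines structural and sequential code semantics: AST edges capture structure, source-order edges capture sequence, and a version attribute distinguishes pre- and post-commit code. All names here are hypothetical; the paper's actual construction differs in detail.

```python
# Loose sketch of a multi-attributed commit graph (not the paper's exact
# CommitCPG): structural (AST) edges plus sequential (source-order) edges,
# with a "version" attribute marking old vs. new code.
import ast
import networkx as nx

def code_graph(source: str, version: str) -> nx.DiGraph:
    g = nx.DiGraph()
    tree = ast.parse(source)
    # Structural edges: parent -> child in the AST.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            g.add_node(id(parent), label=type(parent).__name__, version=version)
            g.add_node(id(child), label=type(child).__name__, version=version)
            g.add_edge(id(parent), id(child), kind="ast")
    # Sequential edges: statements in source order within each body.
    for node in ast.walk(tree):
        body = getattr(node, "body", [])
        for a, b in zip(body, body[1:]):
            g.add_edge(id(a), id(b), kind="seq")
    return g

# Merge the pre- and post-commit graphs into one commit-level graph.
old = code_graph("def f(x):\n    return eval(x)\n", version="old")
new = code_graph("def f(x):\n    return int(x)\n", version="new")
commit_graph = nx.union(old, new, rename=("old-", "new-"))
```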
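The four fix patterns themselves are detailed in the paper body rather than the abstract; purely as an illustration of the kind of change involved, the hypothetical fix below adds a missing sanity check, a recurring shape of Python security fixes.

```python
# Illustrative only (not taken from PySecDB): a hypothetical commit that adds
# a sanity check to block path traversal such as filename = "../etc/passwd".
import os

BASE_DIR = "/srv/app/uploads"

def read_upload(filename: str) -> bytes:
    path = os.path.realpath(os.path.join(BASE_DIR, filename))
    # Added by the fix: reject paths that resolve outside BASE_DIR.
    if not path.startswith(BASE_DIR + os.sep):
        raise ValueError(f"illegal path: {filename!r}")
    with open(path, "rb") as f:
        return f.read()
```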
Related papers
- An Empirical Study of Vulnerability Handling Times in CPython [0.2538209532048867]
The paper examines the handling times of software vulnerabilities in CPython.
The paper contributes to the recent effort to better understand the security of the Python ecosystem.
arXiv Detail & Related papers (2024-11-01T08:46:14Z)
- Secret Breach Prevention in Software Issue Reports [2.8747015994080285]
This paper presents a novel technique for secret breach detection in software issue reports.
We highlight the challenges posed by noise, such as log files, URLs, commit IDs, stack traces, and dummy passwords.
We propose an approach that combines the strengths of state-of-the-art detection techniques with the contextual understanding of language models.
arXiv Detail & Related papers (2024-10-31T06:14:17Z)
- HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that much of the code generated by LLMs contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics.
WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z)
- Python Fuzzing for Trustworthy Machine Learning Frameworks [0.0]
We propose a dynamic analysis pipeline for Python projects using Sydr-Fuzz.
Our pipeline includes fuzzing, corpus minimization, crash triaging, and coverage collection.
To identify the most vulnerable parts of machine learning frameworks, we analyze their potential attack surfaces and develop fuzz targets for PyTorch and related projects such as h5py.
arXiv Detail & Related papers (2024-03-19T13:41:11Z)
- HasTEE+ : Confidential Cloud Computing and Analytics with Haskell [50.994023665559496]
Confidential computing enables the protection of confidential code and data in a co-tenanted cloud deployment using specialized hardware isolation units called Trusted Execution Environments (TEEs).
TEEs offer low-level C/C++-based toolchains that are susceptible to inherent memory safety vulnerabilities and lack language constructs to monitor explicit and implicit information-flow leaks.
We address the above with HasTEE+, a domain-specific language (DSL) embedded in Haskell that enables programming TEEs in a high-level language with strong type safety.
arXiv Detail & Related papers (2024-01-17T00:56:23Z)
- Multi-Granularity Detector for Vulnerability Fixes [13.653249890867222]
We propose MiDas (Multi-Granularity Detector for Vulnerability Fixes) to identify vulnerability-fixing commits.
MiDas constructs different neural networks for each level of code change granularity, corresponding to commit-level, file-level, hunk-level, and line-level.
MiDas outperforms the current state-of-the-art baseline in terms of AUC by 4.9% and 13.7% on Java-based and Python-based datasets, respectively.
arXiv Detail & Related papers (2023-05-23T10:06:28Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
- Pre-trained Encoders in Self-Supervised Learning Improve Secure and Privacy-preserving Supervised Learning [63.45532264721498]
Self-supervised learning is an emerging technique to pre-train encoders using unlabeled data.
We perform the first systematic, principled measurement study to understand whether and when a pre-trained encoder can address the limitations of secure or privacy-preserving supervised learning algorithms.
arXiv Detail & Related papers (2022-12-06T21:35:35Z)
- Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories [7.629717457706326]
We present an approach that combines practical experience with machine learning (ML).
An advisory record containing key information about a vulnerability is extracted from each advisory.
A subset of candidate fix commits is obtained from the source code repository of the affected project.
arXiv Detail & Related papers (2021-03-24T17:50:35Z)
- SafePILCO: a software tool for safe and data-efficient policy synthesis [67.17251247987187]
SafePILCO is a software tool for safe and data-efficient policy search with reinforcement learning.
It extends the known PILCO algorithm, originally written in MATLAB, to support safe learning.
arXiv Detail & Related papers (2020-08-07T17:17:30Z)