Exploring Security Commits in Python
- URL: http://arxiv.org/abs/2307.11853v1
- Date: Fri, 21 Jul 2023 18:46:45 GMT
- Title: Exploring Security Commits in Python
- Authors: Shiyu Sun, Shu Wang, Xinda Wang, Yunlong Xing, Elisa Zhang, Kun Sun
- Abstract summary: Most security issues in Python have not been indexed by CVE and may only be fixed by 'silent' security commits.
It is critical to identify these hidden security commits, yet existing datasets and methods fall short due to limited data variety, non-comprehensive code semantics, and uninterpretable learned features.
We construct the first security commit dataset in Python, PySecDB, which consists of three subsets: a base dataset, a pilot dataset, and an augmented dataset.
- Score: 11.533638656389137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Python has become the most popular programming language, as it is
friendly for beginners to work with. However, a recent study has found that most security
issues in Python have not been indexed by CVE and may only be fixed by 'silent'
security commits, which threaten software security and hinder the propagation
of security fixes to downstream software. It is critical to identify the hidden
security commits; however, the existing datasets and methods are insufficient
for security commit detection in Python, due to the limited data variety,
non-comprehensive code semantics, and uninterpretable learned features. In this
paper, we construct the first security commit dataset in Python, namely
PySecDB, which consists of three subsets: a base dataset, a pilot
dataset, and an augmented dataset. The base dataset contains the security
commits associated with CVE records provided by MITRE. To increase the variety
of security commits, we build the pilot dataset from GitHub by filtering
keywords within the commit messages. Since not all commits provide commit
messages, we further construct the augmented dataset by understanding the
semantics of code changes. To build the augmented dataset, we propose a new
graph representation named CommitCPG and a multi-attributed graph learning
model named SCOPY to identify the security commit candidates through both
sequential and structural code semantics. The evaluation shows our proposed
algorithms can improve the data collection efficiency by up to 40 percentage
points. After manual verification by three security experts, PySecDB consists
of 1,258 security commits and 2,791 non-security commits. Furthermore, we
conduct an extensive case study on PySecDB and discover four common security
fix patterns that cover over 85% of security commits in Python, providing
insight into secure software maintenance, vulnerability detection, and
automated program repair.
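As a concrete illustration of the pilot-dataset construction described above, the sketch below filters commit messages for security-related keywords. The keyword list, repository path, and helper name are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: keyword-based filtering of commit messages, in the spirit
# of the pilot dataset. The keyword list below is a hypothetical example;
# the paper's curated list may differ.
import re
import subprocess

SECURITY_KEYWORDS = re.compile(
    r"\b(security|vulnerab\w*|exploit|CVE-\d{4}-\d+|XSS|injection|overflow)\b",
    re.IGNORECASE,
)

def security_commit_candidates(repo_path: str) -> list[tuple[str, str]]:
    """Return (hash, subject) pairs whose commit subjects match a keyword."""
    log = subprocess.run(
        ["git", "-C", repo_path, "log", "--pretty=format:%H\x1f%s\x1e"],
        capture_output=True, text=True, check=True,
    ).stdout
    candidates = []
    for record in log.split("\x1e"):
        if not record.strip():
            continue
        commit_hash, _, subject = record.strip().partition("\x1f")
        if SECURITY_KEYWORDS.search(subject):
            candidates.append((commit_hash, subject))
    return candidates
```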
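The abstract describes CommitCPG only at a high level, so the following is a loose sketch, under assumptions, of a multi-attributed commit graph that combines structural and sequential code semantics: AST edges capture structure, source-order edges capture sequence, and a version attribute distinguishes pre- and post-commit code. All names here are hypothetical; the paper's actual construction differs in detail.

```python
# Loose sketch of a multi-attributed commit graph (not the paper's exact
# CommitCPG): structural (AST) edges plus sequential (source-order) edges,
# with a "version" attribute marking old vs. new code.
import ast
import networkx as nx

def code_graph(source: str, version: str) -> nx.DiGraph:
    g = nx.DiGraph()
    tree = ast.parse(source)
    # Structural edges: parent -> child in the AST.
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            g.add_node(id(parent), label=type(parent).__name__, version=version)
            g.add_node(id(child), label=type(child).__name__, version=version)
            g.add_edge(id(parent), id(child), kind="ast")
    # Sequential edges: statements in source order within each body.
    for node in ast.walk(tree):
        body = getattr(node, "body", [])
        for a, b in zip(body, body[1:]):
            g.add_edge(id(a), id(b), kind="seq")
    return g

# Merge the pre- and post-commit graphs into one commit-level graph.
old = code_graph("def f(x):\n    return eval(x)\n", version="old")
new = code_graph("def f(x):\n    return int(x)\n", version="new")
commit_graph = nx.union(old, new, rename=("old-", "new-"))
```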
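The four fix patterns themselves are detailed in the paper body rather than the abstract; purely as an illustration of the kind of change involved, the hypothetical fix below adds a missing sanity check, a recurring shape of Python security fixes.

```python
# Illustrative only (not taken from PySecDB): a hypothetical commit that adds
# a sanity check to block path traversal such as filename = "../etc/passwd".
import os

BASE_DIR = "/srv/app/uploads"

def read_upload(filename: str) -> bytes:
    path = os.path.realpath(os.path.join(BASE_DIR, filename))
    # Added by the fix: reject paths that resolve outside BASE_DIR.
    if not path.startswith(BASE_DIR + os.sep):
        raise ValueError(f"illegal path: {filename!r}")
    with open(path, "rb") as f:
        return f.read()
```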
Related papers
- An Empirical Study of Vulnerability Handling Times in CPython [0.2538209532048867]
The paper examines the handling times of software vulnerabilities in CPython.
The paper contributes to the recent effort to better understand the security of the Python ecosystem.
arXiv Detail & Related papers (2024-11-01T08:46:14Z)
- Secret Breach Prevention in Software Issue Reports [2.8747015994080285]
This paper presents a novel technique for secret breach detection in software issue reports.
We highlight the challenges posed by noise, such as log files, URLs, commit IDs, stack traces, and dummy passwords.
We propose an approach that combines the strengths of state-of-the-art detection techniques with the contextual understanding of language models.
arXiv Detail & Related papers (2024-10-31T06:14:17Z)
- HexaCoder: Secure Code Generation via Oracle-Guided Synthetic Training Data [60.75578581719921]
Large language models (LLMs) have shown great potential for automatic code generation.
Recent studies highlight that much of the code generated by LLMs contains serious security vulnerabilities.
We introduce HexaCoder, a novel approach to enhance the ability of LLMs to generate secure code.
arXiv Detail & Related papers (2024-09-10T12:01:43Z)
- WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models [66.34505141027624]
We introduce WildTeaming, an automatic LLM safety red-teaming framework that mines in-the-wild user-chatbot interactions to discover 5.7K unique clusters of novel jailbreak tactics.
WildTeaming reveals previously unidentified vulnerabilities of frontier LLMs, resulting in up to 4.6x more diverse and successful adversarial attacks.
arXiv Detail & Related papers (2024-06-26T17:31:22Z)
- Python Fuzzing for Trustworthy Machine Learning Frameworks [0.0]
We propose a dynamic analysis pipeline for Python projects using Sydr-Fuzz.
Our pipeline includes fuzzing, corpus minimization, crash triaging, and coverage collection.
To identify the most vulnerable parts of machine learning frameworks, we analyze their potential attack surfaces and develop fuzz targets for PyTorch and related projects such as h5py.
arXiv Detail & Related papers (2024-03-19T13:41:11Z)
- HasTEE+ : Confidential Cloud Computing and Analytics with Haskell [50.994023665559496]
Confidential computing enables the protection of confidential code and data in a co-tenanted cloud deployment using specialized hardware isolation units called Trusted Execution Environments (TEEs).
TEEs offer low-level C/C++-based toolchains that are susceptible to inherent memory safety vulnerabilities and lack language constructs to monitor explicit and implicit information-flow leaks.
We address the above with HasTEE+, a domain-specific language (DSL) embedded in Haskell that enables programming TEEs in a high-level language with strong type safety.
arXiv Detail & Related papers (2024-01-17T00:56:23Z)
- Multi-Granularity Detector for Vulnerability Fixes [13.653249890867222]
We propose MiDas (Multi-Granularity Detector for Vulnerability Fixes) to identify vulnerability-fixing commits.
MiDas constructs different neural networks for each level of code change granularity, corresponding to commit-level, file-level, hunk-level, and line-level.
MiDas outperforms the current state-of-the-art baseline in terms of AUC by 4.9% and 13.7% on Java-based and Python-based datasets, respectively.
arXiv Detail & Related papers (2023-05-23T10:06:28Z)
- CodeLMSec Benchmark: Systematically Evaluating and Finding Security Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z)
- Pre-trained Encoders in Self-Supervised Learning Improve Secure and Privacy-preserving Supervised Learning [63.45532264721498]
Self-supervised learning is an emerging technique to pre-train encoders using unlabeled data.
We perform the first systematic, principled measurement study to understand whether and when a pre-trained encoder can address the limitations of secure or privacy-preserving supervised learning algorithms.
arXiv Detail & Related papers (2022-12-06T21:35:35Z)
- Automated Mapping of Vulnerability Advisories onto their Fix Commits in Open Source Repositories [7.629717457706326]
We present an approach that combines practical experience with machine learning (ML).
An advisory record containing key information about a vulnerability is extracted from each advisory.
A subset of candidate fix commits is obtained from the source code repository of the affected project.
arXiv Detail & Related papers (2021-03-24T17:50:35Z)
- SafePILCO: a software tool for safe and data-efficient policy synthesis [67.17251247987187]
SafePILCO is a software tool for safe and data-efficient policy search with reinforcement learning.
It extends the known PILCO algorithm, originally written in MATLAB, to support safe learning.
arXiv Detail & Related papers (2020-08-07T17:17:30Z)