OSPtrack: A Labeled Dataset Targeting Simulated Execution of Open-Source Software
- URL: http://arxiv.org/abs/2411.14829v2
- Date: Thu, 28 Nov 2024 10:17:05 GMT
- Title: OSPtrack: A Labeled Dataset Targeting Simulated Execution of Open-Source Software
- Authors: Zhuoran Tan, Christos Anagnosstopoulos, Jeremy Singer,
- Abstract summary: This dataset includes 9,461 package reports, of which 1,962 are identified as malicious.
The dataset includes both static and dynamic features such as files, sockets, commands, and DNS records.
This dataset supports runtime detection, enhances detection model training, and enables efficient comparative analysis across ecosystems.
- Score: 0.0
- License:
- Abstract: Open-source software serves as a foundation for the internet and the cyber supply chain, but its exploitation is becoming increasingly prevalent. While advances in vulnerability detection for OSS have been significant, prior research has largely focused on static code analysis, often neglecting runtime indicators. To address this shortfall, we created a comprehensive dataset spanning five ecosystems, capturing features generated during the execution of packages and libraries in isolated environments. The dataset includes 9,461 package reports, of which 1,962 are identified as malicious, and encompasses both static and dynamic features such as files, sockets, commands, and DNS records. Each report is labeled with verified information and detailed sub-labels for attack types, facilitating the identification of malicious indicators when source code is unavailable. This dataset supports runtime detection, enhances detection model training, and enables efficient comparative analysis across ecosystems, contributing to the strengthening of supply chain security.
Related papers
- Tracking Down Software Cluster Bombs: A Current State Analysis of the Free/Libre and Open Source Software (FLOSS) Ecosystem [0.43981305860983705]
This study provides a summary of the current state of available FLOSS package repositories.
It addresses the challenge of identifying problematic areas within a software ecosystem.
The results indicate that while there are well-maintained projects within the FLOSS ecosystem, there are also high-impact projects that are susceptible to supply chain attacks.
arXiv Detail & Related papers (2025-02-12T08:57:57Z) - A Novel Approach to Network Traffic Analysis: the HERA tool [0.0]
Cybersecurity threats highlight the need for robust network intrusion detection systems.
These systems rely heavily on datasets to train machine learning models capable of detecting patterns and predicting threats.
HERA is a new open-source tool that generates flow files and labelled or unlabelled datasets with user-defined features.
arXiv Detail & Related papers (2025-01-13T16:47:52Z) - DFEPT: Data Flow Embedding for Enhancing Pre-Trained Model Based Vulnerability Detection [7.802093464108404]
We propose a data flow embedding technique to enhance the performance of pre-trained models in vulnerability detection tasks.
Specifically, we parse data flow graphs from function-level source code, and use the data type of the variable as the node characteristics of the DFG.
Our research shows that DFEPT can provide effective vulnerability semantic information to pre-trained models, achieving an accuracy of 64.97% on the Devign dataset and an F1-Score of 47.9% on the Reveal dataset.
arXiv Detail & Related papers (2024-10-24T07:05:07Z) - Cross-domain Learning Framework for Tracking Users in RIS-aided Multi-band ISAC Systems with Sparse Labeled Data [55.70071704247794]
Integrated sensing and communications (ISAC) is pivotal for 6G communications and is boosted by the rapid development of reconfigurable intelligent surfaces (RISs)
This paper proposes the X2Track framework, where we model the tracking function by a hierarchical architecture, jointly utilizing multi-modal CSI indicators across multiple bands, and optimize it in a cross-domain manner.
Under X2Track, we design an efficient deep learning algorithm to minimize tracking errors, based on transformer neural networks and adversarial learning techniques.
arXiv Detail & Related papers (2024-05-10T08:04:27Z) - Profile of Vulnerability Remediations in Dependencies Using Graph
Analysis [40.35284812745255]
This research introduces graph analysis methods and a modified Graph Attention Convolutional Neural Network (GAT) model.
We analyze control flow graphs to profile breaking changes in applications occurring from dependency upgrades intended to remediate vulnerabilities.
Results demonstrate the effectiveness of the enhanced GAT model in offering nuanced insights into the relational dynamics of code vulnerabilities.
arXiv Detail & Related papers (2024-03-08T02:01:47Z) - Neural Relation Graph: A Unified Framework for Identifying Label Noise
and Outlier Data [44.64190826937705]
We present scalable algorithms for detecting label errors and outlier data based on the relational graph structure of data.
We also introduce a visualization tool that provides contextual information of a data point in the feature-embedded space.
Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in large-scale real-world datasets.
arXiv Detail & Related papers (2023-01-29T02:09:13Z) - VELVET: a noVel Ensemble Learning approach to automatically locate
VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak
Supervision [63.08516384181491]
We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts.
Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect.
Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
arXiv Detail & Related papers (2021-11-02T15:16:08Z) - Software Vulnerability Detection via Deep Learning over Disaggregated
Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z) - Multi-Source Data Fusion for Cyberattack Detection in Power Systems [1.8914160585516038]
We show that fusing information from multiple data sources can help identify cyber-induced incidents and reduce false positives.
We perform multi-source data fusion for training IDS in a cyber-physical power system testbed.
Results are presented using the proposed data fusion application to infer False Data and Command injection-based Man-in- The-Middle attacks.
arXiv Detail & Related papers (2021-01-18T06:34:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.