OSPtrack: A Labeled Dataset Targeting Simulated Execution of Open-Source Software
- URL: http://arxiv.org/abs/2411.14829v2
- Date: Thu, 28 Nov 2024 10:17:05 GMT
- Title: OSPtrack: A Labeled Dataset Targeting Simulated Execution of Open-Source Software
- Authors: Zhuoran Tan, Christos Anagnostopoulos, Jeremy Singer
- Abstract summary: This dataset includes 9,461 package reports, of which 1,962 are identified as malicious. The dataset includes both static and dynamic features such as files, sockets, commands, and DNS records. This dataset supports runtime detection, enhances detection model training, and enables efficient comparative analysis across ecosystems.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Open-source software serves as a foundation for the internet and the cyber supply chain, but its exploitation is becoming increasingly prevalent. While advances in vulnerability detection for OSS have been significant, prior research has largely focused on static code analysis, often neglecting runtime indicators. To address this shortfall, we created a comprehensive dataset spanning five ecosystems, capturing features generated during the execution of packages and libraries in isolated environments. The dataset includes 9,461 package reports, of which 1,962 are identified as malicious, and encompasses both static and dynamic features such as files, sockets, commands, and DNS records. Each report is labeled with verified information and detailed sub-labels for attack types, facilitating the identification of malicious indicators when source code is unavailable. This dataset supports runtime detection, enhances detection model training, and enables efficient comparative analysis across ecosystems, contributing to the strengthening of supply chain security.
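The two counts in the abstract imply a noticeably imbalanced label distribution, which matters when training a detector on the reports. The following is a minimal sketch of that arithmetic, using only the figures quoted above; the function name `class_balance` and the inverse-frequency weighting scheme are illustrative assumptions, not something the paper prescribes.

```python
# Counts taken from the abstract; everything else here is illustrative.
TOTAL_REPORTS = 9_461
MALICIOUS_REPORTS = 1_962


def class_balance(total: int, malicious: int) -> dict:
    """Summarize the label split and derive simple per-class weights."""
    benign = total - malicious
    return {
        "benign": benign,
        "malicious": malicious,
        "malicious_fraction": malicious / total,
        # Inverse-frequency weights for two classes (an assumed, common
        # heuristic): the rarer class receives the larger weight.
        "weight_benign": total / (2 * benign),
        "weight_malicious": total / (2 * malicious),
    }


balance = class_balance(TOTAL_REPORTS, MALICIOUS_REPORTS)
print(f"benign reports: {balance['benign']}")
print(f"malicious share: {balance['malicious_fraction']:.1%}")
```

Roughly one report in five is malicious, so a classifier trained without reweighting or resampling may lean toward the benign class.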
Related papers
- Distributed Temporal Graph Learning with Provenance for APT Detection in Supply Chains [4.3627234063853955]
Advanced persistent threats (APTs) frequently leverage supply chain vulnerabilities (SCVs) as entry points.
Current defense strategies primarily focus on blockchain for integrity assurance or detection using plain-text source code analysis in open-source software (OSS).
We propose a novel approach that integrates multi-source data, constructs a comprehensive dynamic graph provenance, and detects APT behavior in real time using temporal graph learning.
arXiv Detail & Related papers (2025-04-03T06:42:26Z) - Enhancing Software Vulnerability Detection Using Code Property Graphs and Convolutional Neural Networks [0.0]
This paper proposes a novel approach to detecting software vulnerabilities using a combination of code property graphs and machine learning techniques.
We introduce various neural network models, including convolutional neural networks adapted for graph data, to process these representations.
Our contributions include a methodology for transforming software code into code property graphs, the implementation of a convolutional neural network model for graph data, and the creation of a comprehensive dataset for training and evaluation.
arXiv Detail & Related papers (2025-03-23T19:12:07Z) - Tracking Down Software Cluster Bombs: A Current State Analysis of the Free/Libre and Open Source Software (FLOSS) Ecosystem [0.43981305860983705]
This study provides a summary of the current state of available FLOSS package repositories.
It addresses the challenge of identifying problematic areas within a software ecosystem.
The results indicate that while there are well-maintained projects within the FLOSS ecosystem, there are also high-impact projects that are susceptible to supply chain attacks.
arXiv Detail & Related papers (2025-02-12T08:57:57Z) - A Novel Approach to Network Traffic Analysis: the HERA tool [0.0]
Cybersecurity threats highlight the need for robust network intrusion detection systems.
These systems rely heavily on datasets to train machine learning models capable of detecting patterns and predicting threats.
HERA is a new open-source tool that generates flow files and labelled or unlabelled datasets with user-defined features.
arXiv Detail & Related papers (2025-01-13T16:47:52Z) - Cross-domain Learning Framework for Tracking Users in RIS-aided Multi-band ISAC Systems with Sparse Labeled Data [55.70071704247794]
Integrated sensing and communications (ISAC) is pivotal for 6G communications and is boosted by the rapid development of reconfigurable intelligent surfaces (RISs).
This paper proposes the X2Track framework, where we model the tracking function by a hierarchical architecture, jointly utilizing multi-modal CSI indicators across multiple bands, and optimize it in a cross-domain manner.
Under X2Track, we design an efficient deep learning algorithm to minimize tracking errors, based on transformer neural networks and adversarial learning techniques.
arXiv Detail & Related papers (2024-05-10T08:04:27Z) - Characterising Payload Entropy in Packet Flows [0.0]
A key technique in early detection is the classification of unusual patterns of network behaviour.
We analyse several large packet datasets to establish baseline payload information entropy values for common network services.
We describe an efficient method for engineering entropy metrics when performing flow recovery from live or offline packet data.
arXiv Detail & Related papers (2024-04-29T21:38:39Z) - DONAPI: Malicious NPM Packages Detector using Behavior Sequence Knowledge Mapping [28.852274185512236]
npm is the most extensive package manager, hosting more than 2 million third-party open-source packages.
In this paper, we synchronize a local package cache containing more than 3.4 million packages in near real-time to give us access to more package code details.
We propose the DONAPI, an automatic malicious npm packages detector that combines static and dynamic analysis.
arXiv Detail & Related papers (2024-03-13T08:38:21Z) - Profile of Vulnerability Remediations in Dependencies Using Graph
Analysis [40.35284812745255]
This research introduces graph analysis methods and a modified Graph Attention Convolutional Neural Network (GAT) model.
We analyze control flow graphs to profile breaking changes in applications occurring from dependency upgrades intended to remediate vulnerabilities.
Results demonstrate the effectiveness of the enhanced GAT model in offering nuanced insights into the relational dynamics of code vulnerabilities.
arXiv Detail & Related papers (2024-03-08T02:01:47Z) - Anomaly Detection Dataset for Industrial Control Systems [1.2234742322758418]
Industrial Control Systems (ICSs) have been targeted by cyberattacks and are becoming increasingly vulnerable.
The lack of suitable datasets for evaluating Machine Learning algorithms is a challenge.
This paper presents the 'ICS-Flow' dataset, which offers network data and process state variables logs for supervised and unsupervised ML-based IDS assessment.
arXiv Detail & Related papers (2023-05-11T14:52:19Z) - Neural Relation Graph: A Unified Framework for Identifying Label Noise and Outlier Data [44.64190826937705]
We present scalable algorithms for detecting label errors and outlier data based on the relational graph structure of data.
We also introduce a visualization tool that provides contextual information of a data point in the feature-embedded space.
Our approach achieves state-of-the-art detection performance on all tasks considered and demonstrates its effectiveness in large-scale real-world datasets.
arXiv Detail & Related papers (2023-01-29T02:09:13Z) - Malicious Source Code Detection Using Transformer [0.0]
We introduce the Malicious Source code Detection using Transformers (MSDT) algorithm.
MSDT is a novel static analysis based on a deep learning method that detects real-world code injection cases to source code packages.
Our algorithm is capable of detecting functions that were injected with malicious code with precision@k values of up to 0.909.
arXiv Detail & Related papers (2022-09-16T14:16:50Z) - Towards Realistic Semi-Supervised Learning [73.59557447798134]
We propose a novel approach to tackle SSL in an open-world setting, where we simultaneously learn to classify known and unknown classes.
Our approach substantially outperforms the existing state-of-the-art on seven diverse datasets.
arXiv Detail & Related papers (2022-07-05T19:04:43Z) - VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code.
Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph.
VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z) - Extending the WILDS Benchmark for Unsupervised Adaptation [186.90399201508953]
We present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data.
These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities.
We systematically benchmark state-of-the-art methods that leverage unlabeled data, including domain-invariant, self-training, and self-supervised methods.
arXiv Detail & Related papers (2021-12-09T18:32:38Z) - LogLAB: Attention-Based Labeling of Log Data Anomalies via Weak Supervision [63.08516384181491]
We present LogLAB, a novel modeling approach for automated labeling of log messages without requiring manual work by experts.
Our method relies on estimated failure time windows provided by monitoring systems to produce precise labeled datasets in retrospect.
Our evaluation shows that LogLAB consistently outperforms nine benchmark approaches across three different datasets and maintains an F1-score of more than 0.98 even at large failure time windows.
arXiv Detail & Related papers (2021-11-02T15:16:08Z) - Software Vulnerability Detection via Deep Learning over Disaggregated Code Graph Representation [57.92972327649165]
This work explores a deep learning approach to automatically learn the insecure patterns from code corpora.
Because code naturally admits graph structures with parsing, we develop a novel graph neural network (GNN) to exploit both the semantic context and structural regularity of a program.
arXiv Detail & Related papers (2021-09-07T21:24:36Z) - Malicious Code Detection: Run Trace Output Analysis by LSTM [0.0]
We propose a methodological framework for detecting malicious code by analyzing run trace outputs using Long Short-Term Memory (LSTM).
We created our dataset from run trace outputs obtained from dynamic analysis of PE files.
Experiments showed that the ISM achieved an accuracy of 87.51% and a false positive rate of 18.34%, while BSM achieved an accuracy of 99.26% and a false positive rate of 2.62%.
arXiv Detail & Related papers (2021-01-14T15:00:42Z) - PyODDS: An End-to-end Outlier Detection System with Automated Machine Learning [55.32009000204512]
We present PyODDS, an automated end-to-end Python system for Outlier Detection with Database Support.
Specifically, we define the search space in the outlier detection pipeline, and produce a search strategy within the given search space.
It also provides unified interfaces and visualizations for users with or without data science or machine learning background.
arXiv Detail & Related papers (2020-03-12T03:30:30Z)