Related papers: FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction

FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction

URL: http://arxiv.org/abs/2506.18795v1
Date: Mon, 23 Jun 2025 16:03:16 GMT
Title: FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction
Authors: Jiachi Chen, Yiming Shen, Jiashuo Zhang, Zihao Li, John Grundy, Zhenzhe Shao, Yanlin Wang, Jiashui Wang, Ting Chen, Zibin Zheng,
Abstract summary: FORGE is the first automated approach for constructing smart contract vulnerability datasets.<n>We generate a dataset comprising 81,390 solidity files and 27,497 vulnerability findings across 296 CWE categories.<n>Results reveal the significant limitations in current detection capabilities.
Score: 34.20628333535654
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: High-quality smart contract vulnerability datasets are critical for evaluating security tools and advancing smart contract security research. Two major limitations of current manual dataset construction are (1) labor-intensive and error-prone annotation processes limiting the scale, quality, and evolution of the dataset, and (2) absence of standardized classification rules results in inconsistent vulnerability categories and labeling results across different datasets. To address these limitations, we present FORGE, the first automated approach for constructing smart contract vulnerability datasets. FORGE leverages an LLM-driven pipeline to extract high-quality vulnerabilities from real-world audit reports and classify them according to the CWE, the most widely recognized classification in software security. FORGE employs a divide-and-conquer strategy to extract structured and self-contained vulnerability information from these reports. Additionally, it uses a tree-of-thoughts technique to classify the vulnerability information into the hierarchical CWE classification. To evaluate FORGE's effectiveness, we run FORGE on 6,454 real-world audit reports and generate a dataset comprising 81,390 solidity files and 27,497 vulnerability findings across 296 CWE categories. Manual assessment of the dataset demonstrates high extraction precision and classification consistency with human experts (precision of 95.6% and inter-rater agreement k-$\alpha$ of 0.87). We further validate the practicality of our dataset by benchmarking 13 existing security tools on our dataset. The results reveal the significant limitations in current detection capabilities. Furthermore, by analyzing the severity-frequency distribution patterns through a unified CWE perspective in our dataset, we highlight inconsistency between current smart contract research focus and priorities identified from real-world vulnerabilities...

Related papers

Cross-Dataset Semantic Segmentation Performance Analysis: Unifying NIST Point Cloud City Datasets for 3D Deep Learning [49.1574468325115]
This study analyzes semantic segmentation performance across heterogeneously labeled point-cloud datasets relevant to public safety applications.<n>Key identified challenges include insufficient labeled data, difficulties in unifying class labels across datasets, and the need for standardization.
arXiv Detail & Related papers (2025-08-01T17:59:02Z)
DATABench: Evaluating Dataset Auditing in Deep Learning from an Adversarial Perspective [59.66984417026933]
We introduce a novel taxonomy, classifying existing methods based on their reliance on internal features (IF) (inherent to the data) versus external features (EF) (artificially introduced for auditing)<n>We formulate two primary attack types: evasion attacks, designed to conceal the use of a dataset, and forgery attacks, intending to falsely implicate an unused dataset.<n>Building on the understanding of existing methods and attack objectives, we further propose systematic attack strategies: decoupling, removal, and detection for evasion; adversarial example-based methods for forgery.<n>Our benchmark, DATABench, comprises 17 evasion attacks, 5 forgery attacks, and 9
arXiv Detail & Related papers (2025-07-08T03:07:15Z)
CyberGym: Evaluating AI Agents' Cybersecurity Capabilities with Real-World Vulnerabilities at Scale [46.76144797837242]
Large language model (LLM) agents are becoming increasingly skilled at handling cybersecurity tasks autonomously.<n>Existing benchmarks fall short, often failing to capture real-world scenarios or being limited in scope.<n>We introduce CyberGym, a large-scale and high-quality cybersecurity evaluation framework featuring 1,507 real-world vulnerabilities.
arXiv Detail & Related papers (2025-06-03T07:35:14Z)
SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection [8.440793630384546]
Large Language Models (LLMs) have shown promise in software engineering tasks.<n> evaluating their effectiveness in vulnerability detection is challenging due to the lack of high-quality datasets.<n>This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024.
arXiv Detail & Related papers (2025-05-26T11:06:03Z)
An Empirical Analysis of Vulnerability Detection Tools for Solidity Smart Contracts Using Line Level Manually Annotated Vulnerabilities [5.357551358237259]
This paper provides an empirical evaluation of automated vulnerability analysis tools specifically designed for Solidity smart contracts.<n>We conducted an assessment using an annotated dataset of 2,182 instances we manually annotated with line-level vulnerability labels.<n>We identified a set of 3 tools that, combined, achieve up to 76.78% found vulnerabilities taking less than one minute to run.
arXiv Detail & Related papers (2025-05-21T17:01:18Z)
Streamlining Security Vulnerability Triage with Large Language Models [0.786186571320448]
We present CASEY, a novel approach that automates the identification of Common Weaknessions (CWEs) of security bugs and assesses their severity.<n>Casey achieved a CWE identification accuracy of 68%, a severity identification accuracy of 73.6%, and a combined accuracy of 51.2%.
arXiv Detail & Related papers (2025-01-31T06:02:24Z)
CTINexus: Automatic Cyber Threat Intelligence Knowledge Graph Construction Using Large Language Models [49.657358248788945]
Textual descriptions in cyber threat intelligence (CTI) reports are rich sources of knowledge about cyber threats.<n>Current CTI knowledge extraction methods lack flexibility and generalizability.<n>We propose CTINexus, a novel framework for data-efficient CTI knowledge extraction and high-quality cybersecurity knowledge graph (CSKG) construction.
arXiv Detail & Related papers (2024-10-28T14:18:32Z)
Vulnerability Detection with Code Language Models: How Far Are We? [40.455600722638906]
PrimeVul is a new dataset for training and evaluating code LMs for vulnerability detection. It incorporates a novel set of data labeling techniques that achieve comparable label accuracy to human-verified benchmarks. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues.
arXiv Detail & Related papers (2024-03-27T14:34:29Z)
Model Stealing Attack against Graph Classification with Authenticity, Uncertainty and Diversity [80.16488817177182]
GNNs are vulnerable to the model stealing attack, a nefarious endeavor geared towards duplicating the target model via query permissions. We introduce three model stealing attacks to adapt to different actual scenarios.
arXiv Detail & Related papers (2023-12-18T05:42:31Z)
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets. Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets. We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by upto 0.18 on average)
arXiv Detail & Related papers (2023-11-16T13:17:20Z)
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements [62.93814803258067]
This paper presents VELVET, a novel ensemble learning approach to locate vulnerable statements in source code. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph. VELVET achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively.
arXiv Detail & Related papers (2021-12-20T22:45:27Z)

This list is automatically generated from the titles and abstracts of the papers in this site.