Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models
- URL: http://arxiv.org/abs/2407.08969v1
- Date: Fri, 12 Jul 2024 03:33:13 GMT
- Title: Detect Llama -- Finding Vulnerabilities in Smart Contracts using Large Language Models
- Authors: Peter Ince, Xiapu Luo, Jiangshan Yu, Joseph K. Liu, Xiaoning Du
- Abstract summary: We fine-tune open-source models to outperform GPT-4 in smart contract vulnerability detection.
For binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of 0.776 and 0.68, outperforming both GPT-4 and GPT-4 Turbo.
For the evaluation against individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperformed GPT-4 and GPT-4 Turbo.
- Score: 27.675558033502565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we test the hypothesis that although OpenAI's GPT-4 performs well generally, we can fine-tune open-source models to outperform GPT-4 in smart contract vulnerability detection. Using a dataset of 17k prompts, we fine-tune two models from Meta's Code Llama, Detect Llama - Foundation and Detect Llama - Instruct, and we also fine-tune OpenAI's GPT-3.5 Turbo model (GPT-3.5FT). We then evaluate these models, plus a random baseline, against GPT-4 and GPT-4 Turbo on a test set we develop, measuring detection of eight vulnerability types from the dataset and of the two most frequently identified vulnerabilities, along with their weighted F1 scores. We find that for binary classification (i.e., is this smart contract vulnerable?), our two best-performing models, GPT-3.5FT and Detect Llama - Foundation, achieve F1 scores of $0.776$ and $0.68$, outperforming both GPT-4 ($0.66$) and GPT-4 Turbo ($0.675$). For individual vulnerability identification, our top two models, GPT-3.5FT and Detect Llama - Foundation, both significantly outperformed GPT-4 and GPT-4 Turbo, both in weighted F1 over all vulnerabilities ($0.61$ and $0.56$ respectively, against GPT-4's $0.218$ and GPT-4 Turbo's $0.243$) and in weighted F1 over the top two identified vulnerabilities ($0.719$ for GPT-3.5FT and $0.674$ for Detect Llama - Foundation, against GPT-4's $0.363$ and GPT-4 Turbo's $0.429$).
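The abstract reports support-weighted F1 scores for multi-class vulnerability identification. As a minimal sketch of how that metric is computed (the labels and values below are illustrative toy data, not the paper's results), a per-class F1 can be averaged with each class weighted by its support:

```python
from collections import Counter

def weighted_f1(y_true, y_pred):
    """Support-weighted F1: per-class F1 averaged with weights
    proportional to each class's count in y_true."""
    support = Counter(y_true)
    total = 0.0
    for cls, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        total += n * f1
    return total / len(y_true)

# Toy binary labels: 0 = "not vulnerable", 1 = "vulnerable"
print(weighted_f1([0, 0, 0, 1, 1], [0, 0, 1, 1, 0]))  # 0.6
```

This matches the `average="weighted"` behavior of scikit-learn's `f1_score`, which is the conventional way such scores are computed.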
Related papers
- SIEVE: General Purpose Data Filtering System Matching GPT-4o Accuracy at 1% the Cost [8.406910685074134]
SIEVE is a lightweight filter that matches GPT-4o accuracy at a fraction of the cost.
We experimentally validate SIEVE on the OpenWebText dataset, using five highly customized filter tasks.
Our results demonstrate the effectiveness and efficiency of our method in curating large, high-quality datasets for language model training.
arXiv Detail & Related papers (2024-10-03T17:58:29Z)
- Detection Made Easy: Potentials of Large Language Models for Solidity Vulnerabilities [2.6499018693213316]
This paper presents a comprehensive investigation of the use of large language models (LLMs) and their capabilities in detecting Top Ten vulnerabilities in Solidity.
We introduce a novel, class-balanced, structured, and labeled dataset named VulSmart, which we use to benchmark and compare the performance of open-source LLMs like CodeLlama, Llama2, CodeT5 and Falcon, alongside closed-source models like GPT-3.5 Turbo and GPT-4o Mini.
Our findings reveal that SmartVD outperforms its open-source counterparts and even exceeds the performance of closed-source base models like GPT-3.5 and GPT-4o Mini.
arXiv Detail & Related papers (2024-09-15T13:16:58Z)
- Unveiling the Safety of GPT-4o: An Empirical Study using Jailbreak Attacks [65.84623493488633]
This paper conducts a rigorous evaluation of GPT-4o against jailbreak attacks.
The newly introduced audio modality opens up new attack vectors for jailbreak attacks on GPT-4o.
Existing black-box multimodal jailbreak attack methods are largely ineffective against GPT-4o and GPT-4V.
arXiv Detail & Related papers (2024-06-10T14:18:56Z)
- GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition? [82.40761196684524]
This paper centers on the evaluation of GPT-4's linguistic and visual capabilities in zero-shot visual recognition tasks.
We conduct extensive experiments to evaluate GPT-4's performance across images, videos, and point clouds.
Our findings show that GPT-4, enhanced with rich linguistic descriptions, significantly improves zero-shot recognition.
arXiv Detail & Related papers (2023-11-27T11:29:10Z)
- ExtractGPT: Exploring the Potential of Large Language Models for Product Attribute Value Extraction [52.14681890859275]
E-commerce platforms require structured product data in the form of attribute-value pairs.
BERT-based extraction methods require large amounts of task-specific training data.
This paper explores using large language models (LLMs) as a more training-data efficient and robust alternative.
arXiv Detail & Related papers (2023-10-19T07:39:00Z)
- GPTScan: Detecting Logic Vulnerabilities in Smart Contracts by Combining GPT with Program Analysis [26.081673382969615]
We propose GPTScan, the first tool combining GPT with static analysis for smart contract logic vulnerability detection.
By breaking down each logic vulnerability type into scenarios and properties, GPTScan matches candidate vulnerabilities with GPT.
It effectively detects ground-truth logic vulnerabilities with a recall of over 70%, including 9 new vulnerabilities missed by human auditors.
arXiv Detail & Related papers (2023-08-07T05:48:53Z)
- SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning [5.177947445379688]
This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis.
Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification.
The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations.
arXiv Detail & Related papers (2023-07-16T05:33:35Z)
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.