eyeballvul: a future-proof benchmark for vulnerability detection in the wild
- URL: http://arxiv.org/abs/2407.08708v2
- Date: Sat, 13 Jul 2024 10:44:58 GMT
- Title: eyeballvul: a future-proof benchmark for vulnerability detection in the wild
- Authors: Timothee Chauvin,
- Abstract summary: eyeballvul is a benchmark designed to test the vulnerability detection capabilities of language models at scale.
It is sourced and updated weekly from the stream of published vulnerabilities in open-source repositories.
eyeballvul contains 24,000+ vulnerabilities across 6,000+ revisions and 5,000+ repositories, and is around 55GB in size.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Long contexts of recent LLMs have enabled a new use case: asking models to find security vulnerabilities in entire codebases. To evaluate model performance on this task, we introduce eyeballvul: a benchmark designed to test the vulnerability detection capabilities of language models at scale, that is sourced and updated weekly from the stream of published vulnerabilities in open-source repositories. The benchmark consists of a list of revisions in different repositories, each associated with the list of known vulnerabilities present at that revision. An LLM-based scorer is used to compare the list of possible vulnerabilities returned by a model to the list of known vulnerabilities for each revision. As of July 2024, eyeballvul contains 24,000+ vulnerabilities across 6,000+ revisions and 5,000+ repositories, and is around 55GB in size.
Related papers
- Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses.
Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives.
The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z) - Cybersecurity Defenses: Exploration of CVE Types through Attack Descriptions [1.0474508494260908]
VULDAT is a classification tool using a sentence transformer MPNET to identify system vulnerabilities from attack descriptions.
Our model was applied to 100 attack techniques from the ATT&CK repository and 685 issues from the CVE repository.
Our findings indicate that our model achieves the best performance with F1 score of 0.85, Precision of 0.86, and Recall of 0.83.
arXiv Detail & Related papers (2024-07-09T11:08:35Z) - WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia [59.96425443250666]
Retrieval-augmented generation (RAG) has emerged as a promising solution to mitigate the limitations of large language models (LLMs)
In this work, we conduct a comprehensive evaluation of LLM-generated answers to questions based on contradictory passages from Wikipedia.
We benchmark a diverse range of both closed and open-source LLMs under different QA scenarios, including RAG with a single passage, and RAG with 2 contradictory passages.
arXiv Detail & Related papers (2024-06-19T20:13:42Z) - VersiCode: Towards Version-controllable Code Generation [58.82709231906735]
We introduce VersiCode, the first comprehensive dataset designed to assess the ability of large language models to generate verifiable code for specific library versions.
We design two dedicated evaluation tasks: version-specific code completion (VSCC) and version-aware code editing (VACE)
Comprehensive experiments are conducted to benchmark the performance of LLMs, revealing the challenging nature of these tasks and VersiCode.
arXiv Detail & Related papers (2024-06-11T16:15:06Z) - VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models [12.465060623389151]
This study introduces a new benchmark, VulDetectBench, to assess the vulnerability detection capabilities of Large Language Models (LLMs)
The benchmark comprehensively evaluates LLM's ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty.
Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security.
arXiv Detail & Related papers (2024-06-11T13:42:57Z) - A Comprehensive Study of the Capabilities of Large Language Models for Vulnerability Detection [9.422811525274675]
Large Language Models (LLMs) have demonstrated great potential for code generation and other software engineering tasks.
Vulnerability detection is of crucial importance to maintaining the security, integrity, and trustworthiness of software systems.
Recent work has applied LLMs to vulnerability detection using generic prompting techniques, but their capabilities for this task and the types of errors they make remain unclear.
arXiv Detail & Related papers (2024-03-25T21:47:36Z) - Do-Not-Answer: A Dataset for Evaluating Safeguards in LLMs [59.596335292426105]
This paper collects the first open-source dataset to evaluate safeguards in large language models.
We train several BERT-like classifiers to achieve results comparable with GPT-4 on automatic safety evaluation.
arXiv Detail & Related papers (2023-08-25T14:02:12Z) - On the Security Blind Spots of Software Composition Analysis [46.1389163921338]
We present a novel approach to detect vulnerable clones in the Maven repository.
We retrieve over 53k potential vulnerable clones from Maven Central.
We detect 727 confirmed vulnerable clones and synthesize a testable proof-of-vulnerability project for each of those.
arXiv Detail & Related papers (2023-06-08T20:14:46Z) - Transformer-based Vulnerability Detection in Code at EditTime:
Zero-shot, Few-shot, or Fine-tuning? [5.603751223376071]
We present a practical system that leverages deep learning on a large-scale data set of vulnerable code patterns.
We show that in comparison with state of the art vulnerability detection models our approach improves the state of the art by 10%.
arXiv Detail & Related papers (2023-05-23T01:21:55Z) - CodeLMSec Benchmark: Systematically Evaluating and Finding Security
Vulnerabilities in Black-Box Code Language Models [58.27254444280376]
Large language models (LLMs) for automatic code generation have achieved breakthroughs in several programming tasks.
Training data for these models is usually collected from the Internet (e.g., from open-source repositories) and is likely to contain faults and security vulnerabilities.
This unsanitized training data can cause the language models to learn these vulnerabilities and propagate them during the code generation procedure.
arXiv Detail & Related papers (2023-02-08T11:54:07Z) - CVEfixes: Automated Collection of Vulnerabilities and Their Fixes from
Open-Source Software [0.0]
We implement a fully automated dataset collection tool and share an initial release of the resulting vulnerability dataset named CVEfixes.
The dataset is enriched with meta-data such as programming language, and detailed code and security metrics at five levels of abstraction.
CVEfixes supports various types of data-driven software security research, such as vulnerability prediction, vulnerability classification, vulnerability severity prediction, analysis of vulnerability-related code changes, and automated vulnerability repair.
arXiv Detail & Related papers (2021-07-19T11:34:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.