EvalSVA: Multi-Agent Evaluators for Next-Gen Software Vulnerability Assessment
- URL: http://arxiv.org/abs/2501.14737v1
- Date: Wed, 11 Dec 2024 08:00:50 GMT
- Title: EvalSVA: Multi-Agent Evaluators for Next-Gen Software Vulnerability Assessment
- Authors: Xin-Cheng Wen, Jiaxin Ye, Cuiyun Gao, Lianwei Wu, Qing Liao
- Abstract summary: We introduce EvalSVA, a team of multi-agent evaluators that autonomously deliberates on and evaluates various aspects of software vulnerability (SV) assessment. EvalSVA offers a human-like process and generates both reasoning and answers for SV assessment.
- Score: 17.74561647070259
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Software Vulnerability (SV) assessment is a crucial process of determining different aspects of SVs (e.g., attack vectors and scope) so that developers can effectively prioritize vulnerability mitigation efforts. It is a challenging and laborious process due to the complexity of SVs and the scarcity of labeled data. To mitigate these challenges, we introduce EvalSVA, a team of multi-agent evaluators that autonomously deliberates on and evaluates various aspects of SV assessment. Specifically, we propose a multi-agent-based framework that simulates real-world vulnerability assessment strategies by integrating multiple Large Language Models (LLMs) into a group, enhancing the effectiveness of SV assessment under limited data. We also design diverse communication strategies through which the agents autonomously discuss and assess different aspects of SVs. Furthermore, we construct a multi-lingual SV assessment dataset based on the new CVSS standard, comprising 699, 888, and 1,310 vulnerability-related commits in C++, Python, and Java, respectively. Our experimental results demonstrate that EvalSVA outperforms previous methods by an average of 44.12% in accuracy and 43.29% in F1 for SV assessment. EvalSVA offers a human-like process and generates both reasoning and answers for SV assessment; it can also aid human experts by providing more explanation and detail during assessment.
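The abstract sketches EvalSVA's core mechanism: several LLM evaluators exchange opinions on each CVSS aspect before converging on an answer. The snippet below is a minimal, illustrative sketch of such a discuss-then-vote loop for a single CVSS metric (Attack Vector); the agent callables, prompts, and voting rule are assumptions made here for illustration, not the paper's actual implementation or communication strategies.

```python
# Illustrative sketch of a multi-agent CVSS assessment loop in the spirit of EvalSVA.
# The agent roles, prompt wording, and majority-vote aggregation are assumptions,
# not the framework's actual design.
from collections import Counter
from typing import Callable

CVSS_ATTACK_VECTOR = ["Network", "Adjacent", "Local", "Physical"]

def assess_attack_vector(
    commit_diff: str,
    agents: list[Callable[[str], str]],
    rounds: int = 2,
) -> str:
    """Each agent proposes a rating, sees the others' opinions, and revises."""
    opinions = ["(no opinion yet)"] * len(agents)
    for _ in range(rounds):
        new_opinions = []
        for i, agent in enumerate(agents):
            others = "\n".join(o for j, o in enumerate(opinions) if j != i)
            prompt = (
                f"Commit diff:\n{commit_diff}\n\n"
                f"Other evaluators said:\n{others}\n\n"
                f"Choose the CVSS Attack Vector from {CVSS_ATTACK_VECTOR} "
                "and give a one-sentence reason."
            )
            new_opinions.append(agent(prompt))
        opinions = new_opinions
    # Aggregate by majority vote over the category each final answer mentions.
    votes = [
        next((c for c in CVSS_ATTACK_VECTOR if c.lower() in o.lower()), "Network")
        for o in opinions
    ]
    return Counter(votes).most_common(1)[0][0]

# Usage: plug in any LLM-backed callables, e.g. three differently prompted models.
# stub_agent = lambda prompt: "Network - the patched code parses remote input."
# assess_attack_vector(diff_text, [stub_agent, stub_agent, stub_agent])
```

In practice each callable would wrap a different LLM or prompt persona, and the framework's own communication strategies would determine what each agent sees from the others between rounds.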
Related papers
- HREF: Human Response-Guided Evaluation of Instruction Following in Language Models [61.273153125847166]
We develop a new evaluation benchmark, Human Response-Guided Evaluation of Instruction Following (HREF).
In addition to providing reliable evaluation, HREF emphasizes individual task performance and is free from contamination.
We study the impact of key design choices in HREF, including the size of the evaluation set, the judge model, the baseline model, and the prompt template.
arXiv Detail & Related papers (2024-12-20T03:26:47Z) - ASVspoof 5: Crowdsourced Speech Data, Deepfakes, and Adversarial Attacks at Scale [59.25180900687571]
ASVspoof 5 is the fifth edition in a series of challenges that promote the study of speech spoofing and deepfake attacks.
We describe the two challenge tracks, the new database, the evaluation metrics, and the evaluation platform, and present a summary of the results.
arXiv Detail & Related papers (2024-08-16T13:37:20Z) - Mitigating Data Imbalance for Software Vulnerability Assessment: Does Data Augmentation Help? [0.0]
We show that mitigating data imbalance can significantly improve the predictive performance of models for all the Common Vulnerability Scoring System (CVSS) tasks.
We also discover that simple text augmentation like combining random text insertion, deletion, and replacement can outperform the baseline across the board.
arXiv Detail & Related papers (2024-07-15T13:47:55Z) - Unveiling the Achilles' Heel of NLG Evaluators: A Unified Adversarial Framework Driven by Large Language Models [52.368110271614285]
We introduce AdvEval, a novel black-box adversarial framework against NLG evaluators.
AdvEval is specially tailored to generate data that yield strong disagreements between human and victim evaluators.
We conduct experiments on 12 victim evaluators and 11 NLG datasets, spanning tasks including dialogue, summarization, and question evaluation.
arXiv Detail & Related papers (2024-05-23T14:48:15Z) - Towards single integrated spoofing-aware speaker verification embeddings [63.42889348690095]
This study aims to develop a single, integrated spoofing-aware speaker verification (SASV) embedding.
Our analysis shows that the inferior performance of single SASV embeddings comes from an insufficient amount of training data.
Experiments show dramatic improvements, achieving a SASV-EER of 1.06% on the evaluation protocol of the SASV2022 challenge.
arXiv Detail & Related papers (2023-05-30T14:15:39Z) - Tackling Spoofing-Aware Speaker Verification with Multi-Model Fusion [88.34134732217416]
This work focuses on fusion-based SASV solutions and proposes a multi-model fusion framework to leverage the power of multiple state-of-the-art ASV and CM models.
The proposed framework vastly improves the SASV-EER from 8.75% to 1.17%, an 86% relative improvement over the best baseline system in the SASV challenge.
arXiv Detail & Related papers (2022-06-18T06:41:06Z) - Design Guidelines for Inclusive Speaker Verification Evaluation Datasets [0.6015898117103067]
Speaker verification (SV) provides billions of voice-enabled devices with access control, and ensures the security of voice-driven technologies.
Current SV evaluation practices are insufficient for evaluating bias: they are over-simplified, aggregate across users, and are not representative of real-life usage scenarios.
This paper proposes design guidelines for constructing SV evaluation datasets that address these shortcomings.
arXiv Detail & Related papers (2022-04-05T15:28:26Z) - On the Use of Fine-grained Vulnerable Code Statements for Software Vulnerability Assessment Models [0.0]
We use large-scale data from 1,782 functions of 429 SVs in 200 real-world projects to develop Machine Learning models for function-level SV assessment tasks.
We show that vulnerable statements are 5.8 times smaller in size, yet exhibit 7.5-114.5% stronger assessment performance.
arXiv Detail & Related papers (2022-03-16T06:29:40Z) - DeepCVA: Automated Commit-level Vulnerability Assessment with Deep Multi-task Learning [0.0]
We propose a novel deep multi-task learning model, DeepCVA, to automate seven commit-level vulnerability assessment tasks simultaneously.
We conduct large-scale experiments on 1,229 vulnerability-contributing commits containing 542 different SVs in 246 real-world software projects.
DeepCVA is the best-performing model with 38% to 59.8% higher Matthews Correlation Coefficient than many supervised and unsupervised baseline models.
arXiv Detail & Related papers (2021-08-18T08:43:36Z) - A Survey on Data-driven Software Vulnerability Assessment and Prioritization [0.0]
Software Vulnerabilities (SVs) are increasing in complexity and scale, posing great security risks to many software systems.
Data-driven techniques such as Machine Learning and Deep Learning have taken SV assessment and prioritization to the next level.
arXiv Detail & Related papers (2021-07-18T04:49:22Z) - Tandem Assessment of Spoofing Countermeasures and Automatic Speaker Verification: Fundamentals [59.34844017757795]
The reliability of spoofing countermeasures (CMs) is gauged using the equal error rate (EER) metric.
This paper presents several new extensions to the tandem detection cost function (t-DCF).
It is hoped that adoption of the t-DCF for the CM assessment will help to foster closer collaboration between the anti-spoofing and ASV research communities.
arXiv Detail & Related papers (2020-07-12T12:44:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.