Related papers: SeBERTis: A Framework for Producing Classifiers of Security-Related Issue Reports

SeBERTis: A Framework for Producing Classifiers of Security-Related Issue Reports

URL: http://arxiv.org/abs/2512.15003v1
Date: Wed, 17 Dec 2025 01:23:11 GMT
Title: SeBERTis: A Framework for Producing Classifiers of Security-Related Issue Reports
Authors: Sogol Masoumzadeh, Yufei Li, Shane McIntosh, Dániel Varró, Lili Wei,
Abstract summary: SEBERTIS is a framework to train Deep Neural Networks (DNNs) as classifiers independent of lexical cues.<n>Our framework achieves a 0.9880 F1-score in detecting security-related issues of a curated corpus of 10,000 GitHub issue reports.
Score: 8.545800179148442
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Monitoring issue tracker submissions is a crucial software maintenance activity. A key goal is the prioritization of high risk, security-related bugs. If such bugs can be recognized early, the risk of propagation to dependent products and endangerment of stakeholder benefits can be mitigated. To assist triage engineers with this task, several automatic detection techniques, from Machine Learning (ML) models to prompting Large Language Models (LLMs), have been proposed. Although promising to some extent, prior techniques often memorize lexical cues as decision shortcuts, yielding low detection rate specifically for more complex submissions. As such, these classifiers do not yet reach the practical expectations of a real-time detector of security-related issues. To address these limitations, we propose SEBERTIS, a framework to train Deep Neural Networks (DNNs) as classifiers independent of lexical cues, so that they can confidently detect fully unseen security-related issues. SEBERTIS capitalizes on fine-tuning bidirectional transformer architectures as Masked Language Models (MLMs) on a series of semantically equivalent vocabulary to prediction labels (which we call Semantic Surrogates) when they have been replaced with a mask. Our SEBERTIS-trained classifier achieves a 0.9880 F1-score in detecting security-related issues of a curated corpus of 10,000 GitHub issue reports, substantially outperforming state-of-the-art issue classifiers, with 14.44%-96.98%, 15.40%-93.07%, and 14.90%-94.72% higher detection precision, recall, and F1-score over ML-based baselines. Our classifier also substantially surpasses LLM baselines, with an improvement of 23.20%-63.71%, 36.68%-85.63%, and 39.49%-74.53% for precision, recall, and F1-score.

Related papers

Favia: Forensic Agent for Vulnerability-fix Identification and Analysis [5.43098755190303]
We propose Favia, a forensic, agent-based framework for vulnerability-fix identification.<n>Favia combines scalable candidate ranking with deep and iterative semantic reasoning.<n>We evaluate Favia on CVEVC, a large-scale dataset we made that comprises over 8 million commits from 3,708 real-world repositories.
arXiv Detail & Related papers (2026-02-13T00:51:22Z)
VulAgent: Hypothesis-Validation based Multi-Agent Vulnerability Detection [55.957275374847484]
VulAgent is a multi-agent vulnerability detection framework based on hypothesis validation.<n>It implements a semantics-sensitive, multi-view detection pipeline, each aligned to a specific analysis perspective.<n>On average, VulAgent improves overall accuracy by 6.6%, increases the correct identification rate of vulnerable--fixed code pairs by up to 450%, and reduces the false positive rate by about 36%.
arXiv Detail & Related papers (2025-09-15T02:25:38Z)
Large Language Models Versus Static Code Analysis Tools: A Systematic Benchmark for Vulnerability Detection [0.0]
Three industry-standard rule-based static code-analysis tools (Sonar, CodeQL and Snyk Code) and three state-of-the-art large language models hosted on the GitHub Models platform (GPT-4.1, Mistral Large and DeepSeek V3) were evaluated.<n>Using a curated suite of ten real-world C# projects that embed 63 vulnerabilities, we measure classical accuracy (precision, recall, F-score), analysis latency, granularity and the developer effort required to vet true positives.<n>We recommend a hybrid pipeline: employ language models early in development for broad, context-aware detection and
arXiv Detail & Related papers (2025-08-06T13:48:38Z)
Are Sparse Autoencoders Useful for Java Function Bug Detection? [5.119371135458389]
Software vulnerabilities are a major source of security breaches.<n>Traditional methods for vulnerability detection are limited by high false positive rates, scalability issues, and reliance on manual effort.<n>Sparse Autoencoder offer a promising solution to this problem.
arXiv Detail & Related papers (2025-05-15T14:59:17Z)
SafeMLRM: Demystifying Safety in Multi-modal Large Reasoning Models [50.34706204154244]
Acquiring reasoning capabilities catastrophically degrades inherited safety alignment.<n>Certain scenarios suffer 25 times higher attack rates.<n>Despite tight reasoning-answer safety coupling, MLRMs demonstrate nascent self-correction.
arXiv Detail & Related papers (2025-04-09T06:53:23Z)
Palisade -- Prompt Injection Detection Framework [0.9620910657090188]
Large Language Models are vulnerable to malicious prompt injection attacks. This paper proposes a novel NLP based approach for prompt injection detection. It emphasizes accuracy and optimization through a layered input screening process.
arXiv Detail & Related papers (2024-10-28T15:47:03Z)
SeCodePLT: A Unified Platform for Evaluating the Security of Code GenAI [58.29510889419971]
Existing benchmarks for evaluating the security risks and capabilities of code-generating large language models (LLMs) face several key limitations.<n>We introduce a general and scalable benchmark construction framework that begins with manually validated, high-quality seed examples and expands them via targeted mutations.<n>Applying this framework to Python, C/C++, and Java, we build SeCodePLT, a dataset of more than 5.9k samples spanning 44 CWE-based risk categories and three security capabilities.
arXiv Detail & Related papers (2024-10-14T21:17:22Z)
Exploring Automatic Cryptographic API Misuse Detection in the Era of LLMs [60.32717556756674]
This paper introduces a systematic evaluation framework to assess Large Language Models in detecting cryptographic misuses. Our in-depth analysis of 11,940 LLM-generated reports highlights that the inherent instabilities in LLMs can lead to over half of the reports being false positives. The optimized approach achieves a remarkable detection rate of nearly 90%, surpassing traditional methods and uncovering previously unknown misuses in established benchmarks.
arXiv Detail & Related papers (2024-07-23T15:31:26Z)
MOSSBench: Is Your Multimodal Language Model Oversensitive to Safe Queries? [70.77691645678804]
Humans are prone to cognitive distortions -- biased thinking patterns that lead to exaggerated responses to specific stimuli. This paper demonstrates that advanced Multimodal Large Language Models (MLLMs) exhibit similar tendencies. We identify three types of stimuli that trigger the oversensitivity of existing MLLMs: Exaggerated Risk, Negated Harm, and Counterintuitive.
arXiv Detail & Related papers (2024-06-22T23:26:07Z)
SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal [64.9938658716425]
SORRY-Bench is a proposed benchmark for evaluating large language models' (LLMs) ability to recognize and reject unsafe user requests.<n>First, existing methods often use coarse-grained taxonomy of unsafe topics, and are over-representing some fine-grained topics.<n>Second, linguistic characteristics and formatting of prompts are often overlooked, like different languages, dialects, and more -- which are only implicitly considered in many evaluations.
arXiv Detail & Related papers (2024-06-20T17:56:07Z)
Understanding the Effectiveness of Large Language Models in Detecting Security Vulnerabilities [12.82645410161464]
We evaluate the effectiveness of 16 pre-trained Large Language Models on 5,000 code samples from five diverse security datasets. Overall, LLMs show modest effectiveness in detecting vulnerabilities, obtaining an average accuracy of 62.8% and F1 score of 0.71 across datasets. We find that advanced prompting strategies that involve step-by-step analysis significantly improve performance of LLMs on real-world datasets in terms of F1 score (by upto 0.18 on average)
arXiv Detail & Related papers (2023-11-16T13:17:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.