RedSage: A Cybersecurity Generalist LLM
- URL: http://arxiv.org/abs/2601.22159v1
- Date: Thu, 29 Jan 2026 18:59:57 GMT
- Title: RedSage: A Cybersecurity Generalist LLM
- Authors: Naufal Suryanto, Muzammal Naseer, Pengfei Li, Syed Talal Wasim, Jinhui Yi, Juergen Gall, Paolo Ceravolo, Ernesto Damiani
- Abstract summary: RedSage is an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. Its corpus is curated via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. RedSage is evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and on general LLM benchmarks to assess broader generalization.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cybersecurity operations demand assistant LLMs that support diverse workflows without exposing sensitive data. Existing solutions either rely on proprietary APIs with privacy risks or on open models lacking domain adaptation. To bridge this gap, we curate 11.8B tokens of cybersecurity-focused continual pretraining data via large-scale web filtering and manual collection of high-quality resources, spanning 28.6K documents across frameworks, offensive techniques, and security tools. Building on this, we design an agentic augmentation pipeline that simulates expert workflows to generate 266K multi-turn cybersecurity samples for supervised fine-tuning. Combined with general open-source LLM data, these resources enable the training of RedSage, an open-source, locally deployable cybersecurity assistant with domain-aware pretraining and post-training. To rigorously evaluate the models, we introduce RedSage-Bench, a benchmark with 30K multiple-choice and 240 open-ended Q&A items covering cybersecurity knowledge, skills, and tool expertise. RedSage is further evaluated on established cybersecurity benchmarks (e.g., CTI-Bench, CyberMetric, SECURE) and general LLM benchmarks to assess broader generalization. At the 8B scale, RedSage achieves consistently better results, surpassing the baseline models by up to +5.59 points on cybersecurity benchmarks and +5.05 points on Open LLM Leaderboard tasks. These findings demonstrate that domain-aware agentic augmentation and pre/post-training can not only enhance cybersecurity-specific expertise but also help to improve general reasoning and instruction-following. All models, datasets, and code are publicly available.
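The curation pipeline described in the abstract hinges on filtering web-scale text down to security-relevant documents before continual pretraining. A minimal sketch of one way such a filter could work; the keyword list, scoring rule, and threshold below are illustrative assumptions, not the paper's actual pipeline:

```python
# Hypothetical keyword-coverage filter for curating a cybersecurity corpus.
# The term set and threshold are illustrative, not from the paper.
CYBER_TERMS = {"exploit", "vulnerability", "malware", "cve", "phishing", "payload"}

def cyber_score(text: str) -> float:
    """Fraction of domain keywords that appear in the document."""
    tokens = set(text.lower().split())
    return len(CYBER_TERMS & tokens) / len(CYBER_TERMS)

def filter_corpus(docs: list[str], threshold: float = 0.3) -> list[str]:
    """Keep documents whose keyword coverage meets the threshold."""
    return [d for d in docs if cyber_score(d) >= threshold]

docs = [
    "A new CVE describes a vulnerability with a remote exploit chain.",
    "Today's recipe uses fresh basil and ripe tomatoes.",
]
print(len(filter_corpus(docs)))  # 1
```

A production pipeline would likely replace the keyword heuristic with a trained domain classifier and add deduplication and quality scoring, but the structure — score each document, keep those above a cutoff — is the same.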
Related papers
- Llama-3.1-FoundationAI-SecurityLLM-Reasoning-8B Technical Report [48.22833523688154]
We present Foundation-Sec-8B-Reasoning, the first open-source native reasoning model for cybersecurity. The model is trained through a two-stage process combining supervised fine-tuning (SFT) and reinforcement learning from verifiable rewards (RLVR).
arXiv Detail & Related papers (2026-01-28T21:15:24Z) - CTIArena: Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence [48.63397742510097]
Cyber threat intelligence (CTI) is central to modern cybersecurity, providing critical insights for detecting and mitigating evolving threats. With the natural language understanding and reasoning capabilities of large language models (LLMs), there is increasing interest in applying them to CTI. We present CTIArena, the first benchmark for evaluating LLM performance on heterogeneous, multi-source CTI.
arXiv Detail & Related papers (2025-10-13T22:10:17Z) - CyberSOCEval: Benchmarking LLMs Capabilities for Malware Analysis and Threat Intelligence Reasoning [1.3863707631653515]
Cyber defenders are overwhelmed by a deluge of security alerts, threat intelligence signals, and shifting business context. Existing evaluations do not fully assess the scenarios most relevant to real-world defenders. We introduce CyberSOCEval, a new suite of open source benchmarks within CyberSecEval 4.
arXiv Detail & Related papers (2025-09-24T14:33:07Z) - Less Data, More Security: Advancing Cybersecurity LLMs Specialization via Resource-Efficient Domain-Adaptive Continuous Pre-training with Minimal Tokens [1.2116854758481395]
Domain-Adaptive Continuous Pretraining (DAP) is a methodology for enhancing cybersecurity understanding in large language models (LLMs). Three decoder-based architectures were adapted using a curated 126-million-word cybersecurity corpus drawn from standards, academic literature, and various other sources. The Llama-3.3-70B-Ins-DAP model achieved state-of-the-art accuracies of 0.718, 0.933, and 0.864 on the respective evaluation benchmarks, outperforming specialized models.
arXiv Detail & Related papers (2025-06-30T12:59:29Z) - CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale [45.97598662617568]
We introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. We show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
arXiv Detail & Related papers (2025-06-03T07:35:14Z) - Llama-3.1-FoundationAI-SecurityLLM-Base-8B Technical Report [50.268821168513654]
We present Foundation-Sec-8B, a cybersecurity-focused large language model (LLM) built on the Llama 3.1 architecture. We evaluate it across both established and new cybersecurity benchmarks, showing that it matches Llama 3.1-70B and GPT-4o-mini in certain cybersecurity-specific tasks. By releasing our model to the public, we aim to accelerate progress and adoption of AI-driven tools in both public and private cybersecurity contexts.
arXiv Detail & Related papers (2025-04-28T08:41:12Z) - The Digital Cybersecurity Expert: How Far Have We Come? [49.89857422097055]
We develop CSEBenchmark, a fine-grained cybersecurity evaluation framework based on 345 knowledge points expected of cybersecurity experts. We evaluate 12 popular large language models (LLMs) on CSEBenchmark and find that even the best-performing model achieves only 85.42% overall accuracy. By identifying and addressing specific knowledge gaps in each LLM, we achieve up to an 84% improvement in correcting previously incorrect predictions.
arXiv Detail & Related papers (2025-04-16T05:36:28Z) - How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities [62.474732677086855]
Large language model (LLM) routing has emerged as a crucial strategy for balancing computational costs with performance. We propose the DSC benchmark: Diverse, Simple, and Categorized, an evaluation framework that categorizes router performance across a broad spectrum of query types.
arXiv Detail & Related papers (2025-03-20T19:52:30Z) - CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning [2.549390156222399]
The integration of large language models into cyber security applications presents both opportunities and critical safety risks. We introduce CyberLLMInstruct, a dataset of 54,928 pseudo-malicious instruction-response pairs spanning cyber security tasks.
arXiv Detail & Related papers (2025-03-12T12:29:27Z) - Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training [1.4027589547318844]
Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. We present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation. Continual pre-training on our dataset yields a 15.9% improvement in the aggregate score, while reasoning distillation leads to a 15.8% gain in security certification.
arXiv Detail & Related papers (2025-02-16T16:34:49Z) - Large Language Models for Cyber Security: A Systematic Literature Review [17.073186844004148]
Large Language Models (LLMs) have opened up new opportunities for leveraging artificial intelligence in a variety of application domains, including cybersecurity. LLMs are being applied to an expanding range of cybersecurity tasks, including vulnerability detection, malware analysis, and network intrusion detection. A significant emerging trend is the use of LLM-based autonomous agents, which represent a paradigm shift from single-task execution to orchestrating complex, multi-step security operations.
arXiv Detail & Related papers (2024-05-08T02:09:17Z)