SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations
- URL: http://arxiv.org/abs/2601.07835v1
- Date: Mon, 12 Jan 2026 18:59:45 GMT
- Title: SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations
- Authors: Mohammed Himayath Ali, Mohammed Aqib Abdullah, Mohammed Mudassir Uddin, Shahnawaz Alam,
- Abstract summary: This paper introduces SecureCAI, a novel defense framework extending Constitutional AI principles with security-aware guardrails.<n>SecureCAI reduces attack success rates by 94.7% compared to baseline models.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models have emerged as transformative tools for Security Operations Centers, enabling automated log analysis, phishing triage, and malware explanation; however, deployment in adversarial cybersecurity environments exposes critical vulnerabilities to prompt injection attacks where malicious instructions embedded in security artifacts manipulate model behavior. This paper introduces SecureCAI, a novel defense framework extending Constitutional AI principles with security-aware guardrails, adaptive constitution evolution, and Direct Preference Optimization for unlearning unsafe response patterns, addressing the unique challenges of high-stakes security contexts where traditional safety mechanisms prove insufficient against sophisticated adversarial manipulation. Experimental evaluation demonstrates that SecureCAI reduces attack success rates by 94.7% compared to baseline models while maintaining 95.1% accuracy on benign security analysis tasks, with the framework incorporating continuous red-teaming feedback loops enabling dynamic adaptation to emerging attack strategies and achieving constitution adherence scores exceeding 0.92 under sustained adversarial pressure, thereby establishing a foundation for trustworthy integration of language model capabilities into operational cybersecurity workflows and addressing a critical gap in current approaches to AI safety within adversarial domains.
Related papers
- Co-RedTeam: Orchestrated Security Discovery and Exploitation with LLM Agents [57.49020237126194]
Large language models (LLMs) have shown promise in assisting cybersecurity tasks, yet existing approaches struggle with automatic vulnerability discovery and exploitation.<n>We propose Co-RedTeam, a security-aware multi-agent framework designed to mirror real-world red-teaming.<n>Co-RedTeam decomposes vulnerability analysis into coordinated discovery and exploitation stages, enabling agents to plan, execute, validate, and refine actions.
arXiv Detail & Related papers (2026-02-02T14:38:45Z) - ORCA -- An Automated Threat Analysis Pipeline for O-RAN Continuous Development [57.61878484176942]
Open-Radio Access Network (O-RAN) integrates numerous software components in a cloud-like deployment, opening the radio access network to previously unconsidered security threats.<n>Current vulnerability assessment practices often rely on manual, labor-intensive, and subjective investigations, leading to inconsistencies in the threat analysis.<n>We propose an automated pipeline that leverages Natural Language Processing (NLP) to minimize human intervention and associated biases.
arXiv Detail & Related papers (2026-01-20T07:31:59Z) - A Call to Action for a Secure-by-Design Generative AI Paradigm [0.0]
Large language models (LLMs) are vulnerable to prompt injection and other adversarial attacks.<n>This paper introduces PromptShield, a framework that ensures deterministic and secure prompt interactions.<n>Our results demonstrate a significant improvement in model security and performance, achieving precision, recall, and F1 scores of approximately 94%.
arXiv Detail & Related papers (2025-10-01T03:05:07Z) - SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs [37.82193156438782]
This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process.<n>We propose a novel multi-agent framework SafeEvalAgent, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark.<n>Our experiments demonstrate the effectiveness of SafeEvalAgent, showing a consistent decline in model safety as the evaluation hardens.
arXiv Detail & Related papers (2025-09-30T11:20:41Z) - Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs)<n>SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO)<n>We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z) - Thought Purity: A Defense Framework For Chain-of-Thought Attack [16.56580534764132]
We propose Thought Purity, a framework that strengthens resistance to malicious content while preserving operational efficacy.<n>Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems.
arXiv Detail & Related papers (2025-07-16T15:09:13Z) - PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training [0.5439020425819]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, yet they pose significant security risks that threaten their safe deployment in critical domains.<n>This paper presents a novel PRM-free security alignment framework that leverages automated red teaming and adversarial training to achieve robust security guarantees while maintaining computational efficiency.
arXiv Detail & Related papers (2025-07-14T17:41:12Z) - Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense [90.71884758066042]
Large vision-language models (LVLMs) introduce a unique vulnerability: susceptibility to malicious attacks via visual inputs.<n>We propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism.
arXiv Detail & Related papers (2025-03-14T17:39:45Z) - CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning [2.549390156222399]
The integration of large language models into cyber security applications presents both opportunities and critical safety risks.<n>We introduce CyberLLMInstruct, a dataset of 54,928 pseudo-malicious instruction-response pairs spanning cyber security tasks.
arXiv Detail & Related papers (2025-03-12T12:29:27Z) - AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement [73.0700818105842]
We introduce AISafetyLab, a unified framework and toolkit that integrates representative attack, defense, and evaluation methodologies for AI safety.<n> AISafetyLab features an intuitive interface that enables developers to seamlessly apply various techniques.<n>We conduct empirical studies on Vicuna, analyzing different attack and defense strategies to provide valuable insights into their comparative effectiveness.
arXiv Detail & Related papers (2025-02-24T02:11:52Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process.<n>CPO enhances the model's perception of the safety status of given dialogues.<n>Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - ASSERT: Automated Safety Scenario Red Teaming for Evaluating the
Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection.
We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance.
We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.