Related papers: SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations

URL: http://arxiv.org/abs/2601.07835v1
Date: Mon, 12 Jan 2026 18:59:45 GMT
Title: SecureCAI: Injection-Resilient LLM Assistants for Cybersecurity Operations
Authors: Mohammed Himayath Ali, Mohammed Aqib Abdullah, Mohammed Mudassir Uddin, Shahnawaz Alam
Abstract summary: This paper introduces SecureCAI, a novel defense framework extending Constitutional AI principles with security-aware guardrails. SecureCAI reduces attack success rates by 94.7% compared to baseline models.
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models have emerged as transformative tools for Security Operations Centers, enabling automated log analysis, phishing triage, and malware explanation; however, deployment in adversarial cybersecurity environments exposes critical vulnerabilities to prompt injection attacks where malicious instructions embedded in security artifacts manipulate model behavior. This paper introduces SecureCAI, a novel defense framework extending Constitutional AI principles with security-aware guardrails, adaptive constitution evolution, and Direct Preference Optimization for unlearning unsafe response patterns, addressing the unique challenges of high-stakes security contexts where traditional safety mechanisms prove insufficient against sophisticated adversarial manipulation. Experimental evaluation demonstrates that SecureCAI reduces attack success rates by 94.7% compared to baseline models while maintaining 95.1% accuracy on benign security analysis tasks, with the framework incorporating continuous red-teaming feedback loops enabling dynamic adaptation to emerging attack strategies and achieving constitution adherence scores exceeding 0.92 under sustained adversarial pressure, thereby establishing a foundation for trustworthy integration of language model capabilities into operational cybersecurity workflows and addressing a critical gap in current approaches to AI safety within adversarial domains.

Related papers

Co-RedTeam: Orchestrated Security Discovery and Exploitation with LLM Agents [57.49020237126194]
Large language models (LLMs) have shown promise in assisting cybersecurity tasks, yet existing approaches struggle with automatic vulnerability discovery and exploitation. We propose Co-RedTeam, a security-aware multi-agent framework designed to mirror real-world red-teaming. Co-RedTeam decomposes vulnerability analysis into coordinated discovery and exploitation stages, enabling agents to plan, execute, validate, and refine actions.
arXiv Detail & Related papers (2026-02-02T14:38:45Z)
ORCA -- An Automated Threat Analysis Pipeline for O-RAN Continuous Development [57.61878484176942]
Open-Radio Access Network (O-RAN) integrates numerous software components in a cloud-like deployment, opening the radio access network to previously unconsidered security threats. Current vulnerability assessment practices often rely on manual, labor-intensive, and subjective investigations, leading to inconsistencies in the threat analysis. We propose an automated pipeline that leverages Natural Language Processing (NLP) to minimize human intervention and associated biases.
arXiv Detail & Related papers (2026-01-20T07:31:59Z)
A Call to Action for a Secure-by-Design Generative AI Paradigm [0.0]
Large language models (LLMs) are vulnerable to prompt injection and other adversarial attacks. This paper introduces PromptShield, a framework that ensures deterministic and secure prompt interactions. Our results demonstrate a significant improvement in model security and performance, achieving precision, recall, and F1 scores of approximately 94%.
arXiv Detail & Related papers (2025-10-01T03:05:07Z)
SafeEvalAgent: Toward Agentic and Self-Evolving Safety Evaluation of LLMs [37.82193156438782]
This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process. We propose a novel multi-agent framework, SafeEvalAgent, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. Our experiments demonstrate the effectiveness of SafeEvalAgent, showing a consistent decline in model safety as the evaluation hardens.
arXiv Detail & Related papers (2025-09-30T11:20:41Z)
Secure Tug-of-War (SecTOW): Iterative Defense-Attack Training with Reinforcement Learning for Multimodal Model Security [63.41350337821108]
We propose Secure Tug-of-War (SecTOW) to enhance the security of multimodal large language models (MLLMs). SecTOW consists of two modules: a defender and an auxiliary attacker, both trained iteratively using reinforcement learning (GRPO). We show that SecTOW significantly improves security while preserving general performance.
arXiv Detail & Related papers (2025-07-29T17:39:48Z)
Thought Purity: A Defense Framework For Chain-of-Thought Attack [16.56580534764132]
We propose Thought Purity, a framework that strengthens resistance to malicious content while preserving operational efficacy. Our approach establishes the first comprehensive defense mechanism against CoTA vulnerabilities in reinforcement learning-aligned reasoning systems.
arXiv Detail & Related papers (2025-07-16T15:09:13Z)
PRM-Free Security Alignment of Large Models via Red Teaming and Adversarial Training [0.5439020425819]
Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse applications, yet they pose significant security risks that threaten their safe deployment in critical domains. This paper presents a novel PRM-free security alignment framework that leverages automated red teaming and adversarial training to achieve robust security guarantees while maintaining computational efficiency.
arXiv Detail & Related papers (2025-07-14T17:41:12Z)
Tit-for-Tat: Safeguarding Large Vision-Language Models Against Jailbreak Attacks via Adversarial Defense [90.71884758066042]
Large vision-language models (LVLMs) introduce a unique vulnerability: susceptibility to malicious attacks via visual inputs. We propose ESIII (Embedding Security Instructions Into Images), a novel methodology for transforming the visual space from a source of vulnerability into an active defense mechanism.
arXiv Detail & Related papers (2025-03-14T17:39:45Z)
CyberLLMInstruct: A Pseudo-malicious Dataset Revealing Safety-performance Trade-offs in Cyber Security LLM Fine-tuning [2.549390156222399]
The integration of large language models into cyber security applications presents both opportunities and critical safety risks. We introduce CyberLLMInstruct, a dataset of 54,928 pseudo-malicious instruction-response pairs spanning cyber security tasks.
arXiv Detail & Related papers (2025-03-12T12:29:27Z)
AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement [73.0700818105842]
We introduce AISafetyLab, a unified framework and toolkit that integrates representative attack, defense, and evaluation methodologies for AI safety. AISafetyLab features an intuitive interface that enables developers to seamlessly apply various techniques. We conduct empirical studies on Vicuna, analyzing different attack and defense strategies to provide valuable insights into their comparative effectiveness.
arXiv Detail & Related papers (2025-02-24T02:11:52Z)
Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [54.10710423370126]
We propose Reasoning-to-Defend (R2D), a training paradigm that integrates a safety-aware reasoning mechanism into Large Language Models' generation process. CPO enhances the model's perception of the safety status of given dialogues. Experiments demonstrate that R2D effectively mitigates various attacks and improves overall safety, while maintaining the original performances.
arXiv Detail & Related papers (2025-02-18T15:48:46Z)
ASSERT: Automated Safety Scenario Red Teaming for Evaluating the Robustness of Large Language Models [65.79770974145983]
ASSERT, Automated Safety Scenario Red Teaming, consists of three methods -- semantically aligned augmentation, target bootstrapping, and adversarial knowledge injection. We partition our prompts into four safety domains for a fine-grained analysis of how the domain affects model performance. We find statistically significant performance differences of up to 11% in absolute classification accuracy among semantically related scenarios and error rates of up to 19% absolute error in zero-shot adversarial settings.
arXiv Detail & Related papers (2023-10-14T17:10:28Z)

This list is automatically generated from the titles and abstracts of the papers on this site.