A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
- URL: http://arxiv.org/abs/2602.15689v2
- Date: Wed, 18 Feb 2026 16:42:07 GMT
- Title: A Content-Based Framework for Cybersecurity Refusal Decisions in Large Language Models
- Authors: Noa Linder, Meirav Segal, Omer Antverg, Gil Gekker, Tomer Fichman, Omri Bodenheimer, Edan Maor, Omer Nevo,
- Abstract summary: We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit. We introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit.
- Score: 0.9603139911465765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models and LLM-based agents are increasingly used for cybersecurity tasks that are inherently dual-use. Existing approaches to refusal, spanning academic policy frameworks and commercially deployed systems, often rely on broad topic-based bans or offensive-focused taxonomies. As a result, they can yield inconsistent decisions, over-restrict legitimate defenders, and behave brittlely under obfuscation or request segmentation. We argue that effective refusal requires explicitly modeling the trade-off between offensive risk and defensive benefit, rather than relying solely on intent or offensive classification. In this paper, we introduce a content-based framework for designing and auditing cyber refusal policies that makes offense-defense tradeoffs explicit. The framework characterizes requests along five dimensions: Offensive Action Contribution, Offensive Risk, Technical Complexity, Defensive Benefit, and Expected Frequency for Legitimate Users, grounded in the technical substance of the request rather than stated intent. We demonstrate that this content-grounded approach resolves inconsistencies in current frontier model behavior and allows organizations to construct tunable, risk-aware refusal policies.
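The five dimensions named in the abstract can be pictured as a small record attached to each request, with a tunable policy applied on top. The following is only a minimal sketch under stated assumptions: the abstract does not say how the dimensions are scored or combined, so the 0 to 1 scales, the multiplicative aggregation, and the names `RequestProfile`, `RefusalPolicy`, `risk_weight`, `benefit_weight`, and `refuse_above` are all hypothetical and are not the authors' method.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class RequestProfile:
    """Scores for one request on the five dimensions named in the abstract.

    ASSUMPTION: each dimension is scored on a 0.0-1.0 scale; the abstract
    does not specify any scale.
    """
    offensive_action_contribution: float
    offensive_risk: float
    technical_complexity: float
    defensive_benefit: float
    expected_legitimate_frequency: float

    def __post_init__(self) -> None:
        # Reject scores outside the assumed 0.0-1.0 range.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{f.name} must be in [0, 1], got {value}")


@dataclass(frozen=True)
class RefusalPolicy:
    """Hypothetical tunable policy; all three knobs are illustrative."""
    risk_weight: float = 1.0
    benefit_weight: float = 1.0
    refuse_above: float = 0.25

    def net_risk(self, p: RequestProfile) -> float:
        # ASSUMPTION: offense and defense terms are simple products.
        offense = (p.offensive_action_contribution
                   * p.offensive_risk
                   * p.technical_complexity)
        defense = p.defensive_benefit * p.expected_legitimate_frequency
        return self.risk_weight * offense - self.benefit_weight * defense

    def decide(self, p: RequestProfile) -> str:
        # Refuse only when weighted offense outweighs weighted defense
        # by more than the configured margin.
        return "refuse" if self.net_risk(p) > self.refuse_above else "assist"


# Made-up example scores, for illustration only.
exploit_request = RequestProfile(0.9, 0.9, 0.8, 0.2, 0.1)
hardening_request = RequestProfile(0.1, 0.2, 0.3, 0.9, 0.8)

default_policy = RefusalPolicy()
print(default_policy.decide(exploit_request))
print(default_policy.decide(hardening_request))
```

In this sketch the decision depends only on the scored content of the request, not on stated intent, and an organization could move `refuse_above` or the two weights to make the policy stricter or more permissive without rescoring requests.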
Related papers
- Defensive Refusal Bias: How Safety Alignment Fails Cyber Defenders [1.038167357593269]
Safety alignment in large language models (LLMs) primarily focuses on preventing misuse. We study Defensive Refusal Bias -- the tendency of safety-tuned frontier LLMs to refuse assistance for authorized defensive cybersecurity tasks. Highest refusal rates occur in the most operationally critical tasks: system hardening (43.8%) and malware analysis (34.3%).
arXiv Detail & Related papers (2026-03-01T19:53:19Z)
- YuFeng-XGuard: A Reasoning-Centric, Interpretable, and Flexible Guardrail Model for Large Language Models [36.084240131323824]
We present YuFeng-XGuard, a reasoning-centric guardrail model family for large language models (LLMs). Instead of producing opaque binary judgments, YuFeng-XGuard generates structured risk predictions, including explicit risk categories and confidence scores. We introduce a dynamic policy mechanism that decouples risk perception from policy enforcement, allowing safety policies to be adjusted without model retraining.
arXiv Detail & Related papers (2026-01-22T02:23:18Z)
- SafeRedir: Prompt Embedding Redirection for Robust Unlearning in Image Generation Models [67.84174763413178]
We introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. We show that SafeRedir achieves effective unlearning capability, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks.
arXiv Detail & Related papers (2026-01-13T15:01:38Z)
- Learning to Extract Context for Context-Aware LLM Inference [60.376872353918394]
User prompts to large language models (LLMs) are often ambiguous or under-specified. Contextual cues shaped by user intentions, prior knowledge, and risk factors influence what constitutes an appropriate response. We propose a framework that extracts and leverages such contextual information from the user prompt itself.
arXiv Detail & Related papers (2025-12-12T19:10:08Z)
- KG-DF: A Black-box Defense Framework against Jailbreak Attacks Based on Knowledge Graphs [22.335638814557004]
We propose a Knowledge Graph Defense Framework (KG-DF) for large language models (LLMs). Because of its structured knowledge representation and semantic association capabilities, a Knowledge Graph (KG) can be searched by associating input content with safe knowledge in the knowledge base. We introduce a semantic parsing module, whose core task is to transform the input query into a set of structured and secure concept representations.
arXiv Detail & Related papers (2025-11-09T14:39:40Z)
- RAG Security and Privacy: Formalizing the Threat Model and Attack Surface [4.823988025629304]
Retrieval-Augmented Generation (RAG) is an emerging approach in natural language processing that combines large language models (LLMs) with external document retrieval to produce more accurate and grounded responses. Existing research has demonstrated that RAGs can leak sensitive information through training data memorization or adversarial prompts, and RAG systems inherit many of these vulnerabilities. Despite these risks, there is currently no formal framework that defines the threat landscape for RAG systems.
arXiv Detail & Related papers (2025-09-24T17:11:35Z)
- Evaluating Language Model Reasoning about Confidential Information [95.64687778185703]
We study whether language models exhibit contextual robustness, or the capability to adhere to context-dependent safety specifications. We develop a benchmark (PasswordEval) that measures whether language models can correctly determine when a user request is authorized. We find that current open- and closed-source models struggle with this seemingly simple task, and that, perhaps surprisingly, reasoning capabilities do not generally improve performance.
arXiv Detail & Related papers (2025-08-27T15:39:46Z)
- Effective Red-Teaming of Policy-Adherent Agents [10.522087614181745]
Task-oriented LLM-based agents are increasingly used in domains with strict policies, such as refund eligibility or cancellation rules. We propose a novel threat model that focuses on adversarial users aiming to exploit policy-adherent agents for personal benefit. We present CRAFT, a multi-agent red-teaming system that leverages policy-aware persuasive strategies to undermine a policy-adherent agent in a customer-service scenario.
arXiv Detail & Related papers (2025-06-11T10:59:47Z)
- Reformulation is All You Need: Addressing Malicious Text Features in DNNs [53.45564571192014]
We propose a unified and adaptive defense framework that is effective against both adversarial and backdoor attacks. Our framework outperforms existing sample-oriented defense baselines across a diverse range of malicious textual features.
arXiv Detail & Related papers (2025-02-02T03:39:43Z)
- Deliberative Alignment: Reasoning Enables Safer Language Models [64.60765108418062]
We introduce Deliberative Alignment, a new paradigm that teaches the model safety specifications and trains it to explicitly recall and accurately reason over the specifications before answering. We used this approach to align OpenAI's o-series models, and achieved highly precise adherence to OpenAI's safety policies, without requiring human-written chain-of-thoughts or answers.
arXiv Detail & Related papers (2024-12-20T21:00:11Z)
- From Mean to Extreme: Formal Differential Privacy Bounds on the Success of Real-World Data Reconstruction Attacks [54.25638567385662]
Differential Privacy in machine learning is often interpreted as guarantees against membership inference. Translating DP budgets into quantitative protection against the more damaging threat of data reconstruction remains a challenging open problem. This paper bridges the critical gap by deriving the first formal privacy bounds tailored to the mechanics of demonstrated "from-scratch" attacks.
arXiv Detail & Related papers (2024-02-20T09:52:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.