Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs
- URL: http://arxiv.org/abs/2508.09288v2
- Date: Mon, 18 Aug 2025 18:20:18 GMT
- Title: Can AI Keep a Secret? Contextual Integrity Verification: A Provable Security Architecture for LLMs
- Authors: Aayush Gupta
- Abstract summary: We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token. CIV provides deterministic, per-token non-interference guarantees on frozen models. We demonstrate drop-in protection for Llama-3-8B and Mistral-7B.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) remain acutely vulnerable to prompt injection and related jailbreak attacks; heuristic guardrails (rules, filters, LLM judges) are routinely bypassed. We present Contextual Integrity Verification (CIV), an inference-time security architecture that attaches cryptographically signed provenance labels to every token and enforces a source-trust lattice inside the transformer via a pre-softmax hard attention mask (with optional FFN/residual gating). CIV provides deterministic, per-token non-interference guarantees on frozen models: lower-trust tokens cannot influence higher-trust representations. On benchmarks derived from recent taxonomies of prompt-injection vectors (Elite-Attack + SoK-246), CIV attains 0% attack success rate under the stated threat model while preserving 93.1% token-level similarity and showing no degradation in model perplexity on benign tasks; we note a latency overhead attributable to a non-optimized data path. Because CIV is a lightweight patch -- no fine-tuning required -- we demonstrate drop-in protection for Llama-3-8B and Mistral-7B. We release a reference implementation, an automated certification harness, and the Elite-Attack corpus to support reproducible research.
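To make the core mechanism concrete, here is a minimal sketch of the pre-softmax hard attention mask: every token carries an integer trust level, and a query position may attend only to keys of equal or higher trust, so lower-trust tokens cannot influence higher-trust representations. The trust-level names, single-head shapes, and PyTorch code are illustrative assumptions, not the paper's reference implementation (which additionally signs provenance labels cryptographically and can gate FFN/residual paths).

```python
import torch
import torch.nn.functional as F

# Illustrative trust levels (assumed names, not the paper's label set);
# a higher value means a more trusted provenance source.
SYSTEM, USER, TOOL = 2, 1, 0

def trust_masked_attention(q, k, v, trust):
    """Single-head attention with a pre-softmax hard trust mask.

    q, k, v : (seq, d) tensors for one attention head.
    trust   : (seq,) integer trust level per token.
    A query at trust level t may only attend to keys with trust >= t,
    so lower-trust tokens cannot influence higher-trust representations.
    """
    d = q.size(-1)
    scores = (q @ k.T) / d ** 0.5                        # (seq, seq)
    # allowed[i, j] is True iff key j is at least as trusted as query i.
    allowed = trust.unsqueeze(1) <= trust.unsqueeze(0)   # (seq, seq)
    scores = scores.masked_fill(~allowed, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: two system tokens, one user token, one tool-injected token.
torch.manual_seed(0)
q = k = v = torch.randn(4, 8)
trust = torch.tensor([SYSTEM, SYSTEM, USER, TOOL])
out = trust_masked_attention(q, k, v, trust)
# Rows 0-1 (system) receive no contribution from rows 2-3 (lower trust).
```

In a full model this hard mask would be combined with the standard causal mask and applied at every layer and head of the frozen transformer.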
Related papers
- Trust The Typical [8.32740388004069]
We introduce Trust The Typical (T3), a framework that operationalizes this principle by treating safety as an out-of-distribution (OOD) detection problem. T3 learns the distribution of acceptable prompts in a semantic space and flags any significant deviation as a potential threat. A single model trained only on safe English text transfers effectively to diverse domains and over 14 languages without retraining.
arXiv Detail & Related papers (2026-02-04T14:06:46Z) - CIPHER: Cryptographic Insecurity Profiling via Hybrid Evaluation of Responses [0.0]
We introduce CIPHER, a benchmark for measuring cryptographic vulnerability incidence in Python code. CIPHER uses insecure/neutral/secure prompt variants per task, a cryptography-specific vulnerability taxonomy, and line-level attribution. We find that explicit secure prompting reduces some targeted issues but does not reliably eliminate cryptographic vulnerabilities overall.
arXiv Detail & Related papers (2026-02-01T21:06:54Z) - ReasAlign: Reasoning Enhanced Safety Alignment against Prompt Injection Attack [52.17935054046577]
We present ReasAlign, a model-level solution to improve safety alignment against indirect prompt injection attacks. ReasAlign incorporates structured reasoning steps to analyze user queries, detect conflicting instructions, and preserve the continuity of the user's intended tasks.
arXiv Detail & Related papers (2026-01-15T08:23:38Z) - The Trojan Knowledge: Bypassing Commercial LLM Guardrails via Harmless Prompt Weaving and Adaptive Tree Search [58.8834056209347]
Large language models (LLMs) remain vulnerable to jailbreak attacks that bypass safety guardrails to elicit harmful outputs. We introduce the Correlated Knowledge Attack Agent (CKA-Agent), a dynamic framework that reframes jailbreaking as an adaptive, tree-structured exploration of the target model's knowledge base.
arXiv Detail & Related papers (2025-12-01T07:05:23Z) - DiffuGuard: How Intrinsic Safety is Lost and Found in Diffusion Large Language Models [50.21378052667732]
We conduct an in-depth analysis of dLLM vulnerabilities to jailbreak attacks across two distinct dimensions: intra-step and inter-step dynamics. We propose DiffuGuard, a training-free defense framework that addresses vulnerabilities through a dual-stage approach.
arXiv Detail & Related papers (2025-09-29T05:17:10Z) - Towards Safeguarding LLM Fine-tuning APIs against Cipher Attacks [10.478976654618272]
Adversaries can exploit large language model fine-tuning APIs to bypass model safety mechanisms. We introduce the Cipher Fine-tuning Robustness benchmark (CIFR), a benchmark for evaluating defense strategies' ability to retain model safety. We show that probe monitors achieve over 99% detection accuracy, generalize to unseen cipher variants and families, and compare favorably to state-of-the-art monitoring approaches.
arXiv Detail & Related papers (2025-08-23T22:55:15Z) - Wolf Hidden in Sheep's Conversations: Toward Harmless Data-Based Backdoor Attacks for Jailbreaking Large Language Models [69.11679786018206]
Supervised fine-tuning (SFT) aligns large language models with human intent by training them on labeled task-specific data. Recent studies have shown that malicious attackers can inject backdoors into these models by embedding triggers into the harmful question-answer pairs. We propose a novel clean-data backdoor attack for jailbreaking LLMs.
arXiv Detail & Related papers (2025-05-23T08:13:59Z) - One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models [20.42976162135529]
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. We propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM.
arXiv Detail & Related papers (2025-05-12T01:26:50Z) - Towards Copyright Protection for Knowledge Bases of Retrieval-augmented Language Models via Reasoning [58.57194301645823]
Large language models (LLMs) are increasingly integrated into real-world personalized applications. The valuable and often proprietary nature of the knowledge bases used in RAG introduces the risk of unauthorized usage by adversaries. Existing methods that can be generalized as watermarking techniques to protect these knowledge bases typically involve poisoning or backdoor attacks. We propose a method for harmless copyright protection of knowledge bases.
arXiv Detail & Related papers (2025-02-10T09:15:56Z) - Towards Universal and Black-Box Query-Response Only Attack on LLMs with QROA [2.4578723416255754]
We introduce QROA, a black-box jailbreak method that identifies adversarial suffixes capable of jailbreaking a model when appended to a malicious instruction. Unlike existing suffix-based jailbreak approaches, QROA does not require access to the model's logits or any other internal information. We also propose QROA-UNV, an extension that identifies universal adversarial suffixes for individual models.
arXiv Detail & Related papers (2024-06-04T07:27:36Z) - Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z) - Certifying LLM Safety against Adversarial Prompting [70.96868018621167]
Large language models (LLMs) are vulnerable to adversarial attacks that add malicious tokens to an input prompt. We introduce erase-and-check, the first framework for defending against adversarial prompts with certifiable safety guarantees; a minimal sketch of the suffix-erasure variant appears after this list.
arXiv Detail & Related papers (2023-09-06T04:37:20Z)
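As a companion illustration, the erase-and-check scheme from the last entry above can be sketched compactly in its suffix-erasure mode: run a safety filter on the prompt and on every version with up to d trailing tokens erased, and flag the prompt if any check fires. If the filter catches the clean harmful prompt, it also catches that prompt with any adversarial suffix of length at most d appended, since one truncation recovers the clean prompt. The function names and keyword-based toy filter below are placeholders, not the paper's implementation.

```python
from typing import Callable, List

def erase_and_check_suffix(tokens: List[str],
                           is_harmful: Callable[[List[str]], bool],
                           max_erase: int = 20) -> bool:
    """Flag a prompt as harmful if the safety filter flags the prompt
    itself or any version with up to `max_erase` trailing tokens erased.

    If the filter catches a clean harmful prompt, it also catches that
    prompt with any adversarial suffix of length <= max_erase appended,
    because one of the truncations recovers the clean prompt.
    """
    for i in range(min(max_erase, len(tokens)) + 1):
        candidate = tokens if i == 0 else tokens[:-i]
        if is_harmful(candidate):
            return True
    return False

# Hypothetical usage: a keyword filter standing in for an LLM-based
# safety classifier.
def toy_filter(tokens: List[str]) -> bool:
    return "explosive" in tokens

prompt = "how to build an explosive zx]! adversarial suffix".split()
print(erase_and_check_suffix(prompt, toy_filter, max_erase=4))  # True
```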
This list is automatically generated from the titles and abstracts of the papers on this site.