Related papers: Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering

Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering

URL: http://arxiv.org/abs/2505.15805v1
Date: Wed, 21 May 2025 17:58:11 GMT
Title: Keep Security! Benchmarking Security Policy Preservation in Large Language Model Contexts Against Indirect Attacks in Question Answering
Authors: Hwan Chang, Yumin Kim, Yonghyun Jun, Hwanhee Lee,
Abstract summary: Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government.<n>We introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering.<n>We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information.
Score: 3.6152232645741025
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As Large Language Models (LLMs) are increasingly deployed in sensitive domains such as enterprise and government, ensuring that they adhere to user-defined security policies within context is critical-especially with respect to information non-disclosure. While prior LLM studies have focused on general safety and socially sensitive data, large-scale benchmarks for contextual security preservation against attacks remain lacking. To address this, we introduce a novel large-scale benchmark dataset, CoPriva, evaluating LLM adherence to contextual non-disclosure policies in question answering. Derived from realistic contexts, our dataset includes explicit policies and queries designed as direct and challenging indirect attacks seeking prohibited information. We evaluate 10 LLMs on our benchmark and reveal a significant vulnerability: many models violate user-defined policies and leak sensitive information. This failure is particularly severe against indirect attacks, highlighting a critical gap in current LLM safety alignment for sensitive applications. Our analysis reveals that while models can often identify the correct answer to a query, they struggle to incorporate policy constraints during generation. In contrast, they exhibit a partial ability to revise outputs when explicitly prompted. Our findings underscore the urgent need for more robust methods to guarantee contextual security.

Related papers

A Survey on Data Security in Large Language Models [12.23432845300652]
Large Language Models (LLMs) are a foundation in advancing natural language processing, power applications such as text generation, machine translation, and conversational systems.<n>Despite their transformative potential, these models inherently rely on massive amounts of training data, often collected from diverse and uncurated sources, which exposes them to serious data security risks.<n>Harmful or malicious data can compromise model behavior, leading to issues such as toxic output, hallucinations, and vulnerabilities to threats such as prompt injection or data poisoning.<n>This survey offers a comprehensive overview of the main data security risks facing LLMs and reviews current defense strategies, including adversarial
arXiv Detail & Related papers (2025-08-04T11:28:34Z)
Multi-Stage Prompt Inference Attacks on Enterprise LLM Systems [18.039444159491733]
Large Language Models (LLMs) deployed in enterprise settings face novel security challenges.<n>One critical threat is prompt inference attacks: adversaries chain together seemingly benign prompts to gradually extract confidential data.<n>We present a comprehensive study of multi-stage prompt inference attacks in an enterprise LLM context.
arXiv Detail & Related papers (2025-07-21T13:38:12Z)
SoK: Semantic Privacy in Large Language Models [24.99241770349404]
This paper introduces a lifecycle-centric framework to analyze semantic privacy risks across input processing, pretraining, fine-tuning, and alignment stages of Large Language Models (LLMs)<n>We categorize key attack vectors and assess how current defenses, such as differential privacy, embedding encryption, edge computing, and unlearning, address these threats.<n>We conclude by outlining open challenges, including quantifying semantic leakage, protecting multimodal inputs, balancing de-identification with generation quality, and ensuring transparency in privacy enforcement.
arXiv Detail & Related papers (2025-06-30T08:08:15Z)
ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models [60.28667314609623]
Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications.<n>We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
arXiv Detail & Related papers (2025-06-17T10:55:17Z)
LLM Access Shield: Domain-Specific LLM Framework for Privacy Policy Compliance [2.2022550150705804]
Large language models (LLMs) are increasingly applied in fields such as finance, education, and governance.<n>We propose a security framework to enforce policy compliance and mitigate risks in LLM interactions.
arXiv Detail & Related papers (2025-05-22T07:30:37Z)
Commercial LLM Agents Are Already Vulnerable to Simple Yet Dangerous Attacks [88.84977282952602]
A high volume of recent ML security literature focuses on attacks against aligned large language models (LLMs)<n>In this paper, we analyze security and privacy vulnerabilities that are unique to LLM agents.<n>We conduct a series of illustrative attacks on popular open-source and commercial agents, demonstrating the immediate practical implications of their vulnerabilities.
arXiv Detail & Related papers (2025-02-12T17:19:36Z)
Membership Inference Attack against Long-Context Large Language Models [8.788010048413188]
We argue that integrating all information into the long context makes it a repository of sensitive information. We propose six membership inference attack strategies tailored for LCLMs. We examine the underlying reasons why LCLMs are susceptible to revealing such membership information.
arXiv Detail & Related papers (2024-11-18T09:50:54Z)
Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress. Our investigation exposes a critical oversight in this belief. By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z)
ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy. It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
Security and Privacy Challenges of Large Language Models: A Survey [2.6986500640871482]
Large Language Models (LLMs) have demonstrated extraordinary capabilities and contributed to multiple fields, such as generating and summarizing text, language translation, and question-answering. These models are also vulnerable to security and privacy attacks, such as jailbreaking attacks, data poisoning attacks, and Personally Identifiable Information (PII) leakage attacks. This survey provides a thorough review of the security and privacy challenges of LLMs for both training data and users, along with the application-based risks in various domains, such as transportation, education, and healthcare.
arXiv Detail & Related papers (2024-01-30T04:00:54Z)
Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models [79.0183835295533]
We introduce the first benchmark for indirect prompt injection attacks, named BIPIA, to assess the risk of such vulnerabilities.<n>Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.<n>We propose two novel defense mechanisms-boundary awareness and explicit reminder-to address these vulnerabilities in both black-box and white-box settings.
arXiv Detail & Related papers (2023-12-21T01:08:39Z)
A Security Risk Taxonomy for Prompt-Based Interaction With Large Language Models [5.077431021127288]
This paper addresses a gap in current research by focusing on security risks posed by large language models (LLMs) Our work proposes a taxonomy of security risks along the user-model communication pipeline and categorizes the attacks by target and attack type alongside the commonly used confidentiality, integrity, and availability (CIA) triad.
arXiv Detail & Related papers (2023-11-19T20:22:05Z)
Privacy in Large Language Models: Attacks, Defenses and Future Directions [84.73301039987128]
We analyze the current privacy attacks targeting large language models (LLMs) and categorize them according to the adversary's assumed capabilities. We present a detailed overview of prominent defense strategies that have been developed to counter these privacy attacks.
arXiv Detail & Related papers (2023-10-16T13:23:54Z)

This list is automatically generated from the titles and abstracts of the papers in this site.