Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
- URL: http://arxiv.org/abs/2411.17693v1
- Date: Tue, 26 Nov 2024 18:58:20 GMT
- Title: Adaptive Deployment of Untrusted LLMs Reduces Distributed Threats
- Authors: Jiaxin Wen, Vivek Hebbar, Caleb Larson, Aryan Bhatt, Ansh Radhakrishnan, Mrinank Sharma, Henry Sleight, Shi Feng, He He, Ethan Perez, Buck Shlegeris, Akbir Khan
- Abstract summary: We investigate whether safety measures remain effective even if large language models intentionally try to bypass them.
We use a two-level deployment framework in which an adaptive macro-protocol chooses between micro-protocols.
At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
- Score: 22.843390303635655
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As large language models (LLMs) become increasingly capable, it is prudent to assess whether safety measures remain effective even if LLMs intentionally try to bypass them. Previous work introduced control evaluations, an adversarial framework for testing deployment strategies of untrusted models (i.e., models which might be trying to bypass safety measures). While prior work treats a single failure as unacceptable, we perform control evaluations in a "distributed threat setting" -- a setting where no single action is catastrophic and no single action provides overwhelming evidence of misalignment. We approach this problem with a two-level deployment framework that uses an adaptive macro-protocol to choose between micro-protocols. Micro-protocols operate on a single task, using a less capable, but extensively tested (trusted) model to harness and monitor the untrusted model. Meanwhile, the macro-protocol maintains an adaptive credence on the untrusted model's alignment based on its past actions, using it to pick between safer and riskier micro-protocols. We evaluate our method in a code generation testbed where a red team attempts to generate subtly backdoored code with an LLM whose deployment is safeguarded by a blue team. We plot Pareto frontiers of safety (# of non-backdoored solutions) and usefulness (# of correct solutions). At a given level of usefulness, our adaptive deployment strategy reduces the number of backdoors by 80% compared to non-adaptive baselines.
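To make the two-level framework concrete, the following is a minimal illustrative sketch in Python. It assumes a binary suspicion flag from the trusted monitor, a simple Bayes update for the macro-protocol's credence, and only two micro-protocols (use the untrusted model's output directly vs. fall back to the trusted model); the function names, likelihood values, and threshold are assumptions made for illustration, not the paper's exact protocols.

```python
# Illustrative sketch of the adaptive two-level deployment loop (assumptions
# noted in the surrounding text; not the paper's exact protocols).

def update_credence(credence, suspicious,
                    p_sus_if_aligned=0.1, p_sus_if_misaligned=0.6):
    """Bayes-update the macro-protocol's credence that the untrusted model is
    aligned, given whether the trusted monitor flagged the output as suspicious."""
    like_aligned = p_sus_if_aligned if suspicious else 1 - p_sus_if_aligned
    like_misaligned = p_sus_if_misaligned if suspicious else 1 - p_sus_if_misaligned
    joint_aligned = like_aligned * credence
    return joint_aligned / (joint_aligned + like_misaligned * (1 - credence))

def deploy(tasks, untrusted, trusted, monitor, credence=0.5, trust_threshold=0.7):
    """Macro-protocol: pick a riskier or safer micro-protocol per task,
    based on the current credence in the untrusted model's alignment."""
    outputs = []
    for task in tasks:
        if credence >= trust_threshold:
            # Riskier, more useful micro-protocol: use the untrusted model's solution.
            solution = untrusted(task)
        else:
            # Safer, less useful micro-protocol: fall back to the trusted model.
            solution = trusted(task)
        suspicious = monitor(task, solution)          # trusted model audits the output
        credence = update_credence(credence, suspicious)
        outputs.append(solution)
    return outputs
```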
Related papers
- AegisLLM: Scaling Agentic Systems for Self-Reflective Defense in LLM Security [74.22452069013289]
AegisLLM is a cooperative multi-agent defense against adversarial attacks and information leakage.
We show that scaling the agentic reasoning system at test time substantially enhances robustness without compromising model utility.
Comprehensive evaluations across key threat scenarios, including unlearning and jailbreaking, demonstrate the effectiveness of AegisLLM.
arXiv Detail & Related papers (2025-04-29T17:36:05Z)
- Collab: Controlled Decoding using Mixture of Agents for LLM Alignment [90.6117569025754]
Reinforcement learning from human feedback has emerged as an effective technique to align large language models.
Controlled Decoding provides a mechanism for aligning a model at inference time without retraining.
We propose a mixture of agent-based decoding strategies leveraging the existing off-the-shelf aligned LLM policies.
arXiv Detail & Related papers (2025-03-27T17:34:25Z)
- A sketch of an AI control safety case [3.753791609999324]
As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe.
We sketch how developers could construct a "control safety case".
This safety case sketch is a step toward more concrete arguments that can be used to show that a dangerously capable LLM agent is safe to deploy.
arXiv Detail & Related papers (2025-01-28T21:52:15Z)
- Position: On-Premises LLM Deployment Demands a Middle Path: Preserving Privacy Without Sacrificing Model Confidentiality [18.575663556525864]
We argue that deploying closed-source LLMs within user-controlled infrastructure enhances data privacy and mitigates misuse risks.
A well-designed on-premises deployment must ensure model confidentiality -- by preventing model theft -- and offer privacy-preserving customization.
Our findings demonstrate that privacy and confidentiality can coexist, paving the way for secure on-premises AI deployment.
arXiv Detail & Related papers (2024-10-15T02:00:36Z)
- ASPIRER: Bypassing System Prompts With Permutation-based Backdoors in LLMs [17.853862145962292]
We introduce a novel backdoor attack that systematically bypasses system prompts.
Our method achieves an attack success rate (ASR) of up to 99.50% while maintaining a clean accuracy (CACC) of 98.58%.
arXiv Detail & Related papers (2024-10-05T02:58:20Z)
- Efficiently Deploying LLMs with Controlled Risk [0.9208007322096532]
We present hierarchical chains with multi-level abstention (HCMA), which use model-intrinsic uncertainty to delegate queries.
Our framework presents novel trade-offs between efficiency and risk.
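A hedged sketch of what uncertainty-based delegation with abstention could look like (the model ordering, uncertainty function, and thresholds below are illustrative assumptions, not the HCMA algorithm itself):

```python
def cascade_answer(query, models, uncertainty,
                   answer_threshold=0.2, abstain_threshold=0.5):
    """Try models from cheapest to most capable; a level answers only when its
    intrinsic uncertainty is low, otherwise it delegates the query upward."""
    response, u = None, 1.0
    for model in models:                         # ordered cheapest -> most capable
        response = model(query)
        u = uncertainty(model, query, response)  # model-intrinsic uncertainty estimate
        if u <= answer_threshold:
            return response                      # confident enough to answer here
    # Even the most capable model was unsure; abstain rather than risk an error.
    return response if u <= abstain_threshold else None
```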
arXiv Detail & Related papers (2024-10-03T03:25:56Z)
- Criticality and Safety Margins for Reinforcement Learning [53.10194953873209]
We seek to define a criticality framework with both a quantifiable ground truth and a clear significance to users.
We introduce true criticality as the expected drop in reward when an agent deviates from its policy for n consecutive random actions.
We also introduce the concept of proxy criticality, a low-overhead metric that has a statistically monotonic relationship to true criticality.
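A compact way to write the true-criticality definition (the notation here is assumed, not taken from the paper):

```latex
% Expected return lost when the agent takes n uniformly random actions from
% state s_t before resuming its policy \pi.
C_n(s_t) \;=\; \mathbb{E}_{\pi}\!\left[ G_t \mid s_t \right]
\;-\; \mathbb{E}_{a_t,\dots,a_{t+n-1}\sim \mathrm{Unif}(\mathcal{A}),\ \pi\ \text{afterwards}}\!\left[ G_t \mid s_t \right]
```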
arXiv Detail & Related papers (2024-09-26T21:00:45Z)
- No Regrets: Investigating and Improving Regret Approximations for Curriculum Discovery [53.08822154199948]
Unsupervised Environment Design (UED) methods have gained recent attention as their adaptive curricula promise to enable agents to be robust to in- and out-of-distribution tasks.
This work investigates how existing UED methods select training environments, focusing on task prioritisation metrics.
We develop a method that directly trains on scenarios with high learnability.
arXiv Detail & Related papers (2024-08-27T14:31:54Z)
- Uncertainty is Fragile: Manipulating Uncertainty in Large Language Models [79.76293901420146]
Large Language Models (LLMs) are employed across various high-stakes domains, where the reliability of their outputs is crucial.
Our research investigates the fragility of uncertainty estimation and explores potential attacks.
We demonstrate that an attacker can embed a backdoor in LLMs, which, when activated by a specific trigger in the input, manipulates the model's uncertainty without affecting the final output.
arXiv Detail & Related papers (2024-07-15T23:41:11Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response.
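A hedged sketch of the data construction implied by component (1); the field names and prefix length are chosen for illustration only:

```python
import random

def build_prefixed_refusal_example(prompt, harmful_response, safe_response,
                                   max_prefix_tokens=32):
    """Prepend a random-length segment of a harmful response to a safe refusal,
    so training teaches the model to pivot to refusal at any response position."""
    prefix_len = random.randint(1, max_prefix_tokens)
    harmful_prefix = " ".join(harmful_response.split()[:prefix_len])
    return {"prompt": prompt, "completion": harmful_prefix + " " + safe_response}
```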
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance [56.184255657175335]
We develop InferAligner, a novel inference-time alignment method that utilizes cross-model guidance for harmlessness alignment.
Experimental results show that our method can be very effectively applied to domain-specific models in finance, medicine, and mathematics.
It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while maintaining almost unchanged performance in downstream tasks.
arXiv Detail & Related papers (2024-01-20T10:41:03Z)
- Can We Trust the Unlabeled Target Data? Towards Backdoor Attack and Defense on Model Adaptation [120.42853706967188]
We explore potential backdoor attacks on model adaptation launched via well-designed poisoned target data.
We propose a plug-and-play method named MixAdapt, which can be combined with existing adaptation algorithms.
arXiv Detail & Related papers (2024-01-11T16:42:10Z)
- Model-Assisted Probabilistic Safe Adaptive Control With Meta-Bayesian Learning [33.75998206184497]
We develop a novel adaptive safe control framework that integrates meta learning, Bayesian models, and control barrier function (CBF) method.
Specifically, with the help of the CBF method, we learn the inherent and external uncertainties with a unified adaptive Bayesian linear regression model.
For a new control task, we refine the meta-learned models using a few samples, and introduce pessimistic confidence bounds into CBF constraints to ensure safe control.
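For context, the textbook control barrier function condition that such constraints build on is shown below (a generic form; the summarized approach additionally tightens it with pessimistic confidence bounds on the learned uncertainty):

```latex
% Safe set and CBF condition for dynamics \dot{x} = f(x) + g(x)u,
% with \alpha an extended class-K function.
\mathcal{C} = \{\, x : h(x) \ge 0 \,\}, \qquad
\nabla h(x)^{\top}\bigl( f(x) + g(x)\,u \bigr) \;\ge\; -\,\alpha\bigl( h(x) \bigr)
```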
arXiv Detail & Related papers (2023-07-03T08:16:01Z)
- Risk-Averse Model Uncertainty for Distributionally Robust Safe Reinforcement Learning [3.9821399546174825]
We introduce a deep reinforcement learning framework for safe decision making in uncertain environments.
We provide robustness guarantees for this framework by showing it is equivalent to a specific class of distributionally robust safe reinforcement learning problems.
In experiments on continuous control tasks with safety constraints, we demonstrate that our framework produces robust performance and safety at deployment time across a range of perturbed test environments.
arXiv Detail & Related papers (2023-01-30T00:37:06Z)
- On Provable Backdoor Defense in Collaborative Learning [35.22450536986004]
Malicious users can upload data to prevent the model's convergence or inject hidden backdoors.
Backdoor attacks are especially difficult to detect since the model behaves normally on standard test data but gives wrong outputs when triggered by certain backdoor keys.
We propose a novel framework that generalizes existing subset aggregation methods.
arXiv Detail & Related papers (2021-01-19T14:39:32Z)