Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
- URL: http://arxiv.org/abs/2410.10871v1
- Date: Tue, 08 Oct 2024 13:42:36 GMT
- Title: Applying Refusal-Vector Ablation to Llama 3.1 70B Agents
- Authors: Simon Lermen, Mateusz Dziemian, Govind Pimpale
- Abstract summary: In this study, we apply refusal-vector ablation to Llama 3.1 70B and implement a simple agent scaffolding to create an unrestricted agent.
Our findings imply that these refusal-vector ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks.
Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, language models like Llama 3.1 Instruct have become increasingly capable of agentic behavior, enabling them to perform tasks requiring short-term planning and tool use. In this study, we apply refusal-vector ablation to Llama 3.1 70B and implement a simple agent scaffolding to create an unrestricted agent. Our findings imply that these refusal-vector ablated models can successfully complete harmful tasks, such as bribing officials or crafting phishing attacks, revealing significant vulnerabilities in current safety mechanisms. To further explore this, we introduce a small Safe Agent Benchmark, designed to test both harmful and benign tasks in agentic scenarios. Our results imply that safety fine-tuning in chat models does not generalize well to agentic behavior, as we find that Llama 3.1 Instruct models are willing to perform most harmful tasks without modifications. At the same time, these models will refuse to give advice on how to perform the same tasks when asked for a chat completion. This highlights the growing risk of misuse as models become more capable, underscoring the need for improved safety frameworks for language model agents.
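For readers unfamiliar with the technique, the sketch below illustrates the general idea behind refusal-vector ablation: estimate a "refusal direction" as the difference of mean residual-stream activations between harmful and harmless instructions, then project that direction out of the hidden states at inference time. This is a minimal illustration, not the authors' implementation; the layer index, prompt sets, hook placement, and the PyTorch/`transformers` usage are assumptions made for the example.

```python
import torch

def mean_activation(model, tokenizer, prompts, layer):
    """Average residual-stream activation at the final token of each prompt."""
    acts = []
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1, :])  # (hidden_dim,)
    return torch.stack(acts).mean(dim=0)

def refusal_direction(model, tokenizer, harmful, harmless, layer):
    """Refusal vector = mean(harmful activations) - mean(harmless activations), normalized."""
    diff = (mean_activation(model, tokenizer, harmful, layer)
            - mean_activation(model, tokenizer, harmless, layer))
    return diff / diff.norm()

def ablation_hook(direction):
    """Forward hook that removes the refusal component from a decoder layer's output."""
    def hook(module, args, output):
        hidden = output[0] if isinstance(output, tuple) else output
        # Subtract the projection of the hidden states onto the refusal direction.
        proj = (hidden @ direction).unsqueeze(-1) * direction
        hidden = hidden - proj
        return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
    return hook
```

Under these assumptions, the hook would be registered on each decoder layer of a Llama-style model, e.g. `model.model.layers[i].register_forward_hook(ablation_hook(direction))`. The paper additionally wraps the modified model in an agent scaffold with tool access, which is not shown here.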
Related papers
- Learning When to Act or Refuse: Guarding Agentic Reasoning Models for Safe Multi-Step Tool Use [6.622648583261088]
Agentic language models must plan, call tools, and execute long-horizon actions where a single misstep can cause irreversible harm. We introduce MOSAIC, a framework that aligns agents for safe multi-step tool use by making safety decisions explicit and learnable. We show that MOSAIC reduces harmful behavior by up to 50%, increases harmful-task refusal by over 20% on injection attacks, cuts privacy leakage, and preserves or improves benign task performance.
arXiv Detail & Related papers (2026-03-03T17:59:35Z) - SkillJect: Automating Stealthy Skill-Based Prompt Injection for Coding Agents with Trace-Driven Closed-Loop Refinement [120.52289344734415]
We propose an automated framework for stealthy prompt injection tailored to agent skills. The framework forms a closed loop with three agents: an Attack Agent that synthesizes injection skills under explicit stealth constraints, a Code Agent that executes tasks using the injected skills, and an Evaluate Agent that logs action traces. Our method consistently achieves high attack success rates under realistic settings.
arXiv Detail & Related papers (2026-02-15T16:09:48Z) - OMNI-LEAK: Orchestrator Multi-Agent Network Induced Data Leakage [59.3826294523924]
We investigate the security vulnerabilities of a popular multi-agent pattern known as the orchestrator setup. We report the susceptibility of frontier models to different categories of attacks, finding that both reasoning and non-reasoning models are vulnerable.
arXiv Detail & Related papers (2026-02-13T21:32:32Z) - Steering Externalities: Benign Activation Steering Unintentionally Increases Jailbreak Risk for Large Language Models [62.16655896700062]
Activation steering is a technique to enhance the utility of Large Language Models (LLMs). We show that it unintentionally introduces critical and under-explored safety risks. Experiments reveal that these interventions act as a force multiplier, creating new vulnerabilities to jailbreaks and increasing attack success rates to over 80% on standard benchmarks.
arXiv Detail & Related papers (2026-02-03T12:32:35Z) - Sponge Tool Attack: Stealthy Denial-of-Efficiency against Tool-Augmented Agentic Reasoning [58.432996881401415]
Recent work augments large language models (LLMs) with external tools to enable agentic reasoning. We propose Sponge Tool Attack (STA), which disrupts agentic reasoning solely by rewriting the input prompt. STA generates benign-looking prompt rewrites from the original one with high semantic fidelity.
arXiv Detail & Related papers (2026-01-24T19:36:51Z) - Are Your Agents Upward Deceivers? [73.1073084327614]
Large Language Model (LLM)-based agents are increasingly used as autonomous subordinates that carry out tasks for users. This raises the question of whether they may also engage in deception, similar to how individuals in human organizations lie to superiors to create a good image or avoid punishment. We observe and define agentic upward deception, a phenomenon in which an agent facing environmental constraints conceals its failure and performs actions that were not requested, without reporting them.
arXiv Detail & Related papers (2025-12-04T14:47:05Z) - Check Yourself Before You Wreck Yourself: Selectively Quitting Improves LLM Agent Safety [2.7030665672026846]
Large Language Model (LLM) agents increasingly operate in complex environments with real-world consequences. We propose using "quitting" as a simple yet effective behavioral mechanism for LLM agents to recognize and withdraw from situations where they lack confidence.
arXiv Detail & Related papers (2025-10-18T13:22:19Z) - Adversarial Reinforcement Learning for Large Language Model Agent Safety [20.704989548285372]
Large Language Model (LLM) agents can leverage tools like Google Search to complete complex tasks. Current defense strategies rely on fine-tuning LLM agents on datasets of known attacks. We propose Adversarial Reinforcement Learning for Agent Safety (ARLAS), a novel framework that leverages adversarial reinforcement learning (RL) by formulating the problem as a two-player zero-sum game.
arXiv Detail & Related papers (2025-10-06T23:09:18Z) - IPIGuard: A Novel Tool Dependency Graph-Based Defense Against Indirect Prompt Injection in LLM Agents [33.775221377823925]
Large language model (LLM) agents are widely deployed in real-world applications, where they leverage tools to retrieve and manipulate external data for complex tasks. When interacting with untrusted data sources, tool responses may contain injected instructions that covertly influence agent behaviors and lead to malicious outcomes. We propose a novel defensive task execution paradigm, called IPIGuard, to prevent malicious tool invocations at the source.
arXiv Detail & Related papers (2025-08-21T07:08:16Z) - Attractive Metadata Attack: Inducing LLM Agents to Invoke Malicious Tools [10.086284534400658]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex reasoning and decision-making by leveraging external tools. We identify this as a new and stealthy threat surface that allows malicious tools to be preferentially selected by LLM agents. We propose a black-box in-context learning framework that generates highly attractive but syntactically and semantically valid tool metadata.
arXiv Detail & Related papers (2025-08-04T06:38:59Z) - AgentArmor: Enforcing Program Analysis on Agent Runtime Trace to Defend Against Prompt Injection [8.266563350981984]
Large Language Model (LLM) agents offer a powerful new paradigm for solving various problems by combining natural language reasoning with the execution of external tools. In this work, we propose a novel insight that treats the agent runtime traces as structured programs with analyzable semantics. We present AgentArmor, a program analysis framework that converts agent traces into graph intermediate representation-based structured program dependency representations.
arXiv Detail & Related papers (2025-08-02T07:59:34Z) - Design Patterns for Securing LLM Agents against Prompt Injections [26.519964636138585]
Prompt injection attacks exploit the agent's reliance on natural language inputs. We propose a set of principled design patterns for building AI agents with provable resistance to prompt injection.
arXiv Detail & Related papers (2025-06-10T14:23:55Z) - AgentAlign: Navigating Safety Alignment in the Shift from Informative to Agentic Large Language Models [23.916663925674737]
Previous work has shown that current LLM-based agents execute numerous malicious tasks even without being attacked. We propose AgentAlign, a novel framework that leverages abstract behavior chains as a medium for safety alignment data synthesis. Our framework enables the generation of highly authentic and executable instructions while capturing complex multi-step dynamics.
arXiv Detail & Related papers (2025-05-29T03:02:18Z) - Think Before Refusal : Triggering Safety Reflection in LLMs to Mitigate False Refusal Behavior [59.20260988638777]
We demonstrate that prompting safety reflection before generating a response can mitigate false refusal behavior.
In an ablation study across 15 pre-trained models, we show that models fine-tuned with safety reflection significantly reduce false refusal behavior.
arXiv Detail & Related papers (2025-03-22T23:35:49Z) - MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents [60.30753230776882]
LLM agents are vulnerable to indirect prompt injection (IPI) attacks, where malicious tasks embedded in tool-retrieved information can redirect the agent to take unauthorized actions. We present MELON, a novel IPI defense that detects attacks by re-executing the agent's trajectory with a masked user prompt modified through a masking function.
arXiv Detail & Related papers (2025-02-07T18:57:49Z) - Towards Action Hijacking of Large Language Model-based Agent [39.19067800226033]
We introduce Name, a novel hijacking attack to manipulate the action plans of black-box agent systems.
Our approach achieved an average bypass rate of 92.7% for safety filters.
arXiv Detail & Related papers (2024-12-14T12:11:26Z) - Steering Language Model Refusal with Sparse Autoencoders [16.78963326253821]
We identify and steer features in Phi-3 Mini that mediate refusal behavior.
We find that feature steering can improve Phi-3 Mini's robustness to jailbreak attempts across various harms.
However, feature steering can adversely affect overall performance on benchmarks.
arXiv Detail & Related papers (2024-11-18T05:47:02Z) - Breaking ReAct Agents: Foot-in-the-Door Attack Will Get You In [5.65782619470663]
We examine how ReAct agents can be exploited using a straightforward yet effective method we refer to as the foot-in-the-door attack.
Our experiments show that indirect prompt injection attacks can significantly increase the likelihood of the agent performing subsequent malicious actions.
To mitigate this vulnerability, we propose implementing a simple reflection mechanism that prompts the agent to reassess the safety of its actions during execution.
arXiv Detail & Related papers (2024-10-22T12:24:41Z) - AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents [84.96249955105777]
LLM agents may pose a greater risk if misused, but their robustness remains underexplored.
We propose a new benchmark called AgentHarm to facilitate research on LLM agent misuse.
We find leading LLMs are surprisingly compliant with malicious agent requests without jailbreaking.
arXiv Detail & Related papers (2024-10-11T17:39:22Z) - What Makes and Breaks Safety Fine-tuning? A Mechanistic Study [64.9691741899956]
Safety fine-tuning helps align Large Language Models (LLMs) with human preferences for their safe deployment.
We design a synthetic data generation framework that captures salient aspects of an unsafe input.
Using this, we investigate three well-known safety fine-tuning methods.
arXiv Detail & Related papers (2024-07-14T16:12:57Z) - BEEAR: Embedding-based Adversarial Removal of Safety Backdoors in Instruction-tuned Language Models [57.5404308854535]
Safety backdoor attacks in large language models (LLMs) enable the stealthy triggering of unsafe behaviors while evading detection during normal interactions.
We present BEEAR, a mitigation approach leveraging the insight that backdoor triggers induce relatively uniform drifts in the model's embedding space.
Our bi-level optimization method identifies universal embedding perturbations that elicit unwanted behaviors and adjusts the model parameters to reinforce safe behaviors against these perturbations.
arXiv Detail & Related papers (2024-06-24T19:29:47Z) - Model for Peanuts: Hijacking ML Models without Training Access is Possible [5.005171792255858]
Model hijacking is an attack where an adversary aims to hijack a victim model to execute a different task than its original one.
We propose a simple approach for model hijacking at inference time named SnatchML to classify unknown input samples.
We first propose a novel approach we call meta-unlearning, designed to help the model unlearn a potentially malicious task while training on the original dataset.
arXiv Detail & Related papers (2024-06-03T18:04:37Z) - Query-Based Adversarial Prompt Generation [67.238873588125]
We build adversarial examples that cause an aligned language model to emit harmful strings.
We validate our attack on GPT-3.5 and OpenAI's safety classifier.
arXiv Detail & Related papers (2024-02-19T18:01:36Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z) - Navigating the OverKill in Large Language Models [84.62340510027042]
We investigate the factors for overkill by exploring how models handle and determine the safety of queries.
Our findings reveal the presence of shortcuts within models, leading to over-attention to harmful words like 'kill', and show that prompts emphasizing safety exacerbate overkill.
We introduce Self-Contrastive Decoding (Self-CD), a training-free and model-agnostic strategy, to alleviate this phenomenon.
arXiv Detail & Related papers (2024-01-31T07:26:47Z) - AGI Agent Safety by Iteratively Improving the Utility Function [0.0]
We present an AGI safety layer that creates a special dedicated input terminal to support the iterative improvement of an AGI agent's utility function.
We show ongoing work in mapping it to a Causal Influence Diagram (CID).
We then present the design of a learning agent, a design that wraps the safety layer around either a known machine learning system, or a potential future AGI-level learning system.
arXiv Detail & Related papers (2020-07-10T14:30:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.