Related papers: Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense

Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense

URL: http://arxiv.org/abs/2502.00840v1
Date: Sun, 02 Feb 2025 16:25:48 GMT
Title: Activation Approximations Can Incur Safety Vulnerabilities Even in Aligned LLMs: Comprehensive Analysis and Defense
Authors: Jiawen Zhang, Kejia Chen, Lipeng He, Jian Lou, Dan Li, Zunlei Feng, Mingli Song, Jian Liu, Kui Ren, Xiaohu Yang,
Abstract summary: Large Language Models (LLMs) have showcased remarkable capabilities across various domains.<n> activation approximation has emerged as a promising avenue for pursuing inference efficiency.<n>Despite achieving substantial speedups with minimal impact on utility, the safety implications of activation approximations remain unclear.
Score: 44.01174462291761
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have showcased remarkable capabilities across various domains. Accompanying the evolving capabilities and expanding deployment scenarios of LLMs, their deployment challenges escalate due to their sheer scale and the advanced yet complex activation designs prevalent in notable model series, such as Llama, Gemma, and Mistral. These challenges have become particularly pronounced in resource-constrained deployment scenarios, where mitigating inference efficiency bottlenecks is imperative. Among various recent efforts, activation approximation has emerged as a promising avenue for pursuing inference efficiency, sometimes considered indispensable in applications such as private inference. Despite achieving substantial speedups with minimal impact on utility, even appearing sound and practical for real-world deployment, the safety implications of activation approximations remain unclear. In this work, we fill this critical gap in LLM safety by conducting the first systematic safety evaluation of activation approximations. Our safety vetting spans seven sota techniques across three popular categories, revealing consistent safety degradation across ten safety-aligned LLMs.

Related papers

Representation Bending for Large Language Model Safety [27.842146980762934]
Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks pose significant challenges. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs. RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates.
arXiv Detail & Related papers (2025-04-02T09:47:01Z)
Exposing the Ghost in the Transformer: Abnormal Detection for Large Language Models via Hidden State Forensics [5.384257830522198]
Large Language Models (LLMs) in critical applications have introduced severe reliability and security risks. These vulnerabilities have been weaponized by malicious actors, leading to unauthorized access, widespread misinformation, and compromised system integrity. We introduce a novel approach to detecting abnormal behaviors in LLMs via hidden state forensics.
arXiv Detail & Related papers (2025-04-01T05:58:14Z)
Fundamental Safety-Capability Trade-offs in Fine-tuning Large Language Models [92.38300626647342]
Fine-tuning Large Language Models (LLMs) on some task-specific datasets has been a primary use of LLMs. This paper presents a theoretical framework for understanding the interplay between safety and capability in two primary safety-aware LLM fine-tuning strategies.
arXiv Detail & Related papers (2025-03-24T20:41:57Z)
You Can't Eat Your Cake and Have It Too: The Performance Degradation of LLMs with Jailbreak Defense [34.023473699165315]
We study the utility degradation, safety elevation, and exaggerated-safety escalation of LLMs with jailbreak defense strategies.<n>We find that mainstream jailbreak defenses fail to ensure both safety and performance simultaneously.
arXiv Detail & Related papers (2025-01-21T15:24:29Z)
Active Learning for Robust and Representative LLM Generation in Safety-Critical Scenarios [32.16984263644299]
Large Language Models (LLMs) can generate valuable data for safety measures, but often exhibit distributional biases. We propose a novel framework that integrates active learning with clustering to guide LLM generation. Our results show that the proposed framework produces a more representative set of safety scenarios without requiring prior knowledge of the underlying data distribution.
arXiv Detail & Related papers (2024-10-14T21:48:14Z)
SCANS: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large Language Models (LLMs) to defend threats from malicious instructions.<n>Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.<n>We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z)
Current state of LLM Risks and AI Guardrails [0.0]
Large language models (LLMs) have become increasingly sophisticated, leading to widespread deployment in sensitive applications where safety and reliability are paramount. These risks necessitate the development of "guardrails" to align LLMs with desired behaviors and mitigate potential harm. This work explores the risks associated with deploying LLMs and evaluates current approaches to implementing guardrails and model alignment techniques.
arXiv Detail & Related papers (2024-06-16T22:04:10Z)
Uncovering Safety Risks of Large Language Models through Concept Activation Vector [13.804245297233454]
We introduce a Safety Concept Activation Vector (SCAV) framework to guide attacks on large language models (LLMs)<n>We then develop an SCAV-guided attack method that can generate both attack prompts and embedding-level attacks.<n>Our attack method significantly improves the attack success rate and response quality while requiring less training data.
arXiv Detail & Related papers (2024-04-18T09:46:25Z)
Fine-Tuning, Quantization, and LLMs: Navigating Unintended Outcomes [0.0]
Large Language Models (LLMs) have gained widespread adoption across various domains, including chatbots and auto-task completion agents. These models are susceptible to safety vulnerabilities such as jailbreaking, prompt injection, and privacy leakage attacks. This study investigates the impact of these modifications on LLM safety, a critical consideration for building reliable and secure AI systems.
arXiv Detail & Related papers (2024-04-05T20:31:45Z)
RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks. This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
Uniformly Safe RL with Objective Suppression for Multi-Constraint Safety-Critical Applications [73.58451824894568]
The widely adopted CMDP model constrains the risks in expectation, which makes room for dangerous behaviors in long-tail states. In safety-critical domains, such behaviors could lead to disastrous outcomes. We propose Objective Suppression, a novel method that adaptively suppresses the task reward maximizing objectives according to a safety critic.
arXiv Detail & Related papers (2024-02-23T23:22:06Z)
Assessing the Brittleness of Safety Alignment via Pruning and Low-Rank Modifications [69.13807233595455]
Large language models (LLMs) show inherent brittleness in their safety mechanisms. This study explores this brittleness of safety alignment by leveraging pruning and low-rank modifications. We show that LLMs remain vulnerable to low-cost fine-tuning attacks even when modifications to the safety-critical regions are restricted.
arXiv Detail & Related papers (2024-02-07T18:34:38Z)
Lyapunov-based uncertainty-aware safe reinforcement learning [0.0]
InReinforcement learning (RL) has shown a promising performance in learning optimal policies for a variety of sequential decision-making tasks. In many real-world RL problems, besides optimizing the main objectives, the agent is expected to satisfy a certain level of safety. We propose a Lyapunov-based uncertainty-aware safe RL model to address these limitations.
arXiv Detail & Related papers (2021-07-29T13:08:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.