DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs
- URL: http://arxiv.org/abs/2411.19038v1
- Date: Thu, 28 Nov 2024 10:33:11 GMT
- Title: DIESEL -- Dynamic Inference-Guidance via Evasion of Semantic Embeddings in LLMs
- Authors: Ben Ganon, Alon Zolfi, Omer Hofman, Inderjeet Singh, Hisashi Kojima, Yuval Elovici, Asaf Shabtai
- Abstract summary: DIESEL is a lightweight inference technique that can be seamlessly integrated into any autoregressive LLM. It enhances response safety by reranking the LLM's proposed tokens based on their similarity to predefined negative concepts in the latent space.
- Score: 23.441711206966914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, conversational large language models (LLMs) have shown tremendous success in tasks such as casual conversation, question answering, and personalized dialogue, making significant advancements in domains like virtual assistance, social interaction, and online customer engagement. However, they often generate responses that are not aligned with human values (e.g., ethical standards, safety, or social norms), leading to potentially unsafe or inappropriate outputs. While several techniques have been proposed to address this problem, they come with a cost, requiring computationally expensive training or dramatically increasing the inference time. In this paper, we present DIESEL, a lightweight inference guidance technique that can be seamlessly integrated into any autoregressive LLM to semantically filter undesired concepts from the response. DIESEL can function either as a standalone safeguard or as an additional layer of defense, enhancing response safety by reranking the LLM's proposed tokens based on their similarity to predefined negative concepts in the latent space. This approach provides an efficient and effective solution for maintaining alignment with human values. Our evaluation demonstrates DIESEL's effectiveness on state-of-the-art conversational models (e.g., Llama 3), even in challenging jailbreaking scenarios that test the limits of response safety. We further show that DIESEL can be generalized to use cases other than safety, providing a versatile solution for general-purpose response filtering with minimal computational overhead.
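The reranking mechanism described in the abstract lends itself to a short illustration. Below is a minimal sketch of latent-space token reranking in the spirit of DIESEL, not the authors' implementation: the base LLM, the sentence encoder, the top-k size, and the penalty weight `alpha` are assumptions chosen for illustration.

```python
# Minimal sketch of latent-space token reranking in the spirit of DIESEL.
# Assumptions (not from the paper): the base LLM, the sentence encoder,
# top-k size, and penalty weight `alpha` are illustrative placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sentence_transformers import SentenceTransformer, util

LM_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"  # any autoregressive LLM (gated; substitute a local model if needed)
tok = AutoTokenizer.from_pretrained(LM_NAME)
lm = AutoModelForCausalLM.from_pretrained(LM_NAME)
encoder = SentenceTransformer("all-MiniLM-L6-v2")  # latent space for concept similarity

# Predefined negative concepts to steer generation away from.
negative_concepts = ["graphic violence", "self-harm instructions", "building weapons"]
neg_emb = encoder.encode(negative_concepts, convert_to_tensor=True)

def rerank_next_token(prefix_ids, k=10, alpha=0.5):
    """Combine the LLM's token probability with a safety penalty and pick the best candidate."""
    with torch.no_grad():
        logits = lm(prefix_ids).logits[0, -1]
    top_p, top_ids = torch.softmax(logits, dim=-1).topk(k)

    prefix_text = tok.decode(prefix_ids[0], skip_special_tokens=True)
    best_id, best_score = None, float("-inf")
    for p, tid in zip(top_p.tolist(), top_ids.tolist()):
        candidate = prefix_text + tok.decode([tid])
        cand_emb = encoder.encode(candidate, convert_to_tensor=True)
        sim = util.cos_sim(cand_emb, neg_emb).max().item()  # closeness to negative concepts
        score = p * (1.0 - alpha * max(sim, 0.0))            # penalize unsafe continuations
        if score > best_score:
            best_id, best_score = tid, score
    return best_id

# Hypothetical usage: pick one reranked token for a prompt.
ids = tok("How do I stay safe online?", return_tensors="pt").input_ids
print(tok.decode([rerank_next_token(ids)]))
```

In a full decoding loop this reranking would run at every generation step; as the abstract notes, it can act as a standalone safeguard or be layered on top of an existing defense.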
Related papers
- `For Argument's Sake, Show Me How to Harm Myself!': Jailbreaking LLMs in Suicide and Self-Harm Contexts [0.0]
We present two new test cases in mental health for (i) suicide and (ii) self-harm, using multi-step, prompt-level jailbreaking to bypass built-in content and safety filters. We show that user intent is disregarded, leading to the generation of detailed harmful content and instructions that could cause real-world harm.
arXiv Detail & Related papers (2025-07-01T18:00:04Z) - LoX: Low-Rank Extrapolation Robustifies LLM Safety Against Fine-tuning [61.594212398272184]
Low-Rank Extrapolation (LoX) improves robustness against benign and malicious fine-tuning attacks. LoX leads to 11% to 54% absolute reductions in attack success rates.
arXiv Detail & Related papers (2025-06-18T16:30:02Z) - ROSE: Toward Reality-Oriented Safety Evaluation of Large Language Models [60.28667314609623]
Large Language Models (LLMs) are increasingly deployed as black-box components in real-world applications. We propose Reality-Oriented Safety Evaluation (ROSE), a novel framework that uses multi-objective reinforcement learning to fine-tune an adversarial LLM.
arXiv Detail & Related papers (2025-06-17T10:55:17Z) - ReGA: Representation-Guided Abstraction for Model-based Safeguarding of LLMs [0.9285458070502282]
Large Language Models (LLMs) have achieved significant success in various tasks, yet concerns about their safety and security have emerged. To analyze and monitor machine learning models, model-based analysis has demonstrated notable potential in stateful deep neural networks. We propose ReGA, a model-based analysis framework with representation-guided abstraction, to safeguard LLMs against harmful prompts and generations.
arXiv Detail & Related papers (2025-06-02T15:17:38Z) - One Trigger Token Is Enough: A Defense Strategy for Balancing Safety and Usability in Large Language Models [20.42976162135529]
Large Language Models (LLMs) have been extensively used across diverse domains, including virtual assistants, automated code generation, and scientific research. We propose D-STT, a simple yet effective defense algorithm that identifies and explicitly decodes safety trigger tokens of the given safety-aligned LLM.
arXiv Detail & Related papers (2025-05-12T01:26:50Z) - $\texttt{SAGE}$: A Generic Framework for LLM Safety Evaluation [9.935219917903858]
This paper introduces the $\texttt{SAGE}$ (Safety AI Generic Evaluation) framework.
$\texttt{SAGE}$ is an automated modular framework designed for customized and dynamic harm evaluations.
Our experiments with multi-turn conversational evaluations revealed a concerning finding that harm steadily increases with conversation length.
arXiv Detail & Related papers (2025-04-28T11:01:08Z) - Align in Depth: Defending Jailbreak Attacks via Progressive Answer Detoxification [17.500701903902094]
Large Language Models (LLMs) are vulnerable to jailbreak attacks, which use crafted prompts to elicit toxic responses.
This paper proposes DEEPALIGN, a robust defense framework that fine-tunes LLMs to progressively detoxify generated content.
arXiv Detail & Related papers (2025-03-14T08:32:12Z) - Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking [26.812138599896997]
We propose Reasoning-to-Defend (R2D), a novel training paradigm that integrates safety reflections of queries and responses into the LLM's generation process.
R2D effectively mitigates various attacks and improves overall safety, highlighting the substantial potential of safety-aware reasoning in strengthening LLMs' robustness against jailbreaks.
arXiv Detail & Related papers (2025-02-18T15:48:46Z) - Reasoning-Augmented Conversation for Multi-Turn Jailbreak Attacks on Large Language Models [53.580928907886324]
Reasoning-Augmented Conversation (RACE) is a novel multi-turn jailbreak framework.
It reformulates harmful queries into benign reasoning tasks.
We show that RACE achieves state-of-the-art attack effectiveness in complex conversational scenarios.
arXiv Detail & Related papers (2025-02-16T09:27:44Z) - Almost Surely Safe Alignment of Large Language Models at Inference-Time [20.5164976103514]
Even highly capable large language models (LLMs) can produce biased or unsafe responses.
This paper introduces a novel inference-time alignment approach.
We achieve this by framing the safe generation of inference-time responses as a constrained Markov decision process.
arXiv Detail & Related papers (2025-02-03T09:59:32Z) - Smoothed Embeddings for Robust Language Models [11.97873981355746]
Large language models (LLMs) are vulnerable to jailbreaking attacks that subvert alignment and induce harmful outputs.
We propose the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense, which adds random noise to the embedding vectors and performs aggregation during the generation of each output token.
Our experiments demonstrate that our approach achieves superior robustness versus utility tradeoffs compared to the baseline defenses (a brief sketch of this smoothing idea appears after this list).
arXiv Detail & Related papers (2025-01-27T20:57:26Z) - Optimizing Large Language Models for Dynamic Constraints through Human-in-the-Loop Discriminators [0.0]
Large Language Models (LLMs) have recently demonstrated impressive capabilities across various real-world applications.
We propose a flexible framework that enables LLMs to interact with system interfaces, summarize constraint concepts, and continually optimize performance metrics.
Our framework achieved a 7.78% pass rate with the human discriminator and a 6.11% pass rate with the LLM-based discriminator.
arXiv Detail & Related papers (2024-10-19T17:27:38Z) - Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution [1.8814321586521556]
"Survival of the Safest" (SoS) is an innovative multi-objective prompt optimization framework.
It enhances both performance and security in large language models simultaneously.
SoS provides a scalable solution that expedites optimization in complex, high-dimensional discrete search spaces.
arXiv Detail & Related papers (2024-10-12T21:16:29Z) - ETA: Evaluating Then Aligning Safety of Vision Language Models at Inference Time [12.160713548659457]
Adversarial visual inputs can easily bypass VLM defense mechanisms.
We propose a novel two-phase inference-time alignment framework, evaluating input visual contents and output responses.
Experiments show that ETA outperforms baseline methods in terms of harmlessness, helpfulness, and efficiency.
arXiv Detail & Related papers (2024-10-09T07:21:43Z) - Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse compliance to harmful prompts at any response position.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to safety refusal consistently throughout the harmful response.
arXiv Detail & Related papers (2024-07-12T09:36:33Z) - One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs).
We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z) - Developing Safe and Responsible Large Language Model : Can We Balance Bias Reduction and Language Understanding in Large Language Models? [2.089112028396727]
This study explores whether Large Language Models can produce safe, unbiased outputs without sacrificing knowledge or comprehension.
We introduce the Safe and Responsible Large Language Model (SR$_{\text{LLM}}$), which has been instruction fine-tuned atop an inherently safe fine-tuned LLM.
Experiments reveal that SR$_{\text{LLM}}$ effectively reduces biases while preserving knowledge integrity.
arXiv Detail & Related papers (2024-04-01T18:10:05Z) - RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z) - SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs).
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z) - On Prompt-Driven Safeguarding for Large Language Models [172.13943777203377]
We find that in the representation space, the input queries are typically moved by safety prompts in a "higher-refusal" direction.
Inspired by these findings, we propose a method for safety prompt optimization, namely DRO.
Treating a safety prompt as continuous, trainable embeddings, DRO learns to move the queries' representations along or opposite the refusal direction, depending on their harmfulness.
arXiv Detail & Related papers (2024-01-31T17:28:24Z)
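Several of the summaries above compress an inference-time mechanism into a sentence or two. As one example, the embedding-smoothing idea in the RESTA entry (adding random noise to the embedding vectors and aggregating during the generation of each output token) might look roughly like the sketch below; the noise scale, sample count, stand-in model, and logit averaging are assumptions rather than the authors' exact method.

```python
# Rough sketch of randomized embedding smoothing with per-token aggregation,
# in the spirit of the RESTA summary above. Noise scale, sample count,
# stand-in model, and logit averaging are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

NAME = "gpt2"  # small stand-in model; any causal LM exposing input embeddings works
tok = AutoTokenizer.from_pretrained(NAME)
lm = AutoModelForCausalLM.from_pretrained(NAME)

def smoothed_next_token(prompt, n_samples=8, sigma=0.05):
    """Average next-token logits over several noisy copies of the input embeddings."""
    ids = tok(prompt, return_tensors="pt").input_ids
    embeds = lm.get_input_embeddings()(ids)          # [1, seq_len, hidden_dim]
    logits_sum = torch.zeros(lm.config.vocab_size)
    with torch.no_grad():
        for _ in range(n_samples):
            noisy = embeds + sigma * torch.randn_like(embeds)  # random smoothing noise
            logits_sum += lm(inputs_embeds=noisy).logits[0, -1]
    return int((logits_sum / n_samples).argmax())

# Hypothetical usage: one smoothed decoding step.
print(tok.decode([smoothed_next_token("The capital of France is")]))
```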
This list is automatically generated from the titles and abstracts of the papers on this site.