Smoothed Embeddings for Robust Language Models
- URL: http://arxiv.org/abs/2501.16497v1
- Date: Mon, 27 Jan 2025 20:57:26 GMT
- Title: Smoothed Embeddings for Robust Language Models
- Authors: Ryo Hase, Md Rafi Ur Rashid, Ashley Lewis, Jing Liu, Toshiaki Koike-Akino, Kieran Parsons, Ye Wang
- Abstract summary: Large language models (LLMs) are vulnerable to jailbreaking attacks that subvert alignment and induce harmful outputs. We propose the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense, which adds random noise to the embedding vectors and performs aggregation during the generation of each output token. Our experiments demonstrate that our approach achieves superior robustness versus utility tradeoffs compared to the baseline defenses.
- Score: 11.97873981355746
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Improving the safety and reliability of large language models (LLMs) is a crucial aspect of realizing trustworthy AI systems. Although alignment methods aim to suppress harmful content generation, LLMs are often still vulnerable to jailbreaking attacks that employ adversarial inputs that subvert alignment and induce harmful outputs. We propose the Randomized Embedding Smoothing and Token Aggregation (RESTA) defense, which adds random noise to the embedding vectors and performs aggregation during the generation of each output token, with the aim of better preserving semantic information. Our experiments demonstrate that our approach achieves superior robustness versus utility tradeoffs compared to the baseline defenses.
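The mechanism named in the abstract (noise added to embedding vectors, aggregation at each output token) can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary `TOY_EMBEDDINGS`, the stand-in model `toy_next_token`, the Gaussian noise, the majority-vote aggregation, and the parameters `sigma` and `num_samples` are all assumptions made for illustration.

```python
import random
from collections import Counter

# Hypothetical toy vocabulary: token -> 2-d embedding (illustration only).
TOY_EMBEDDINGS = {"a": [1.0, 0.0], "b": [0.0, 1.0], "<eos>": [-1.0, -1.0]}


def toy_next_token(vectors):
    """Stand-in for a language model: a fixed rule on the summed embeddings."""
    if len(vectors) >= 4:
        return "<eos>"
    s0 = sum(v[0] for v in vectors)
    s1 = sum(v[1] for v in vectors)
    return "b" if s0 > s1 + 0.5 else "a"


def smoothed_next_token(vectors, next_token_fn, sigma, num_samples, rng):
    """Majority vote over predictions made from noisy copies of the embeddings."""
    votes = Counter()
    for _ in range(num_samples):
        # Assumed noise model: independent Gaussian noise on every coordinate.
        noisy = [[x + rng.gauss(0.0, sigma) for x in vec] for vec in vectors]
        votes[next_token_fn(noisy)] += 1
    return votes.most_common(1)[0][0]


def generate(prompt_tokens, sigma=0.01, num_samples=9, max_new_tokens=8, seed=0):
    """Generate tokens one at a time, aggregating noisy predictions at each step."""
    rng = random.Random(seed)
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        vectors = [TOY_EMBEDDINGS[t] for t in tokens]
        tok = smoothed_next_token(vectors, toy_next_token, sigma, num_samples, rng)
        if tok == "<eos>":
            break
        tokens.append(tok)
    return tokens
```

Calling `generate(["a"])` runs the per-token vote on the toy model. In a real setting `toy_next_token` would be replaced by a forward pass of an actual LLM over the perturbed embeddings, and the noise distribution and aggregation rule would follow the paper rather than the choices assumed here.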
Related papers
- Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? [83.53005932513155]
Multi-modal large language models (MLLMs) have made significant progress, yet their safety alignment remains limited.
We propose finetuning MLLMs on a small set of benign instruct-following data with responses replaced by simple, clear rejection sentences.
arXiv Detail & Related papers (2025-04-14T09:03:51Z)
- CeTAD: Towards Certified Toxicity-Aware Distance in Vision Language Models [16.5022773312661]
We propose a universal certified defence framework to safeguard large vision-language models against jailbreak attacks.
First, we propose a novel distance metric to quantify semantic discrepancies between malicious and intended responses.
Then, we devise a regressed certification approach that employs randomized smoothing to provide formal robustness guarantees.
arXiv Detail & Related papers (2025-03-08T17:33:55Z)
- Improving LLM Safety Alignment with Dual-Objective Optimization [65.41451412400609]
Existing training-time safety alignment techniques for large language models (LLMs) remain vulnerable to jailbreak attacks.
We propose an improved safety alignment that disentangles DPO objectives into two components: (1) robust refusal training, which encourages refusal even when partial unsafe generations are produced, and (2) targeted unlearning of harmful knowledge.
arXiv Detail & Related papers (2025-03-05T18:01:05Z)
- A generative approach to LLM harmfulness detection with special red flag tokens [15.796683630119654]
We propose to expand the model's vocabulary with a special token we call the red flag token (<rf>).
This novel safety training method effectively augments LLMs into generative classifiers of harmfulness at all times during the conversation.
It also evaluates each generated answer rather than just the input prompt and provides a stronger defence against sampling-based attacks.
arXiv Detail & Related papers (2025-02-22T21:48:48Z)
- Watching the AI Watchdogs: A Fairness and Robustness Analysis of AI Safety Moderation Classifiers [5.35599092568615]
AI Safety Moderation (ASM) classifiers are designed to moderate content on social media platforms. It is crucial to ensure that these classifiers do not unfairly classify content belonging to users from minority groups. We thus examine the fairness and robustness of four widely-used, closed-source ASM classifiers.
arXiv Detail & Related papers (2025-01-23T01:04:00Z)
- Retention Score: Quantifying Jailbreak Risks for Vision Language Models [60.48306899271866]
Vision-Language Models (VLMs) are integrated with Large Language Models (LLMs) to enhance multi-modal machine learning capabilities.
This paper aims to assess the resilience of VLMs against jailbreak attacks that can compromise model safety compliance and result in harmful outputs.
To evaluate a VLM's ability to maintain its robustness against adversarial input perturbations, we propose a novel metric called the Retention Score.
arXiv Detail & Related papers (2024-12-23T13:05:51Z)
- Efficient Adversarial Training in LLMs with Continuous Attacks [99.5882845458567]
Large language models (LLMs) are vulnerable to adversarial attacks that can bypass their safety guardrails.
We propose a fast adversarial training algorithm (C-AdvUL) composed of two losses.
C-AdvIPO is an adversarial variant of IPO that does not require utility data for adversarially robust alignment.
arXiv Detail & Related papers (2024-05-24T14:20:09Z)
- RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content [62.685566387625975]
Current mitigation strategies, while effective, are not resilient under adversarial attacks.
This paper introduces Resilient Guardrails for Large Language Models (RigorLLM), a novel framework designed to efficiently moderate harmful and unsafe inputs.
arXiv Detail & Related papers (2024-03-19T07:25:02Z)
- Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing [107.97160023681184]
Aligned large language models (LLMs) are vulnerable to jailbreaking attacks.
We propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions of semantically transformed copies of a given input prompt.
arXiv Detail & Related papers (2024-02-25T20:36:03Z)
- SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs).
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z)