Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
- URL: http://arxiv.org/abs/2312.06924v2
- Date: Thu, 6 Jun 2024 22:58:32 GMT
- Title: Safety Alignment in NLP Tasks: Weakly Aligned Summarization as an In-Context Attack
- Authors: Yu Fu, Yufei Li, Wen Xiao, Cong Liu, Yue Dong,
- Abstract summary: Are mainstream NLP tasks adequately aligned with safety consideration?
Our study reveals significant disparities in the safety alignment of various NLP tasks.
Attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integrity of tasks traditionally deemed more robust.
- Score: 20.551730528019338
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent developments in balancing the usefulness and safety of Large Language Models (LLMs) have raised a critical question: Are mainstream NLP tasks adequately aligned with safety consideration? Our study, focusing on safety-sensitive documents obtained through adversarial attacks, reveals significant disparities in the safety alignment of various NLP tasks. For instance, LLMs can effectively summarize malicious long documents but often refuse to translate them. This discrepancy highlights a previously unidentified vulnerability: attacks exploiting tasks with weaker safety alignment, like summarization, can potentially compromise the integrity of tasks traditionally deemed more robust, such as translation and question-answering (QA). Moreover, the concurrent use of multiple NLP tasks with lesser safety alignment increases the risk of LLMs inadvertently processing harmful content. We demonstrate these vulnerabilities in various safety-aligned LLMs, particularly Llama2 models, Gemini and GPT-4, indicating an urgent need for strengthening safety alignments across a broad spectrum of NLP tasks.
Related papers
- SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models [75.67623347512368]
We propose toolns, a comprehensive framework designed for conducting safety evaluations of MLLMs.
Our framework consists of a comprehensive harmful query dataset and an automated evaluation protocol.
Based on our framework, we conducted large-scale experiments on 15 widely-used open-source MLLMs and 6 commercial MLLMs.
arXiv Detail & Related papers (2024-10-24T17:14:40Z) - Multimodal Situational Safety [73.63981779844916]
We present the first evaluation and analysis of a novel safety challenge termed Multimodal Situational Safety.
For an MLLM to respond safely, whether through language or action, it often needs to assess the safety implications of a language query within its corresponding visual context.
We develop the Multimodal Situational Safety benchmark (MSSBench) to assess the situational safety performance of current MLLMs.
arXiv Detail & Related papers (2024-10-08T16:16:07Z) - Superficial Safety Alignment Hypothesis [8.297367440457508]
We propose the Superficial Safety Alignment Hypothesis (SSAH), which posits that safety alignment should teach an otherwise unsafe model to choose the correct reasoning direction.
We identify four types of attribute-critical components in safety-aligned large language models (LLMs)
Our findings show that freezing certain safety-critical components 7.5% during fine-tuning allows the model to retain its safety attributes while adapting to new tasks.
arXiv Detail & Related papers (2024-10-07T19:53:35Z) - Multitask Mayhem: Unveiling and Mitigating Safety Gaps in LLMs Fine-tuning [1.3307486544794784]
Red teaming/Safety alignment efforts show that fine-tuning models on benign (non-harmful) data could compromise safety.
This paper explores the task-wise safety degradation due to fine-tuning on downstream tasks such as summarization, code generation, translation, and classification.
Our work underscores the need for generalized alignment measures to ensure safer and more robust models.
arXiv Detail & Related papers (2024-09-18T08:04:24Z) - Nothing in Excess: Mitigating the Exaggerated Safety for LLMs via Safety-Conscious Activation Steering [56.92068213969036]
Safety alignment is indispensable for Large language models (LLMs) to defend threats from malicious instructions.
Recent researches reveal safety-aligned LLMs prone to reject benign queries due to the exaggerated safety issue.
We propose a Safety-Conscious Activation Steering (SCANS) method to mitigate the exaggerated safety concerns.
arXiv Detail & Related papers (2024-08-21T10:01:34Z) - Cross-Task Defense: Instruction-Tuning LLMs for Content Safety [20.00136552026715]
Large Language Models (LLMs) face challenges in balancing safety with utility.
Despite defenses against malicious short questions, the ability of LLMs to safely handle dangerous long content, such as manuals teaching illicit activities, remains unclear.
We introduce a defense dataset comprised of safety-related examples and propose single-task and mixed-task losses for instruction tuning.
arXiv Detail & Related papers (2024-05-24T04:14:32Z) - Towards Comprehensive and Efficient Post Safety Alignment of Large Language Models via Safety Patching [77.36097118561057]
textscSafePatching is a novel framework for comprehensive and efficient PSA.
textscSafePatching achieves a more comprehensive and efficient PSA than baseline methods.
arXiv Detail & Related papers (2024-05-22T16:51:07Z) - SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models [107.82336341926134]
SALAD-Bench is a safety benchmark specifically designed for evaluating Large Language Models (LLMs)
It transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities.
arXiv Detail & Related papers (2024-02-07T17:33:54Z) - The Art of Defending: A Systematic Evaluation and Analysis of LLM
Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.