Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment
- URL: http://arxiv.org/abs/2502.04040v1
- Date: Thu, 06 Feb 2025 13:01:44 GMT
- Title: Leveraging Reasoning with Guidelines to Elicit and Utilize Knowledge for Enhancing Safety Alignment
- Authors: Haoyu Wang, Zeyu Qin, Li Shen, Xueqian Wang, Minhao Cheng, Dacheng Tao
- Abstract summary: Refusal Training (RT) struggles to generalize against various OOD jailbreaking attacks.
We observe significant improvements in generalization as N increases.
We propose training the model to perform safety reasoning for each query.
- Score: 63.15719512614899
- License:
- Abstract: Training safe LLMs is one of the most critical research challenges. However, the commonly used method, Refusal Training (RT), struggles to generalize against various OOD jailbreaking attacks. Many safety training methods have been proposed to address this issue. While they offer valuable insights, we aim to complement this line of research by investigating whether OOD attacks truly exceed the capability of the RT model. Conducting evaluation with Best-of-N (BoN) sampling, we observe significant improvements in generalization as N increases. This underscores that the model possesses sufficient safety-related latent knowledge, but RT fails to consistently elicit this knowledge when addressing OOD attacks. Further analysis based on domain adaptation reveals that training with direct refusal causes the model to rely on superficial shortcuts, resulting in the learning of non-robust representation mappings. Based on our findings, we propose training the model to perform safety reasoning for each query. Reasoning supervision encourages the model to perform more computation, explicitly eliciting and using latent knowledge through reasoning. To achieve this, we synthesize reasoning supervision based on predefined guidelines, training the model to reason in alignment with them, thereby effectively eliciting and utilizing latent knowledge from diverse perspectives. Extensive experiments show that our method significantly improves generalization performance against OOD attacks.
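The BoN probing described above can be illustrated with a minimal sketch: sample N responses per OOD jailbreak prompt and count a prompt as covered if at least one response is judged safe. This is an assumption-laden illustration, not the authors' released code; the `generate` sampler and `is_safe_response` judge are hypothetical placeholders for whatever model interface and safety classifier an evaluation uses.

```python
# Minimal sketch of Best-of-N (BoN) safety probing, assuming a stochastic
# sampler and a safety judge supplied by the caller (both hypothetical here).
from typing import Callable, List


def best_of_n_safety(
    generate: Callable[[str], str],                # samples one response per call (temperature > 0)
    is_safe_response: Callable[[str, str], bool],  # judge: (prompt, response) -> is the response safe?
    jailbreak_prompts: List[str],
    n: int = 16,
) -> float:
    """Fraction of prompts for which at least one of N samples is judged safe.

    If this fraction rises markedly with N, the model plausibly holds the
    relevant safety knowledge but fails to elicit it consistently, which is
    the observation motivating reasoning-based supervision in the abstract.
    """
    hits = 0
    for prompt in jailbreak_prompts:
        samples = [generate(prompt) for _ in range(n)]
        if any(is_safe_response(prompt, s) for s in samples):
            hits += 1
    return hits / max(len(jailbreak_prompts), 1)
```

In practice one would sweep N (e.g. 1, 4, 16, 64) and compare the resulting curves for RT-trained and reasoning-trained models; the sketch only fixes the counting convention.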
Related papers
- Are Smarter LLMs Safer? Exploring Safety-Reasoning Trade-offs in Prompting and Fine-Tuning [40.55486479495965]
Large Language Models (LLMs) have demonstrated remarkable success across various NLP benchmarks.
In this work, we investigate the interplay between reasoning and safety in LLMs.
We highlight the latent safety risks that arise as reasoning capabilities improve, shedding light on previously overlooked vulnerabilities.
arXiv Detail & Related papers (2025-02-13T06:37:28Z) - Vulnerability Mitigation for Safety-Aligned Language Models via Debiasing [12.986006070964772]
Safety alignment is an essential research topic for real-world AI applications.
Our study first identified the difficulty of eliminating such vulnerabilities without sacrificing the model's helpfulness.
Our method could enhance the model's helpfulness while maintaining safety, thus improving the trade-off front.
arXiv Detail & Related papers (2025-02-04T09:31:54Z) - OpenAI o1 System Card [274.83891368890977]
The o1 model series is trained with large-scale reinforcement learning to reason using chain of thought.
This report outlines the safety work carried out for the OpenAI o1 and OpenAI o1-mini models, including safety evaluations, external red teaming, and Preparedness Framework evaluations.
arXiv Detail & Related papers (2024-12-21T18:04:31Z) - Safe to Serve: Aligning Instruction-Tuned Models for Safety and Helpfulness [0.0]
Large language models (LLMs) have demonstrated remarkable capabilities in complex reasoning and text generation.
LLMs can inadvertently generate unsafe or biased responses when prompted with problematic inputs.
This research addresses the critical challenge of developing language models that generate both helpful and harmless content.
arXiv Detail & Related papers (2024-11-26T06:52:22Z) - Unveiling the Misuse Potential of Base Large Language Models via In-Context Learning [61.2224355547598]
Open-sourcing of large language models (LLMs) accelerates application development, innovation, and scientific progress.
Our investigation exposes a critical oversight in this belief.
By deploying carefully designed demonstrations, our research demonstrates that base LLMs could effectively interpret and execute malicious instructions.
arXiv Detail & Related papers (2024-04-16T13:22:54Z) - Unlearning Backdoor Threats: Enhancing Backdoor Defense in Multimodal Contrastive Learning via Local Token Unlearning [49.242828934501986]
Multimodal contrastive learning has emerged as a powerful paradigm for building high-quality features.
Backdoor attacks subtly embed malicious behaviors within the model during training.
We introduce an innovative token-based localized forgetting training regime.
arXiv Detail & Related papers (2024-03-24T18:33:15Z) - The Art of Defending: A Systematic Evaluation and Analysis of LLM Defense Strategies on Safety and Over-Defensiveness [56.174255970895466]
Large Language Models (LLMs) play an increasingly pivotal role in natural language processing applications.
This paper presents the Safety and Over-Defensiveness Evaluation (SODE) benchmark.
arXiv Detail & Related papers (2023-12-30T17:37:06Z) - Evaluating the Safety of Deep Reinforcement Learning Models using Semi-Formal Verification [81.32981236437395]
We present a semi-formal verification approach for decision-making tasks based on interval analysis.
Our method obtains results comparable to formal verifiers on standard benchmarks.
Our approach allows safety properties of decision-making models to be evaluated efficiently in practical applications.
arXiv Detail & Related papers (2020-10-19T11:18:06Z)