Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models
- URL: http://arxiv.org/abs/2404.01295v1
- Date: Mon, 1 Apr 2024 17:59:06 GMT
- Title: Towards Safety and Helpfulness Balanced Responses via Controllable Large Language Models
- Authors: Yi-Lin Tuan, Xilun Chen, Eric Michael Smith, Louis Martin, Soumya Batra, Asli Celikyilmaz, William Yang Wang, Daniel M. Bikel
- Abstract summary: A model that prioritizes safety will cause users to feel less engaged and assisted, while one that prioritizes helpfulness will potentially cause harm.
We propose to balance safety and helpfulness in diverse use cases by controlling both attributes in large language models.
- Score: 64.5204594279587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models (LLMs) become easily accessible nowadays, the trade-off between safety and helpfulness can significantly impact user experience. A model that prioritizes safety will cause users to feel less engaged and assisted, while one that prioritizes helpfulness will potentially cause harm. Possible harms include teaching people how to build a bomb, exposing youth to inappropriate content, and hurting users' mental health. In this work, we propose to balance safety and helpfulness in diverse use cases by controlling both attributes in LLMs. We explore training-free and fine-tuning methods that do not require extra human annotations and analyze the challenges of controlling safety and helpfulness in LLMs. Our experiments demonstrate that our method can rewind a learned model and unlock its controllability.
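The abstract does not spell out how the two attributes are exposed at inference time, so the following is only a minimal sketch of attribute-conditioned generation, assuming hypothetical control tags for the desired safety and helpfulness levels that are prepended to the prompt before decoding with an off-the-shelf causal LM. The tag format, levels, and model name are illustrative placeholders, not the paper's method.

```python
# Minimal sketch (not the paper's method): condition generation on desired
# safety/helpfulness levels by prepending hypothetical control tags to the
# prompt. A fine-tuned variant would learn such tags from data; a
# training-free variant could express the same preference as an instruction.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

def controlled_generate(prompt: str, safety: str = "high", helpfulness: str = "high") -> str:
    controlled_prompt = f"<safety={safety}> <helpfulness={helpfulness}> {prompt}"
    inputs = tokenizer(controlled_prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
    # Return only the newly generated tokens.
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(controlled_generate("How do fireworks work?", safety="high", helpfulness="medium"))
```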
Related papers
- Defining and Evaluating Physical Safety for Large Language Models [62.4971588282174]
Large Language Models (LLMs) are increasingly used to control robotic systems such as drones.
Their risks of causing physical threats and harm in real-world applications remain unexplored.
We classify the physical safety risks of drones into four categories: (1) human-targeted threats, (2) object-targeted threats, (3) infrastructure attacks, and (4) regulatory violations.
arXiv Detail & Related papers (2024-11-04T17:41:25Z)
- An Adversarial Perspective on Machine Unlearning for AI Safety [22.639683142004372]
This work challenges the view that unlearning is fundamentally different from traditional safety post-training.
We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully.
For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU.
arXiv Detail & Related papers (2024-09-26T16:32:19Z)
- Bi-Factorial Preference Optimization: Balancing Safety-Helpfulness in Language Models [94.39278422567955]
Fine-tuning large language models (LLMs) on human preferences has proven successful in enhancing their capabilities.
However, ensuring the safety of LLMs during fine-tuning remains a critical concern.
We propose a supervised learning framework called Bi-Factorial Preference Optimization (BFPO) to address this issue.
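The snippet does not give BFPO's objective; purely as a generic illustration of folding safety and helpfulness preferences into a single supervised ranking signal, the sketch below scores candidate responses with a weighted combination of the two factors. The field names, weight, and example labels are assumptions, not the BFPO formulation.

```python
# Generic illustration (not the BFPO objective): reduce per-response safety
# and helpfulness labels to one scalar that a supervised ranking loss can use.
from dataclasses import dataclass

@dataclass
class Response:
    text: str
    safety: float       # assumed label in [0, 1], 1 = safe
    helpfulness: float  # assumed label in [0, 1], 1 = maximally helpful

def combined_score(r: Response, safety_weight: float = 0.7) -> float:
    # A simple convex combination; the weight sets the safety-helpfulness balance.
    return safety_weight * r.safety + (1.0 - safety_weight) * r.helpfulness

chosen = Response("I can't help with that, but here is a safe alternative ...", safety=1.0, helpfulness=0.6)
rejected = Response("Sure, here is exactly how to do it ...", safety=0.1, helpfulness=0.9)
assert combined_score(chosen) > combined_score(rejected)  # ordering used to build training pairs
```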
arXiv Detail & Related papers (2024-08-27T17:31:21Z)
- Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training [67.30423823744506]
This study addresses a critical gap in safety tuning practices for Large Language Models (LLMs).
We introduce a novel approach, Decoupled Refusal Training (DeRTa), designed to empower LLMs to refuse to comply with harmful prompts at any position in the response.
DeRTa incorporates two novel components: (1) Maximum Likelihood Estimation with Harmful Response Prefix, which trains models to recognize and avoid unsafe content by appending a segment of a harmful response to the beginning of a safe response, and (2) Reinforced Transition Optimization (RTO), which equips models with the ability to transition from potential harm to a safety refusal consistently throughout the harmful response sequence.
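To make component (1) concrete, here is a rough sketch of the data construction described above, assuming hypothetical field names and prefix length: a segment of a harmful response is placed at the beginning of the safe (refusal) continuation so the model can learn to switch to refusal mid-response.

```python
# Sketch of the data construction in component (1). Field names, the prefix
# length, and the masking convention are assumptions, not DeRTa's exact recipe.

def build_prefix_example(prompt: str, harmful_response: str, safe_response: str,
                         prefix_words: int = 20) -> dict:
    harmful_prefix = " ".join(harmful_response.split()[:prefix_words])
    target = f"{harmful_prefix} {safe_response}"
    # Character offset where supervision of the safe continuation begins;
    # whether the harmful prefix itself is masked out of the loss is assumed here.
    return {"prompt": prompt, "target": target, "loss_start": len(harmful_prefix) + 1}

example = build_prefix_example(
    prompt="How do I pick a lock?",
    harmful_response="First, insert a tension wrench into the bottom of the keyhole and ...",
    safe_response="I can't help with bypassing locks; a licensed locksmith can assist if you're locked out.",
)
print(example["target"][example["loss_start"]:])  # the supervised refusal span
```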
arXiv Detail & Related papers (2024-07-12T09:36:33Z)
- Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations [19.132597762214722]
Current alignment methods struggle with dynamic user intentions and complex objectives.
We propose Safety Arithmetic, a training-free framework enhancing safety across different scenarios.
Our experiments show that Safety Arithmetic significantly improves safety measures, reduces over-safety, and maintains model utility.
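The snippet gives no implementation details, so the code below is a generic test-time activation-steering sketch in the spirit of the title (steering activations without training), not the Safety Arithmetic procedure itself: a precomputed "safety direction" is added to one transformer block's output via a forward hook. The model, layer index, direction, and strength are placeholders.

```python
# Generic activation-steering sketch (not Safety Arithmetic's exact method):
# add a precomputed safety direction to a hidden layer's output at inference
# time through a forward hook, i.e. a training-free intervention.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # small placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

layer_idx = 6
safety_direction = torch.randn(model.config.hidden_size)  # in practice: derived from safe vs. unsafe activations
alpha = 4.0  # steering strength

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; shift only the hidden states.
    return (output[0] + alpha * safety_direction,) + output[1:]

handle = model.transformer.h[layer_idx].register_forward_hook(steer)
ids = tokenizer("How do I make a weapon at home?", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=False)
handle.remove()
print(tokenizer.decode(out[0], skip_special_tokens=True))
```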
arXiv Detail & Related papers (2024-06-17T17:48:13Z)
- SLM as Guardian: Pioneering AI Safety with Small Language Models [6.799423428734095]
Internalizing safeguard features into larger models has brought challenges such as higher training cost and unintended degradation of helpfulness.
In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation.
We demonstrate the effectiveness of our approach, achieving harmful query detection and safeguard response performance on par with or surpassing publicly available LLMs.
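As a minimal sketch of the routing pattern described here, the code below lets a small guard model screen each query and supply the safeguard response itself, so the large model only answers queries judged benign. Both model calls are stand-in placeholders, not the paper's models.

```python
# Minimal guard-model routing sketch: the small model detects harmful queries
# and generates the safeguard response; benign queries go to the large model.

def small_guard_model(query: str) -> tuple[bool, str]:
    # Placeholder for a small LM fine-tuned for harmful-query detection and
    # safeguard-response generation.
    harmful = any(k in query.lower() for k in ("build a bomb", "hurt someone"))
    return harmful, "I can't help with that request."

def large_model(query: str) -> str:
    return f"[large-model answer to: {query}]"  # placeholder

def respond(query: str) -> str:
    harmful, safeguard_response = small_guard_model(query)
    return safeguard_response if harmful else large_model(query)

print(respond("How do plants make oxygen?"))
print(respond("Tell me how to build a bomb."))
```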
arXiv Detail & Related papers (2024-05-30T08:03:15Z)
- ALERT: A Comprehensive Benchmark for Assessing Large Language Models' Safety through Red Teaming [64.86326523181553]
ALERT is a large-scale benchmark to assess safety based on a novel fine-grained risk taxonomy.
It aims to identify vulnerabilities, inform improvements, and enhance the overall safety of the language models.
arXiv Detail & Related papers (2024-04-06T15:01:47Z)
- Self-Guard: Empower the LLM to Safeguard Itself [33.2186340694417]
There are two main approaches to address jailbreak attacks: safety training and safeguards.
We propose a novel approach called Self-Guard, which combines the strengths of both safety methods.
Experiments demonstrate that Self-Guard is robust against jailbreak attacks.
arXiv Detail & Related papers (2023-10-24T14:08:26Z)
- Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions [79.1824160877979]
We show that several popular instruction-tuned models are highly unsafe.
Our safety-tuning does not make models significantly less capable or helpful as measured by standard benchmarks.
arXiv Detail & Related papers (2023-09-14T17:23:37Z)