Related papers: Course-Correction: Safety Alignment Using Synthetic Preferences

Course-Correction: Safety Alignment Using Synthetic Preferences

URL: http://arxiv.org/abs/2407.16637v1
Date: Tue, 23 Jul 2024 16:54:28 GMT
Title: Course-Correction: Safety Alignment Using Synthetic Preferences
Authors: Rongwu Xu, Yishuo Cai, Zhenhong Zhou, Renjie Gu, Haiqin Weng, Yan Liu, Tianwei Zhang, Wei Xu, Han Qiu,
Abstract summary: We introduce the textscC$2$-Eval benchmark for quantitative assessment and analyze 10 popular language models. Using an automated pipeline, we create textscC$2$-Syn, a synthetic dataset with 750K pairwise preferences. Experiments on 2 LLMs, textscLlama2-Chat 7B and textscQwen2 7B, show that our method effectively enhances course-correction skills without affecting general performance.
Score: 17.897817682322053
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The risk of harmful content generated by large language models (LLMs) becomes a critical concern. This paper presents a systematic study on assessing and improving LLMs' capability to perform the task of \textbf{course-correction}, \ie, the model can steer away from generating harmful content autonomously. To start with, we introduce the \textsc{C$^2$-Eval} benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction. To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create \textsc{C$^2$-Syn}, a synthetic dataset with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven preference learning. Experiments on 2 LLMs, \textsc{Llama2-Chat 7B} and \textsc{Qwen2 7B}, show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs' safety, particularly in resisting jailbreak attacks.

Related papers

PITA: Preference-Guided Inference-Time Alignment for LLM Post-Training [9.093854840532062]
PITA is a novel framework that integrates preference feedback directly into the LLM's token generation.<n> PITA learns a small preference-based guidance policy to modify token probabilities at inference time without fine-tuning.<n>We evaluate PITA across diverse tasks, including mathematical reasoning and sentiment classification.
arXiv Detail & Related papers (2025-07-26T21:46:32Z)
ESLM: Risk-Averse Selective Language Modeling for Efficient Pretraining [53.893792844055106]
Large language model pretraining is compute-intensive, yet many tokens contribute marginally to learning, resulting in inefficiency.<n>We introduce Selective Efficient Language Modeling, a risk-aware algorithm that improves training efficiency and distributional robustness by performing online token-level batch selection.<n> Experiments on GPT-2 pretraining show that ESLM significantly reduces training FLOPs while maintaining or improving both perplexity and downstream performance compared to baselines.
arXiv Detail & Related papers (2025-05-26T12:23:26Z)
LLM Unlearning Reveals a Stronger-Than-Expected Coreset Effect in Current Benchmarks [23.5632914682956]
Large language model unlearning has become a critical challenge in ensuring safety and controlled model behavior. We show that LLM unlearning can be effectively maintained using a significantly smaller subset (functioning as a "coreset") This suggests that LLM unlearning in these benchmarks can be performed surprisingly easily, even in an extremely low-data regime.
arXiv Detail & Related papers (2025-04-14T12:38:37Z)
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
RIDE: Enhancing Large Language Model Alignment through Restyled In-Context Learning Demonstration Exemplars [57.6513924960128]
Alignment tuning is crucial for ensuring large language models (LLMs) behave ethically and helpfully. This paper proposes a low-cost, tuning-free method using in-context learning (ICL) to enhance LLM alignment.
arXiv Detail & Related papers (2025-02-17T11:16:19Z)
Real-Time Personalization for LLM-based Recommendation with Customized In-Context Learning [57.28766250993726]
This work explores adapting to dynamic user interests without any model updates. Existing Large Language Model (LLM)-based recommenders often lose the in-context learning ability during recommendation tuning. We propose RecICL, which customizes recommendation-specific in-context learning for real-time recommendations.
arXiv Detail & Related papers (2024-10-30T15:48:36Z)
Attribute Controlled Fine-tuning for Large Language Models: A Case Study on Detoxification [76.14641982122696]
We propose a constraint learning schema for fine-tuning Large Language Models (LLMs) with attribute control. We show that our approach leads to an LLM that produces fewer inappropriate responses while achieving competitive performance on benchmarks and a toxicity detection task.
arXiv Detail & Related papers (2024-10-07T23:38:58Z)
Large Language Models can be Strong Self-Detoxifiers [82.6594169242814]
Self-disciplined Autoregressive Sampling (SASA) is a lightweight controlled decoding algorithm for toxicity reduction of large language models (LLMs) SASA tracks the margin of the current output to steer the generation away from the toxic subspace, by adjusting the autoregressive sampling strategy. evaluated on LLMs of different scale and nature, namely Llama-3.1-Instruct (8B), Llama-2 (7B), and GPT2-L models with the RealToxicityPrompts, BOLD, and AttaQ benchmarks.
arXiv Detail & Related papers (2024-10-04T17:45:15Z)
Towards Robust and Parameter-Efficient Knowledge Unlearning for LLMs [25.91643745340183]
Large Language Models (LLMs) have demonstrated strong reasoning and memorization capabilities via pretraining on massive textual corpora. This poses risk of privacy and copyright violations, highlighting the need for efficient machine unlearning methods. We propose Low-rank Knowledge Unlearning (LoKU), a novel framework that enables robust and efficient unlearning for LLMs.
arXiv Detail & Related papers (2024-08-13T04:18:32Z)
Soft Prompting for Unlearning in Large Language Models [11.504012974208466]
This work focuses on investigating machine unlearning for Large Language Models motivated by data protection regulations. We propose a framework textbfSoft textbfPrompting for textbfUntextbflearning (SPUL) We conduct a rigorous evaluation of the proposed method and our results indicate that SPUL can significantly improve the trade-off between utility and forgetting.
arXiv Detail & Related papers (2024-06-17T19:11:40Z)
One Token Can Help! Learning Scalable and Pluggable Virtual Tokens for Retrieval-Augmented Large Language Models [67.49462724595445]
Retrieval-augmented generation (RAG) is a promising way to improve large language models (LLMs) We propose a novel method that involves learning scalable and pluggable virtual tokens for RAG.
arXiv Detail & Related papers (2024-05-30T03:44:54Z)
Robustifying Safety-Aligned Large Language Models through Clean Data Curation [11.273749179260468]
Large language models (LLMs) are vulnerable when trained on datasets containing harmful content. In this paper, we propose a data curation framework designed to counter adversarial impacts in both scenarios.
arXiv Detail & Related papers (2024-05-24T04:50:38Z)
Machine Unlearning in Large Language Models [0.7864304771129751]
This paper introduces a methodology to align large language models (LLMs) with ethical, privacy, and safety standards. Our approach aims to selectively erase or modify learned information in LLMs, targeting harmful responses and copyrighted content.
arXiv Detail & Related papers (2024-05-24T02:12:51Z)
Offset Unlearning for Large Language Models [49.851093293780615]
delta-Unlearning is an offset unlearning framework for black-box LLMs.<n>We show that delta-Unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks.
arXiv Detail & Related papers (2024-04-17T03:39:51Z)
Developing Safe and Responsible Large Language Model : Can We Balance Bias Reduction and Language Understanding in Large Language Models? [2.089112028396727]
This study explores whether Large Language Models can produce safe, unbiased outputs without sacrificing knowledge or comprehension. We introduce the Safe and Responsible Large Language Model (textbfSR$_textLLM$), which has been instruction fine-tuned atop an inherently safe fine-tuned LLM. Experiments reveal that textbfSR$_textLLM$ effectively reduces biases while preserving knowledge integrity.
arXiv Detail & Related papers (2024-04-01T18:10:05Z)
TEaR: Improving LLM-based Machine Translation with Systematic Self-Refinement [26.26493253161022]
Large Language Models (LLMs) have achieved impressive results in Machine Translation (MT) We introduce a systematic LLM-based self-refinement translation framework, named textbfTEaR.
arXiv Detail & Related papers (2024-02-26T07:58:12Z)
Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
Reflection-Tuning: Data Recycling Improves LLM Instruction-Tuning [79.32236399694077]
Low-quality data in the training set are usually detrimental to instruction tuning. We propose a novel method, termed "reflection-tuning" This approach utilizes an oracle LLM to recycle the original training data by introspecting and enhancing the quality of instructions and responses in the data.
arXiv Detail & Related papers (2023-10-18T05:13:47Z)
TRACE: A Comprehensive Benchmark for Continual Learning in Large Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety. Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs. We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z)

This list is automatically generated from the titles and abstracts of the papers in this site.