The Poison of Alignment
- URL: http://arxiv.org/abs/2308.13449v1
- Date: Fri, 25 Aug 2023 15:51:15 GMT
- Title: The Poison of Alignment
- Authors: Aibek Bekbayev, Sungbae Chun, Yerzat Dulat, James Yamazaki
- Abstract summary: We introduce a novel insight into how an instruction-tuned model's performance is affected by the presence of alignment.
We demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: From the perspective of content safety, alignment has been shown to limit
large language models' (LLMs) harmful content generation. This intentional
method of reinforcing models not to respond to certain user inputs seems to be
present in many modern open-source instruction-tuning datasets such as
OpenAssistant or Guanaco. We introduce a novel insight into how an
instruction-tuned model's performance is affected by the presence of alignment
in the supervised fine-tuning dataset. Specifically, we observe that alignment
acts as if it were poisoning the instruction dataset. Experimentally, we
demonstrate that aligned answers significantly worsen the performance of the
resulting fine-tuned model on various reasoning benchmarks such as Big Bench
Hard (BBH), Massive Multitask Language Understanding (MMLU), HumanEval, and
Discrete Reasoning Over Paragraphs (DROP), performing 4-33% worse than the
counterpart tuned without alignment.
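In practice, the comparison amounts to fine-tuning one model on the original instruction dataset and one on a copy with the aligned (refusal-style) answers removed, then evaluating both on the benchmarks above. Below is a minimal sketch of such a filtering pass; the refusal phrases, record fields, and keyword heuristic are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch: strip "aligned" (refusal-style) answers from an SFT dataset
# before fine-tuning. The refusal markers and record layout are illustrative
# assumptions, not the paper's exact filtering procedure.
import json

REFUSAL_MARKERS = (
    "i'm sorry, but",
    "as an ai language model",
    "i cannot help with",
    "i can't assist with",
)

def is_aligned_answer(answer: str) -> bool:
    """Flag responses that look like safety refusals rather than task answers."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def filter_dataset(in_path: str, out_path: str) -> None:
    """Write a copy of the instruction dataset with refusal-style answers removed."""
    kept, dropped = 0, 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            record = json.loads(line)  # expects {"instruction": ..., "output": ...}
            if is_aligned_answer(record["output"]):
                dropped += 1
                continue
            dst.write(json.dumps(record) + "\n")
            kept += 1
    print(f"kept {kept} examples, dropped {dropped} aligned answers")

# filter_dataset("sft_with_alignment.jsonl", "sft_without_alignment.jsonl")
```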
Related papers
- Context-Parametric Inversion: Why Instruction Finetuning May Not Actually Improve Context Reliance [68.56701216210617]
In principle, one would expect models to adapt to the user context better after instruction finetuning.
We observe a surprising failure mode: during instruction tuning, the context reliance under knowledge conflicts initially increases as expected, but then gradually decreases.
arXiv Detail & Related papers (2024-10-14T17:57:09Z) - Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z) - Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models [53.50543146583101]
Fine-tuning large language models on small datasets can enhance their performance on specific downstream tasks.
Malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors.
We propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data.
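A minimal sketch of that mitigation idea follows: recast generic safety examples into the same prompt template as the user data and blend in a small fraction before fine-tuning. The template, mixing ratio, and record fields are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: blend format-matched safety data into a user fine-tuning set.
import random

def mimic_user_format(safety_examples, template="### Question:\n{q}\n### Answer:\n"):
    """Recast generic safety QA pairs into the user data's prompting style."""
    return [
        {"prompt": template.format(q=ex["question"]), "completion": ex["safe_answer"]}
        for ex in safety_examples
    ]

def mix_datasets(user_data, safety_data, safety_fraction=0.1, seed=0):
    """Return the user data plus a small fraction of format-matched safety data."""
    rng = random.Random(seed)
    n_safety = int(len(user_data) * safety_fraction)
    mixed = user_data + rng.sample(safety_data, min(n_safety, len(safety_data)))
    rng.shuffle(mixed)
    return mixed
```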
arXiv Detail & Related papers (2024-06-12T18:33:11Z) - Disperse-Then-Merge: Pushing the Limits of Instruction Tuning via Alignment Tax Reduction [75.25114727856861]
Large language models (LLMs) tend to suffer from deterioration at the latter stage of the supervised fine-tuning (SFT) process.
We introduce a simple disperse-then-merge framework to address the issue.
Our framework outperforms various sophisticated methods such as data curation and training regularization on a series of standard knowledge and reasoning benchmarks.
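A minimal sketch of a disperse-then-merge loop is given below, assuming the merge step is simple weight averaging of sub-models fine-tuned on disjoint data shards; the fine_tune callable is a placeholder for whatever SFT routine is used, not a real API, and the paper's actual merging scheme may differ.

```python
# Sketch: disperse the instruction data into shards, fine-tune one sub-model
# per shard, then merge the sub-models by averaging their weights.
import copy
import torch

def disperse_then_merge(base_model, dataset, num_shards=4, fine_tune=None):
    """Fine-tune one sub-model per data shard, then average their weights."""
    shards = [dataset[i::num_shards] for i in range(num_shards)]
    merged_state = None
    for shard in shards:
        sub_model = fine_tune(copy.deepcopy(base_model), shard)  # assumed SFT call
        state = sub_model.state_dict()
        if merged_state is None:
            merged_state = {k: v.clone().float() for k, v in state.items()}
        else:
            for k, v in state.items():
                merged_state[k] += v.float()
    merged_state = {k: v / num_shards for k, v in merged_state.items()}
    base_model.load_state_dict(merged_state)
    return base_model
```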
arXiv Detail & Related papers (2024-05-22T08:18:19Z) - Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging [25.078498180620425]
We present a step-by-step evaluation framework, Fennec, capable of Fine-grained EvaluatioN Extended through branChing and bridging.
We employ the fine-grained correction capabilities induced by the evaluation model to refine multiple model responses, leading to an improvement of 1-2 points on the MT-Bench.
arXiv Detail & Related papers (2024-05-20T16:47:22Z) - Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! [65.06450319194454]
Large language models (LLMs) undergo safety alignment to ensure safe conversations with humans.
This paper introduces a training-free attack method capable of reversing safety alignment.
We name this method emulated disalignment (ED) because sampling from this contrastive distribution provably emulates the result of fine-tuning to minimize a safety reward.
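The sketch below illustrates contrastive sampling in this spirit, assuming the combined distribution takes the usual emulated-fine-tuning form log p_ed = log p_base + alpha * (log p_base - log p_align) up to normalization; the paper's exact parameterization may differ.

```python
# Sketch: combine a pre-trained base model and its safety-aligned counterpart
# at decoding time so that sampling emulates fine-tuning against a safety reward.
import torch

def ed_next_token_logprobs(base_logits, aligned_logits, alpha=1.0):
    """Combine base and safety-aligned next-token logits into a disaligned distribution."""
    log_p_base = torch.log_softmax(base_logits, dim=-1)
    log_p_align = torch.log_softmax(aligned_logits, dim=-1)
    combined = log_p_base + alpha * (log_p_base - log_p_align)
    return torch.log_softmax(combined, dim=-1)

# At each decoding step, sample the next token from the combined distribution:
# probs = ed_next_token_logprobs(base_logits, aligned_logits).exp()
# next_token = torch.multinomial(probs, num_samples=1)
```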
arXiv Detail & Related papers (2024-02-19T18:16:51Z) - Adversarial Fine-Tuning of Language Models: An Iterative Optimisation Approach for the Generation and Detection of Problematic Content [0.0]
We tackle the emerging challenge of unintended harmful content generation in Large Language Models (LLMs).
Our two-pronged approach employs an adversarial model, fine-tuned to generate potentially harmful prompts, and a judge model, iteratively optimised to discern these prompts.
We show that a rudimentary model, ada, can achieve 13% higher accuracy on the hold-out test set than GPT-4 after only a few rounds of this process.
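The following is a minimal sketch of such a two-model loop; all of the callables (fine_tune, sample_prompts, label_prompts, judge_predict) are placeholders rather than a real API, and the hard-example selection step is an assumption about how the rounds could be chained.

```python
# Sketch: alternately improve a problematic-prompt generator and a judge model.
def adversarial_finetuning_loop(generator, judge, seed_prompts, rounds=3,
                                fine_tune=None, sample_prompts=None,
                                label_prompts=None, judge_predict=None):
    """Iteratively tune a generator of problematic prompts and a judge that detects them."""
    train_prompts = list(seed_prompts)
    for _ in range(rounds):
        # Adversarial step: tune the generator on the current pool of hard prompts.
        generator = fine_tune(generator, train_prompts)
        candidates = sample_prompts(generator, n=1000)
        # Judge step: label the new candidates (e.g. with humans or a stronger model)
        # and tune the judge to discriminate problematic from benign prompts.
        labeled = label_prompts(candidates)  # list of (prompt, label) pairs
        judge = fine_tune(judge, labeled)
        # Carry forward the prompts the updated judge still gets wrong.
        train_prompts = [p for p, label in labeled if judge_predict(judge, p) != label]
    return generator, judge
```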
arXiv Detail & Related papers (2023-08-26T05:20:58Z) - Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning [92.85265959892115]
This paper introduces the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset comprises 400k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers.
To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a stable approach to evaluate visual instruction tuning like human experts.
arXiv Detail & Related papers (2023-06-26T10:26:33Z) - Improving the Faithfulness of Abstractive Summarization via Entity Coverage Control [27.214742188672464]
We propose a method to remedy entity-level hallucinations with Entity Coverage Control (ECC).
ECC computes entity coverage precision and prepends the corresponding control code to each training example.
We show that the proposed method leads to more faithful and salient abstractive summarization in supervised fine-tuning and zero-shot settings.
arXiv Detail & Related papers (2022-07-05T18:52:19Z)
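A minimal sketch of the entity-coverage idea follows: measure the fraction of summary entities that appear in the source, then prepend a coarse control code to the training input. The spaCy NER pass, bucket thresholds, and control-code tokens are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch of Entity Coverage Control: compute entity coverage precision for a
# (source, summary) pair and prepend a corresponding control code.
import spacy

nlp = spacy.load("en_core_web_sm")

def entity_coverage_precision(source: str, summary: str) -> float:
    """Fraction of named entities in the summary that can be found in the source."""
    summary_ents = {ent.text.lower() for ent in nlp(summary).ents}
    if not summary_ents:
        return 1.0
    source_text = source.lower()
    covered = sum(1 for ent in summary_ents if ent in source_text)
    return covered / len(summary_ents)

def add_control_code(source: str, summary: str) -> str:
    """Prepend a coarse coverage bucket so the model can be conditioned on it."""
    p = entity_coverage_precision(source, summary)
    bucket = "<cov_high>" if p >= 0.9 else "<cov_mid>" if p >= 0.5 else "<cov_low>"
    return f"{bucket} {source}"
```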