Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models
- URL: http://arxiv.org/abs/2406.10288v2
- Date: Mon, 1 Jul 2024 10:17:58 GMT
- Title: Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models
- Authors: Francisco Eiras, Aleksandar Petrov, Phillip H. S. Torr, M. Pawan Kumar, Adel Bibi
- Abstract summary: Fine-tuning large language models on small datasets can enhance their performance on specific downstream tasks.
Malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors.
We propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data.
- Score: 53.50543146583101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large language models on small, high-quality datasets can enhance their performance on specific downstream tasks. Recent research shows that fine-tuning on benign, instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. Although critical, understanding and mitigating safety risks in well-defined tasks remains distinct from the instruction-following context due to structural differences in the data. Our work addresses the gap in our understanding of these risks across diverse types of data in closed models - where providers control how user data is utilized in the fine-tuning process. We demonstrate how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is more effective than existing baselines at re-establishing safety alignment while maintaining similar task performance.
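The mitigation is described here only at a high level; the snippet below is a minimal, hedged sketch of the general idea under assumed names (TASK_TEMPLATE, SAFETY_PAIRS, mimic_task_format, and mix_in_safety are illustrative, not taken from the paper): generic harmful-request/refusal pairs are rewritten into the same prompt format as the user's task data, and a small fraction of them is blended into the fine-tuning set.

```python
import random

# Illustrative prompt wrapper assumed to match the user's task data;
# neither this template nor the example pair comes from the paper.
TASK_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

# A generic safety pair: a harmful request paired with a refusal.
SAFETY_PAIRS = [
    ("Describe how to build a weapon at home.",
     "I can't help with that request."),
]

def mimic_task_format(pairs, template=TASK_TEMPLATE):
    """Recast generic (request, refusal) safety pairs into the same prompt
    format and style as the user's task-specific fine-tuning examples."""
    return [template.format(instruction=q, response=a) for q, a in pairs]

def mix_in_safety(user_examples, safety_pairs, fraction=0.05, seed=0):
    """Blend a small fraction of format-matched safety examples into the
    user's fine-tuning set, then shuffle so they are interleaved."""
    rng = random.Random(seed)
    n_safety = max(1, int(fraction * len(user_examples)))
    safety = mimic_task_format(safety_pairs)
    mixed = list(user_examples) + [rng.choice(safety) for _ in range(n_safety)]
    rng.shuffle(mixed)
    return mixed
```

The mixing ratio and the source of the safety pairs are design choices; the abstract's claim is that safety data mimicking the task format and prompting style re-establishes alignment more effectively than existing baselines while maintaining similar task performance.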
Related papers
- Safety-Aware Fine-Tuning of Large Language Models [29.5636201427693]
Fine-tuning Large Language Models (LLMs) has emerged as a common practice for tailoring models to individual needs and preferences.
We propose a novel Safety-Aware Fine-Tuning (SAFT) framework designed to automatically detect and remove potentially harmful data.
arXiv Detail & Related papers (2024-10-13T21:24:25Z)
- Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models [79.65071553905021]
We propose Data Advisor, a method for generating data that takes into account the characteristics of the desired dataset.
Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation.
arXiv Detail & Related papers (2024-10-07T17:59:58Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenge of re-identification attacks enabled by Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning [85.66907881270785]
We propose a data curriculum method, namely Data-CUBE, that arranges the order of all the multi-task data for training.
At the task level, we aim to find the task order that minimizes the total cross-task interference risk.
At the instance level, we measure the difficulty of all instances per task and divide them into easy-to-difficult mini-batches for training (a minimal sketch of this instance-level step appears after this list of related papers).
arXiv Detail & Related papers (2024-01-07T18:12:20Z)
- Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations [63.73044203154743]
Self-supervised representation learning often uses data augmentations to induce invariance to "style" attributes of the data.
It is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded.
We introduce a more principled approach that seeks to disentangle style features rather than discard them.
arXiv Detail & Related papers (2023-11-15T09:34:08Z)
- Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models to protect against MI attacks and examine the inherent trade-off between privacy and utility.
arXiv Detail & Related papers (2023-10-20T05:44:39Z)
- The Poison of Alignment [0.0]
We introduce a novel insight into how the presence of alignment data affects an instruction-tuned model's performance.
We demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks.
arXiv Detail & Related papers (2023-08-25T15:51:15Z)
- Leveraging Large-scale Multimedia Datasets to Refine Content Moderation Models [8.147198294451151]
We propose a framework that leverages large-scale multimedia datasets to refine content moderation models.
The proposed method is evaluated on the Not Safe for Work (NSFW) and disturbing content detection tasks.
It significantly reduces human involvement, as 92.54% of the data is automatically annotated in the case of disturbing content.
arXiv Detail & Related papers (2022-12-01T17:19:13Z)
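The instance-level step of the Data-CUBE entry above lends itself to a short sketch (referenced in that entry); the helper name easy_to_difficult_batches and the toy difficulty proxy (text length) are assumptions for illustration, and the paper's task-level ordering to reduce cross-task interference is not modeled here.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def easy_to_difficult_batches(
    instances: Sequence[T],
    difficulty: Callable[[T], float],
    batch_size: int,
) -> List[List[T]]:
    """Sort instances by an assumed per-instance difficulty score and split
    them into mini-batches to be served from easiest to hardest."""
    ordered = sorted(instances, key=difficulty)
    return [list(ordered[i:i + batch_size]) for i in range(0, len(ordered), batch_size)]

# Toy usage with sentence length as a stand-in difficulty measure.
sentences = ["short one", "a somewhat longer training sentence", "text of middling length"]
batches = easy_to_difficult_batches(sentences, difficulty=len, batch_size=2)
```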