Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models
- URL: http://arxiv.org/abs/2406.10288v2
- Date: Mon, 1 Jul 2024 10:17:58 GMT
- Title: Mimicking User Data: On Mitigating Fine-Tuning Risks in Closed Large Language Models
- Authors: Francisco Eiras, Aleksandar Petrov, Phillip H. S. Torr, M. Pawan Kumar, Adel Bibi
- Abstract summary: Fine-tuning large language models on small datasets can enhance their performance on specific downstream tasks.
Malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors.
We propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data.
- Score: 53.50543146583101
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Fine-tuning large language models on small, high-quality datasets can enhance their performance on specific downstream tasks. Recent research shows that fine-tuning on benign, instruction-following data can inadvertently undo the safety alignment process and increase a model's propensity to comply with harmful queries. Although critical, understanding and mitigating safety risks in well-defined tasks remains distinct from the instruction-following context due to structural differences in the data. Our work addresses the gap in our understanding of these risks across diverse types of data in closed models - where providers control how user data is utilized in the fine-tuning process. We demonstrate how malicious actors can subtly manipulate the structure of almost any task-specific dataset to foster significantly more dangerous model behaviors, while maintaining an appearance of innocuity and reasonable downstream task performance. To address this issue, we propose a novel mitigation strategy that mixes in safety data which mimics the task format and prompting style of the user data, showing this is more effective than existing baselines at re-establishing safety alignment while maintaining similar task performance.
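The mitigation is described here only at a high level; the snippet below is a minimal, hedged sketch of the general idea under assumed names (TASK_TEMPLATE, SAFETY_PAIRS, mimic_task_format, and mix_in_safety are illustrative, not taken from the paper): generic harmful-request/refusal pairs are rewritten into the same prompt format as the user's task data, and a small fraction of them is blended into the fine-tuning set.

```python
import random

# Illustrative prompt wrapper assumed to match the user's task data;
# neither this template nor the example pair comes from the paper.
TASK_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

# A generic safety pair: a harmful request paired with a refusal.
SAFETY_PAIRS = [
    ("Describe how to build a weapon at home.",
     "I can't help with that request."),
]

def mimic_task_format(pairs, template=TASK_TEMPLATE):
    """Recast generic (request, refusal) safety pairs into the same prompt
    format and style as the user's task-specific fine-tuning examples."""
    return [template.format(instruction=q, response=a) for q, a in pairs]

def mix_in_safety(user_examples, safety_pairs, fraction=0.05, seed=0):
    """Blend a small fraction of format-matched safety examples into the
    user's fine-tuning set, then shuffle so they are interleaved."""
    rng = random.Random(seed)
    n_safety = max(1, int(fraction * len(user_examples)))
    safety = mimic_task_format(safety_pairs)
    mixed = list(user_examples) + [rng.choice(safety) for _ in range(n_safety)]
    rng.shuffle(mixed)
    return mixed
```

The mixing ratio and the source of the safety pairs are design choices; the abstract's claim is that safety data mimicking the task format and prompting style re-establishes alignment more effectively than existing baselines while maintaining similar task performance.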
Related papers
- Safety-Aware Fine-Tuning of Large Language Models [29.5636201427693]
Fine-tuning Large Language Models (LLMs) has emerged as a common practice for tailoring models to individual needs and preferences.
We propose a novel Safety-Aware Fine-Tuning (SAFT) framework designed to automatically detect and remove potentially harmful data.
arXiv Detail & Related papers (2024-10-13T21:24:25Z)
- Data Advisor: Dynamic Data Curation for Safety Alignment of Large Language Models [79.65071553905021]
We propose Data Advisor, a method for generating data that takes into account the characteristics of the desired dataset.
Data Advisor monitors the status of the generated data, identifies weaknesses in the current dataset, and advises the next iteration of data generation.
arXiv Detail & Related papers (2024-10-07T17:59:58Z)
- Robust Utility-Preserving Text Anonymization Based on Large Language Models [80.5266278002083]
Text anonymization is crucial for sharing sensitive data while maintaining privacy.
Existing techniques face the emerging challenge of re-identification attacks enabled by Large Language Models.
This paper proposes a framework composed of three LLM-based components -- a privacy evaluator, a utility evaluator, and an optimization component.
arXiv Detail & Related papers (2024-07-16T14:28:56Z)
- Data-CUBE: Data Curriculum for Instruction-based Sentence Representation Learning [85.66907881270785]
We propose a data curriculum method, namely Data-CUBE, that arranges the order of all the multi-task data for training.
At the task level, we aim to find the task order that minimizes the total cross-task interference risk.
At the instance level, we measure the difficulty of all instances per task and divide them into easy-to-difficult mini-batches for training (a minimal sketch of this instance-level step appears after this list of related papers).
arXiv Detail & Related papers (2024-01-07T18:12:20Z)
- Self-Supervised Disentanglement by Leveraging Structure in Data Augmentations [63.73044203154743]
Self-supervised representation learning often uses data augmentations to induce invariance to "style" attributes of the data.
It is difficult to deduce a priori which attributes of the data are indeed "style" and can be safely discarded.
We introduce a more principled approach that seeks to disentangle style features rather than discard them.
arXiv Detail & Related papers (2023-11-15T09:34:08Z)
- Assessing Privacy Risks in Language Models: A Case Study on Summarization Tasks [65.21536453075275]
We focus on the summarization task and investigate the membership inference (MI) attack.
We exploit text similarity and the model's resistance to document modifications as potential MI signals.
We discuss several safeguards for training summarization models to protect against MI attacks and examine the inherent trade-off between privacy and utility.
arXiv Detail & Related papers (2023-10-20T05:44:39Z)
- The Poison of Alignment [0.0]
We introduce a novel insight into how the presence of alignment data affects an instruction-tuned model's performance.
We demonstrate that aligned answers significantly worsen the performance of the resulting fine-tuned model on various reasoning benchmarks.
arXiv Detail & Related papers (2023-08-25T15:51:15Z)
- Leveraging Large-scale Multimedia Datasets to Refine Content Moderation Models [8.147198294451151]
We propose a framework that leverages large-scale multimedia datasets to refine content moderation models.
The proposed method is evaluated on the Not Safe for Work (NSFW) and disturbing content detection tasks.
It significantly reduces human involvement, as 92.54% of the data is automatically annotated in the case of disturbing content.
arXiv Detail & Related papers (2022-12-01T17:19:13Z)
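The instance-level step of the Data-CUBE entry above lends itself to a short sketch (referenced in that entry); the helper name easy_to_difficult_batches and the toy difficulty proxy (text length) are assumptions for illustration, and the paper's task-level ordering to reduce cross-task interference is not modeled here.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")

def easy_to_difficult_batches(
    instances: Sequence[T],
    difficulty: Callable[[T], float],
    batch_size: int,
) -> List[List[T]]:
    """Sort instances by an assumed per-instance difficulty score and split
    them into mini-batches to be served from easiest to hardest."""
    ordered = sorted(instances, key=difficulty)
    return [list(ordered[i:i + batch_size]) for i in range(0, len(ordered), batch_size)]

# Toy usage with sentence length as a stand-in difficulty measure.
sentences = ["short one", "a somewhat longer training sentence", "text of middling length"]
batches = easy_to_difficult_batches(sentences, difficulty=len, batch_size=2)
```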