Differentially Private Steering for Large Language Model Alignment
- URL: http://arxiv.org/abs/2501.18532v1
- Date: Thu, 30 Jan 2025 17:58:36 GMT
- Title: Differentially Private Steering for Large Language Model Alignment
- Authors: Anmol Goel, Yaxi Hu, Iryna Gurevych, Amartya Sanyal,
- Abstract summary: We present the first study of aligning Large Language Models with private datasets.
Our work proposes the textitunderlinePrivate underlineSteering for LLM underlineAment (PSA) algorithm.
Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance.
- Score: 55.30573701583768
- License:
- Abstract: Aligning Large Language Models (LLMs) with human values and away from undesirable behaviors (such as hallucination) has become increasingly important. Recently, steering LLMs towards a desired behavior via activation editing has emerged as an effective method to mitigate harmful generations at inference-time. Activation editing modifies LLM representations by preserving information from positive demonstrations (e.g., truthful) and minimising information from negative demonstrations (e.g., hallucinations). When these demonstrations come from a private dataset, the aligned LLM may leak private information contained in those private samples. In this work, we present the first study of aligning LLM behavior with private datasets. Our work proposes the \textit{\underline{P}rivate \underline{S}teering for LLM \underline{A}lignment (PSA)} algorithm to edit LLM activations with differential privacy (DP) guarantees. We conduct extensive experiments on seven different benchmarks with open-source LLMs of different sizes (0.5B to 7B) and model families (LlaMa, Qwen, Mistral and Gemma). Our results show that PSA achieves DP guarantees for LLM alignment with minimal loss in performance, including alignment metrics, open-ended text generation quality, and general-purpose reasoning. We also develop the first Membership Inference Attack (MIA) for evaluating and auditing the empirical privacy for the problem of LLM steering via activation editing. Our attack is tailored for activation editing and relies solely on the generated texts without their associated probabilities. Our experiments support the theoretical guarantees by showing improved guarantees for our \textit{PSA} algorithm compared to several existing non-private techniques.
Related papers
- LLM-Lasso: A Robust Framework for Domain-Informed Feature Selection and Regularization [59.75242204923353]
We introduce LLM-Lasso, a framework that leverages large language models (LLMs) to guide feature selection in Lasso regression.
LLMs generate penalty factors for each feature, which are converted into weights for the Lasso penalty using a simple, tunable model.
Features identified as more relevant by the LLM receive lower penalties, increasing their likelihood of being retained in the final model.
arXiv Detail & Related papers (2025-02-15T02:55:22Z) - Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation [0.5070610131852027]
Large language models (LLMs) can be effectively misused for generating disinformation news articles.
This study fills this gap by evaluation of vulnerabilities of recent open and closed LLMs.
Our results demonstrate the need for stronger safety-filters and disclaimers.
arXiv Detail & Related papers (2024-12-18T09:48:53Z) - LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints [86.59857711385833]
We introduce RealInstruct, the first benchmark designed to evaluate LLMs' ability to follow real-world multi-constrained instructions.
To address the performance gap between open-source and proprietary models, we propose the Decompose, Critique and Refine (DeCRIM) self-correction pipeline.
Our results show that DeCRIM improves Mistral's performance by 7.3% on RealInstruct and 8.0% on IFEval even with weak feedback.
arXiv Detail & Related papers (2024-10-09T01:25:10Z) - zsLLMCode: An Effective Approach for Functional Code Embedding via LLM with Zero-Shot Learning [6.976968804436321]
Large language models (LLMs) have the capability of zero-shot learning, which does not require training or fine-tuning.
We propose zsLLMCode, a novel approach that generates functional code embeddings using LLMs.
arXiv Detail & Related papers (2024-09-23T01:03:15Z) - SELF-GUIDE: Better Task-Specific Instruction Following via Self-Synthetic Finetuning [70.21358720599821]
Large language models (LLMs) hold the promise of solving diverse tasks when provided with appropriate natural language prompts.
We propose SELF-GUIDE, a multi-stage mechanism in which we synthesize task-specific input-output pairs from the student LLM.
We report an absolute improvement of approximately 15% for classification tasks and 18% for generation tasks in the benchmark's metrics.
arXiv Detail & Related papers (2024-07-16T04:41:58Z) - Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models [52.98743860365194]
We propose a new fine-tuning method called Self-Play fIne-tuNing (SPIN)
At the heart of SPIN lies a self-play mechanism, where the LLM refines its capability by playing against instances of itself.
This sheds light on the promise of self-play, enabling the achievement of human-level performance in LLMs without the need for expert opponents.
arXiv Detail & Related papers (2024-01-02T18:53:13Z) - The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context
Learning [61.68787689234622]
A recent study, LIMA, shows that using merely 1K examples for alignment tuning can achieve significant alignment performance as well.
This raises questions about how exactly the alignment tuning transforms a base LLM.
We show that the gap between tuning-free and tuning-based alignment methods can be significantly reduced through strategic prompting.
arXiv Detail & Related papers (2023-12-04T00:46:11Z) - PrivLM-Bench: A Multi-level Privacy Evaluation Benchmark for Language Models [42.20437015301152]
We present PrivLM-Bench, a benchmark for evaluating the privacy leakage of language models (LMs)
Instead of only reporting DP parameters, PrivLM-Bench sheds light on the neglected inference data privacy during actual usage.
We conduct extensive experiments on three datasets of GLUE for mainstream LMs.
arXiv Detail & Related papers (2023-11-07T14:55:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.