Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
- URL: http://arxiv.org/abs/2405.13967v5
- Date: Sat, 01 Mar 2025 01:35:47 GMT
- Title: Model Editing as a Robust and Denoised variant of DPO: A Case Study on Toxicity
- Authors: Rheeya Uppaal, Apratim Dey, Yiting He, Yiqiao Zhong, Junjie Hu,
- Abstract summary: We introduce a tuning-free alignment alternative, ProFS, and demonstrate its effectiveness under the use case of toxicity reduction.<n>ProFS identifies a toxic subspace in the model parameter space and reduces model toxicity by projecting away the detected subspace.<n>We show that ProFS is more sample-efficient than DPO, further showcasing greater robustness to noisy data.
- Score: 6.786565820048478
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent alignment algorithms such as direct preference optimization (DPO) have been developed to improve the safety of large language models (LLMs) by training these models to match human behaviors exemplified by preference data. However, these methods are both computationally intensive and lacking in controllability and transparency, inhibiting their widespread use. Furthermore, these tuning-based methods require large-scale preference data for training and are susceptible to noisy preference data. In this paper, we introduce a tuning-free alignment alternative, ProFS (Projection Filter for Subspaces), and demonstrate its effectiveness under the use case of toxicity reduction. Grounded on theory from factor analysis, ProFS is a sample-efficient model editing approach that identifies a toxic subspace in the model parameter space and reduces model toxicity by projecting away the detected subspace. The toxic subspace is identified by extracting preference data embeddings from the language model, and removing non-toxic information from these embeddings. We show that ProFS is more sample-efficient than DPO, further showcasing greater robustness to noisy data. Finally, we attempt to connect tuning based alignment with editing, by establishing both theoretical and empirical connections between ProFS and DPO, showing that ProFS can be interpreted as a denoised version of a single DPO step.
Related papers
- DONOD: Robust and Generalizable Instruction Fine-Tuning for LLMs via Model-Intrinsic Dataset Pruning [22.704995231753397]
Ad-hoc instruction fine-tuning of large language models (LLMs) is widely adopted for domain-specific adaptation.
We propose DONOD, a lightweight model-intrinsic data pruning method.
By filtering out 70% of the full dataset, we improve target-domain accuracy by 14.90% and cross-domain accuracy by 5.67%.
arXiv Detail & Related papers (2025-04-21T02:25:03Z) - InPO: Inversion Preference Optimization with Reparametrized DDIM for Efficient Diffusion Model Alignment [12.823734370183482]
We introduce DDIM-InPO, an efficient method for direct preference alignment of diffusion models.
Our approach conceptualizes diffusion model as a single-step generative model, allowing us to fine-tune the outputs of specific latent variables selectively.
Experimental results indicate that our DDIM-InPO achieves state-of-the-art performance with just 400 steps of fine-tuning.
arXiv Detail & Related papers (2025-03-24T08:58:49Z) - DFF: Decision-Focused Fine-tuning for Smarter Predict-then-Optimize with Limited Data [7.70699448711673]
Decision-focused learning (DFL) offers an end-to-end approach to the predict-then-optimize (PO) framework by training predictive models directly on decision loss (DL)
Some predictive models are non-differentiable or black-box, which cannot be adjusted using gradient-based methods.
We propose a novel framework, Decision-Focused Fine-tuning (DFF), which embeds the DFL module into the PO pipeline via a novel bias correction module.
arXiv Detail & Related papers (2025-01-03T15:46:25Z) - Influence Functions for Scalable Data Attribution in Diffusion Models [52.92223039302037]
Diffusion models have led to significant advancements in generative modelling.
Yet their widespread adoption poses challenges regarding data attribution and interpretability.
In this paper, we aim to help address such challenges by developing an textitinfluence functions framework.
arXiv Detail & Related papers (2024-10-17T17:59:02Z) - ASFT: Aligned Supervised Fine-Tuning through Absolute Likelihood [14.512464277772194]
Aligned Supervised Fine-Tuning (ASFT) is an effective approach that better aligns Large Language Models with pair-wise datasets.
ASFT mitigates the issue where the DPO loss function decreases the probability of generating human-dispreferred data.
Extensive experiments demonstrate that ASFT is an effective alignment approach, consistently outperforming existing methods.
arXiv Detail & Related papers (2024-09-14T11:39:13Z) - Semi-supervised Regression Analysis with Model Misspecification and High-dimensional Data [8.619243141968886]
We present an inference framework for estimating regression coefficients in conditional mean models.
We develop an augmented inverse probability weighted (AIPW) method, employing regularized estimators for both propensity score (PS) and outcome regression (OR) models.
Our theoretical findings are verified through extensive simulation studies and a real-world data application.
arXiv Detail & Related papers (2024-06-20T00:34:54Z) - Everything to the Synthetic: Diffusion-driven Test-time Adaptation via Synthetic-Domain Alignment [76.44483062571611]
Test-time adaptation (TTA) aims to enhance the performance of source-domain pretrained models when tested on unknown shifted target domains.
Traditional TTA methods primarily adapt model weights based on target data streams, making model performance sensitive to the amount and order of target data.
Recent diffusion-driven TTA methods have demonstrated strong performance by using an unconditional diffusion model.
arXiv Detail & Related papers (2024-06-06T17:39:09Z) - Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z) - Diffusion Model Alignment Using Direct Preference Optimization [103.2238655827797]
Diffusion-DPO is a method to align diffusion models to human preferences by directly optimizing on human comparison data.
We fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO.
We also develop a variant that uses AI feedback and has comparable performance to training on human preferences.
arXiv Detail & Related papers (2023-11-21T15:24:05Z) - Balancing Act: Constraining Disparate Impact in Sparse Models [20.058720715290434]
We propose a constrained optimization approach that directly addresses the disparate impact of pruning.
Our formulation bounds the accuracy change between the dense and sparse models, for each sub-group.
Experimental results demonstrate that our technique scales reliably to problems involving large models and hundreds of protected sub-groups.
arXiv Detail & Related papers (2023-10-31T17:37:35Z) - Statistical Rejection Sampling Improves Preference Optimization [42.57245965632205]
We introduce a novel approach to source preference data from the target optimal policy using rejection sampling.
We also propose a unified framework that enhances the loss functions used in both Sequence Likelihood (SLiC) and Direct Preference Optimization (DPO) from a preference modeling standpoint.
arXiv Detail & Related papers (2023-09-13T01:07:25Z) - Fine-grained Retrieval Prompt Tuning [149.9071858259279]
Fine-grained Retrieval Prompt Tuning steers a frozen pre-trained model to perform the fine-grained retrieval task from the perspectives of sample prompt and feature adaptation.
Our FRPT with fewer learnable parameters achieves the state-of-the-art performance on three widely-used fine-grained datasets.
arXiv Detail & Related papers (2022-07-29T04:10:04Z) - LSDAT: Low-Rank and Sparse Decomposition for Decision-based Adversarial
Attack [74.5144793386864]
LSDAT crafts perturbations in the low-dimensional subspace formed by the sparse component of the input sample and that of an adversarial sample.
LSD works directly in the image pixel domain to guarantee that non-$ell$ constraints, such as sparsity, are satisfied.
arXiv Detail & Related papers (2021-03-19T13:10:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.