IF-GUIDE: Influence Function-Guided Detoxification of LLMs
- URL: http://arxiv.org/abs/2506.01790v2
- Date: Mon, 09 Jun 2025 04:36:12 GMT
- Title: IF-GUIDE: Influence Function-Guided Detoxification of LLMs
- Authors: Zachary Coalson, Juhan Bae, Nicholas Carlini, Sanghyun Hong
- Abstract summary: We study how training data contributes to the emergence of toxic behaviors in large-language models. We propose a proactive approach that leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. We present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective.
- Score: 53.051109450536885
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how training data contributes to the emergence of toxic behaviors in large-language models. Most prior work on reducing model toxicity adopts reactive approaches, such as fine-tuning pre-trained (and potentially toxic) models to align them with human values. In contrast, we propose a proactive approach, IF-Guide, which leverages influence functions to identify harmful tokens within any training data and suppress their impact during training. To this end, we first show that standard influence functions are ineffective at discovering harmful training records. We then present a novel adaptation that measures token-level attributions from training data to model toxicity, along with techniques for selecting toxic training documents and a learning objective that can be integrated into both pre-training and fine-tuning. Moreover, IF-Guide does not rely on human-preference data, which is typically required by existing alignment methods. In evaluation, we demonstrate that IF-Guide substantially reduces both explicit and implicit toxicity: by up to 10$\times$ compared to uncensored models, and up to 3$\times$ compared to baseline alignment methods (e.g., DPO and RAD), across both pre-training and fine-tuning scenarios. IF-Guide is computationally efficient: a billion-parameter model is not necessary for computing influence scores; a million-parameter model, with 7.5$\times$ fewer parameters, can effectively serve as a proxy for identifying harmful data. Our code is publicly available at: https://github.com/ztcoalson/IF-Guide
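The abstract describes scoring individual training tokens for their influence on model toxicity, computed with a small proxy model. A minimal sketch of that idea, assuming a first-order approximation (gradient similarity, with the inverse-Hessian term of full influence functions omitted); the toy proxy LM, the toxic measurement batch, and all sizes are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch: first-order, token-level influence of training data on "toxicity".
import torch
import torch.nn.functional as F

torch.manual_seed(0)
V, D = 100, 32                      # toy vocab / hidden size
proxy = torch.nn.Sequential(torch.nn.Embedding(V, D), torch.nn.Linear(D, V))
params = [p for p in proxy.parameters() if p.requires_grad]

def flat_grad(loss):
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

# Hypothetical "measurement" batch: sequences the practitioner labels toxic.
toxic_ids = torch.randint(0, V, (4, 16))
logits = proxy(toxic_ids[:, :-1])
toxicity_loss = F.cross_entropy(logits.reshape(-1, V), toxic_ids[:, 1:].reshape(-1))
g_toxic = flat_grad(toxicity_loss)

# Token-level scores for one training document: gradient similarity per token.
doc = torch.randint(0, V, (1, 16))
scores = []
for t in range(doc.shape[1] - 1):
    logit_t = proxy(doc[:, : t + 1])[:, -1]     # next-token prediction at position t
    loss_t = F.cross_entropy(logit_t, doc[:, t + 1])
    scores.append(torch.dot(flat_grad(loss_t), g_toxic).item())

# Tokens with large positive scores would be suppressed during training.
print([round(s, 4) for s in scores])
```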
Related papers
- Rescaled Influence Functions: Accurate Data Attribution in High Dimension [6.812390750464419]
We present rescaled influence functions (RIF), a new tool for data attribution which can be used as a drop-in replacement for influence functions. We compare IF and RIF on a range of real-world datasets, showing that RIFs offer significantly better predictions in practice.
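As intuition for why rescaling can help, the textbook linear-regression case is instructive: the classical influence estimate drops a leverage factor $1/(1-h_{ii})$, and restoring it recovers the exact leave-one-out parameter change. The sketch below shows only this standard identity, not necessarily the paper's exact estimator.

```python
# OLS leave-one-out: plain influence vs. leverage-rescaled influence.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
theta = XtX_inv @ X.T @ y
resid = y - X @ theta
H = X @ XtX_inv @ X.T                        # hat matrix; h_ii = leverage

i = 7
plain_if = -XtX_inv @ X[i] * resid[i]        # classical influence step
rescaled = plain_if / (1.0 - H[i, i])        # leverage-rescaled version

theta_loo = np.linalg.lstsq(np.delete(X, i, 0), np.delete(y, i, 0), rcond=None)[0]
print("exact LOO change:", theta_loo - theta)
print("plain influence :", plain_if)
print("rescaled        :", rescaled)        # matches the exact change
```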
arXiv Detail & Related papers (2025-06-07T04:19:21Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$^2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
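The summary implies an inference-time propose/verify/revise loop that S$^2$R trains with reinforcement learning. A stub-level sketch of the control flow only; `propose`, `verify`, and `revise` are hypothetical stand-ins for model calls, not the paper's API.

```python
# Sketch of a self-verify / self-correct inference loop with stubbed model calls.
def propose(question: str) -> str:
    return "41"                      # stub first attempt (wrong on purpose)

def verify(question: str, answer: str) -> bool:
    return answer == str(19 + 23)    # stub checker for "19 + 23"

def revise(question: str, answer: str) -> str:
    return str(int(answer) + 1)      # stub revision policy

def self_correct(question: str, max_rounds: int = 5) -> str:
    answer = propose(question)
    for _ in range(max_rounds):
        if verify(question, answer): # model-issued verification step
            break
        answer = revise(question, answer)
    return answer

print(self_correct("What is 19 + 23?"))  # -> 42 after one revision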
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - Delta-Influence: Unlearning Poisons via Influence Functions [18.97730860349776]
We introduce $\Delta$-Influence, a novel approach to trace abnormal model behavior back to poisoned training data.
$\Delta$-Influence applies data transformations that sever the link between poisoned training data and compromised test points.
We show that $\Delta$-Influence consistently achieves the best unlearning across all settings.
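A loose illustration of the trace-then-unlearn pipeline the summary describes: rank training points by gradient similarity to a compromised test prediction, then apply gradient ascent to the flagged points. The data transformations central to $\Delta$-Influence are not reproduced; the model, data, and thresholds are toy assumptions.

```python
import torch

torch.manual_seed(0)
X = torch.randn(200, 10)
y = (X[:, 0] > 0).float()
y[:10] = 1 - y[:10]                              # simulated poisoned labels

w = torch.zeros(10, requires_grad=True)
loss_fn = torch.nn.BCEWithLogitsLoss()
opt = torch.optim.SGD([w], lr=0.5)
for _ in range(300):
    opt.zero_grad(); loss_fn(X @ w, y).backward(); opt.step()

def grad_of(xi, yi):
    return torch.autograd.grad(loss_fn((xi @ w).unsqueeze(0), yi.unsqueeze(0)), w)[0]

# Trace: rank training points by gradient similarity to a misbehaving test
# point (full influence functions would add an inverse-Hessian product).
g_probe = grad_of(X[2], y[2])                    # the compromised prediction
scores = torch.stack([torch.dot(grad_of(X[i], y[i]), g_probe) for i in range(200)])
flagged = scores.topk(10).indices                # most influential candidates

# Unlearn: a few gradient-ascent steps on the flagged points only.
for _ in range(25):
    opt.zero_grad()
    (-loss_fn(X[flagged] @ w, y[flagged])).backward()
    opt.step()
print("flagged indices:", flagged.tolist())
```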
arXiv Detail & Related papers (2024-11-20T22:15:10Z) - Exploiting Pre-trained Models for Drug Target Affinity Prediction with Nearest Neighbors [58.661454334877256]
Drug-Target binding Affinity (DTA) prediction is essential for drug discovery.
Despite the application of deep learning methods to DTA prediction, the achieved accuracy remains suboptimal.
We propose $k$NN-DTA, a non-representation embedding-based retrieval method built on a pre-trained DTA prediction model.
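A plausible sketch of retrieval-augmented affinity prediction in this spirit: retrieve the nearest training pairs in a (pretrained) embedding space and blend their measured labels with the model's own output. The stub embeddings, the softmax-over-distance weights, and the mixing weight `lam` are assumptions, not the exact $k$NN-DTA aggregation.

```python
import numpy as np

rng = np.random.default_rng(0)
train_emb = rng.normal(size=(1000, 64))    # embeddings from a pretrained DTA model (stub)
train_y = rng.normal(size=1000)            # measured binding affinities (stub)

def knn_dta(query_emb, model_pred, k=8, lam=0.5, temp=1.0):
    d2 = ((train_emb - query_emb) ** 2).sum(axis=1)
    idx = np.argpartition(d2, k)[:k]                  # k nearest training pairs
    w = np.exp(-d2[idx] / temp); w /= w.sum()         # distance-softmax weights
    knn_pred = float(w @ train_y[idx])
    return lam * model_pred + (1 - lam) * knn_pred    # interpolate with the model

print(knn_dta(rng.normal(size=64), model_pred=6.2))
```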
arXiv Detail & Related papers (2024-07-21T15:49:05Z) - $\nabla\tau$: Gradient-based and Task-Agnostic machine Unlearning [7.04736023670375]
We introduce Gradient-based and Task-Agnostic machine Unlearning ($\nabla\tau$).
$\nabla\tau$ applies adaptive gradient ascent to the data to be forgotten while using standard gradient descent for the remaining data.
We evaluate our framework's effectiveness using a set of well-established Membership Inference Attack metrics.
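The summary states the update rule directly: adaptive gradient ascent on the forget set, standard gradient descent on the retain set. A minimal sketch on a toy model; using Adam as the "adaptive" ascent optimizer is an assumption.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(20, 2)
ce = torch.nn.CrossEntropyLoss()

X = torch.randn(256, 20); y = torch.randint(0, 2, (256,))
forget, retain = (X[:32], y[:32]), (X[32:], y[32:])

ascent = torch.optim.Adam(model.parameters(), lr=1e-3)   # adaptive, on forget set
descent = torch.optim.SGD(model.parameters(), lr=1e-2)   # standard, on retain set

for _ in range(100):
    ascent.zero_grad()
    (-ce(model(forget[0]), forget[1])).backward()        # maximize forget loss
    ascent.step()
    descent.zero_grad()
    ce(model(retain[0]), retain[1]).backward()           # preserve utility
    descent.step()

print("forget loss:", ce(model(forget[0]), forget[1]).item())
print("retain loss:", ce(model(retain[0]), retain[1]).item())
```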
arXiv Detail & Related papers (2024-03-21T12:11:26Z) - Unlearning Traces the Influential Training Data of Language Models [31.33791825286853]
This paper presents UnTrac, which traces the influence of a training dataset on the model's performance by unlearning it.
We propose a more scalable approach, UnTrac-Inv, which unlearns a test dataset and evaluates the unlearned model on training datasets.
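A compact sketch of the recipe as summarized: from a trained checkpoint, unlearn one training dataset by gradient ascent and read its influence off the change in evaluation loss. The toy model, step count, and learning rate are illustrative.

```python
import copy
import torch

torch.manual_seed(0)
model = torch.nn.Linear(16, 2)
ce = torch.nn.CrossEntropyLoss()

datasets = {name: (torch.randn(64, 16), torch.randint(0, 2, (64,)))
            for name in ["web", "books", "forums"]}
eval_X, eval_y = torch.randn(64, 16), torch.randint(0, 2, (64,))

def influence_by_unlearning(name, steps=20, lr=1e-2):
    m = copy.deepcopy(model)                 # start from the trained checkpoint
    opt = torch.optim.SGD(m.parameters(), lr=lr)
    before = ce(m(eval_X), eval_y).item()
    Xd, yd = datasets[name]
    for _ in range(steps):
        opt.zero_grad()
        (-ce(m(Xd), yd)).backward()          # ascend: unlearn this dataset
        opt.step()
    return ce(m(eval_X), eval_y).item() - before   # loss rise => dataset helped

for name in datasets:
    print(name, round(influence_by_unlearning(name), 4))
```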
arXiv Detail & Related papers (2024-01-26T23:17:31Z) - Recommendation Unlearning via Influence Function [42.4931807753579]
We propose a new Influence Function-based Recommendation Unlearning (IFRU) framework, which efficiently updates the model without retraining.
IFRU achieves more than 250 times acceleration compared to retraining-based methods with recommendation performance comparable to full retraining.
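One standard way to realize an influence-function update without retraining, shown on a tiny logistic model where the Hessian is exact, is a single Newton-style step toward the optimum of the retained loss. This is the generic influence-unlearning update, not IFRU's recommendation-specific machinery.

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
X = torch.randn(300, 5); y = (torch.rand(300) < 0.5).float()
w = torch.zeros(5, requires_grad=True)
bce = torch.nn.functional.binary_cross_entropy_with_logits
opt = torch.optim.SGD([w], lr=1.0)
for _ in range(500):
    opt.zero_grad(); bce(X @ w, y).backward(); opt.step()

remove = torch.arange(20)                    # interactions to unlearn
keep = torch.arange(20, 300)

def mean_loss_keep(v):                       # loss the updated model should fit
    return bce(X[keep] @ v, y[keep])

H = hessian(mean_loss_keep, w.detach())      # exact 5x5 Hessian of retained loss
g_rm = torch.autograd.grad(bce(X[remove] @ w, y[remove]), w)[0]
# One Newton step: theta' = theta + (|removed|/|kept|) * H^{-1} grad_removed
w_new = w.detach() + (len(remove) / len(keep)) * torch.linalg.solve(H, g_rm)
print(w_new)
```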
arXiv Detail & Related papers (2023-07-05T09:42:51Z) - Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for a better worst-group performance.
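A guess at the training loop the summary implies, assuming "progressive data expansion" means warming up on a small group-balanced core (so the true feature is learned before the spurious one dominates) and then gradually adding the remaining imbalanced data. The schedule and sizes are invented for illustration.

```python
import torch

torch.manual_seed(0)
n, d = 1000, 20
groups = (torch.rand(n) < 0.9).long()          # 0 = minority, 1 = majority
X = torch.randn(n, d); y = torch.randint(0, 2, (n,))

# Group-balanced core: equal numbers from each group.
min_idx = (groups == 0).nonzero().squeeze(1)
maj_idx = (groups == 1).nonzero().squeeze(1)
k = min(len(min_idx), len(maj_idx), 50)
core = torch.cat([min_idx[:k], maj_idx[:k]])
rest = torch.tensor([i for i in range(n) if i not in set(core.tolist())])

model = torch.nn.Linear(d, 2)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
ce = torch.nn.CrossEntropyLoss()

active = core
for epoch in range(30):
    if epoch >= 10 and len(rest) > 0:          # expand after the warm-up phase
        take, rest = rest[:100], rest[100:]
        active = torch.cat([active, take])
    opt.zero_grad()
    ce(model(X[active]), y[active]).backward()
    opt.step()
print("final training-set size:", len(active))
```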
arXiv Detail & Related papers (2023-06-08T05:44:06Z) - A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity [84.6421260559093]
This study is the largest set of experiments to validate, quantify, and expose undocumented intuitions about text pretraining.
Our findings indicate there does not exist a one-size-fits-all solution to filtering training data.
arXiv Detail & Related papers (2023-05-22T15:57:53Z) - Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs [27.314239745883967]
Training data attribution (TDA) methods trace a model's prediction on any given example back to specific influential training examples.
We propose Simfluence, a new paradigm for TDA where the goal is not to produce a single influence score per example, but instead a training run simulator.
Simfluence captures non-additive interactions and is often able to predict the spiky trajectory of individual example losses with surprising fidelity.
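A deliberately simplified simulator in the spirit of this summary: regress the step-to-step change in a target example's loss on batch-membership masks logged from training runs. The paper's simulator is richer (it captures non-additive interactions); this additive version only shows the fitting setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_steps, batch = 30, 400, 4
true_effect = rng.normal(scale=0.05, size=n_train)   # hidden per-example effect

# Simulated logs: batch-membership masks and the target's loss trajectory.
M = np.zeros((n_steps, n_train))
loss = np.empty(n_steps + 1); loss[0] = 2.0
for t in range(n_steps):
    M[t, rng.choice(n_train, batch, replace=False)] = 1
    loss[t + 1] = loss[t] - M[t] @ true_effect + rng.normal(scale=0.01)

# Fit: delta_loss_t ~ M_t @ beta  (least squares over logged steps).
beta, *_ = np.linalg.lstsq(M, loss[1:] - loss[:-1], rcond=None)
print("correlation with true effects:",
      np.corrcoef(-beta, true_effect)[0, 1].round(3))
```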
arXiv Detail & Related papers (2023-03-14T17:47:25Z) - One-Pixel Shortcut: on the Learning Preference of Deep Neural Networks [28.502489028888608]
Unlearnable examples (ULEs) aim to protect data from unauthorized usage for training DNNs.
Under adversarial training, the unlearnability of error-minimizing noise severely degrades.
We propose a novel model-free method, named One-Pixel Shortcut, which only perturbs a single pixel of each image and makes the dataset unlearnable.
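The summary is concrete enough to sketch directly: write one class-dependent pixel into every image so that pixel becomes a learnable shortcut. The pixel positions and value below are arbitrary choices; the paper searches for maximally effective ones.

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.random((100, 32, 32, 3)).astype(np.float32)   # toy CIFAR-like data
labels = rng.integers(0, 10, size=100)

def one_pixel_shortcut(imgs, labs, n_classes=10):
    out = imgs.copy()
    for c in range(n_classes):
        r, col = divmod(c * 3, 32)            # a distinct pixel per class (arbitrary)
        out[labs == c, r, col, :] = 1.0       # saturate that one pixel
    return out

poisoned = one_pixel_shortcut(images, labels)
changed = (poisoned != images).reshape(100, -1).any(axis=1)
print("images touched:", int(changed.sum()), "(one pixel each)")
```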
arXiv Detail & Related papers (2022-05-24T15:17:52Z) - Datamodels: Predicting Predictions from Training Data [86.66720175866415]
We present a conceptual framework, datamodeling, for analyzing the behavior of a model class in terms of the training data.
We show that even simple linear datamodels can successfully predict model outputs.
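A small end-to-end sketch of the datamodeling recipe as summarized: train many models on random subsets of the training data, record each model's output on a target example, and fit a linear map from subset-membership masks to that output. The inner "model" is plain least squares so the loop runs quickly; the original work uses regularized regression at much larger scale.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 60, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)
x_target = rng.normal(size=d)

masks, outputs = [], []
for _ in range(500):
    m = rng.random(n) < 0.5                       # random half subset
    theta, *_ = np.linalg.lstsq(X[m], y[m], rcond=None)
    masks.append(m.astype(float))
    outputs.append(x_target @ theta)              # model output on the target

# Linear datamodel: mask features + intercept -> predicted model output.
A = np.hstack([np.array(masks), np.ones((500, 1))])
w, *_ = np.linalg.lstsq(A, np.array(outputs), rcond=None)

m_new = rng.random(n) < 0.5
theta_new, *_ = np.linalg.lstsq(X[m_new], y[m_new], rcond=None)
print("datamodel:", float(np.append(m_new.astype(float), 1.0) @ w),
      "actual:", float(x_target @ theta_new))
```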
arXiv Detail & Related papers (2022-02-01T18:15:24Z) - Guided Interpolation for Adversarial Training [73.91493448651306]
As training progresses, the training data becomes less and less attackable, undermining the robustness enhancement.
We propose the guided interpolation framework (GIF), which employs the previous epoch's meta information to guide the data's adversarial variants.
Compared with the vanilla mixup, the GIF can provide a higher ratio of attackable data, which is beneficial to the robustness enhancement.
arXiv Detail & Related papers (2021-02-15T03:55:08Z) - Self-Adaptive Training: beyond Empirical Risk Minimization [15.59721834388181]
We propose a new training algorithm that dynamically corrects problematic labels using model predictions, without incurring extra computational cost.
Self-adaptive training significantly improves generalization over various levels of noises, and mitigates the overfitting issue in both natural and adversarial training.
Experiments on CIFAR and ImageNet datasets verify the effectiveness of our approach in two applications.
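A minimal sketch of dynamic label correction consistent with this summary: maintain an exponential moving average of the model's own predictions per example and train against the blended targets, so noisy labels get corrected over time. The momentum value and toy setup are assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d, C = 512, 20, 5
X = torch.randn(n, d)
y = torch.randint(0, C, (n,))                    # possibly noisy labels
targets = F.one_hot(y, C).float()                # running soft targets

model = torch.nn.Linear(d, C)
opt = torch.optim.SGD(model.parameters(), lr=0.1)
alpha = 0.9                                      # EMA momentum

for _ in range(200):
    logits = model(X)
    with torch.no_grad():                        # blend in current predictions
        targets = alpha * targets + (1 - alpha) * F.softmax(logits, dim=1)
    loss = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("loss on corrected targets:", loss.item())
```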
arXiv Detail & Related papers (2020-02-24T15:47:10Z)