The effect of fine-tuning on language model toxicity
- URL: http://arxiv.org/abs/2410.15821v1
- Date: Mon, 21 Oct 2024 09:39:09 GMT
- Title: The effect of fine-tuning on language model toxicity
- Authors: Will Hawkins, Brent Mittelstadt, Chris Russell
- Abstract summary: Fine-tuning language models has become increasingly popular following the proliferation of open models.
We assess how fine-tuning can impact different open models' propensity to output toxic content.
We show that small amounts of parameter-efficient fine-tuning on developer-tuned models via low-rank adaptation can significantly alter models' toxicity rates.
- Score: 7.539523407936451
- License:
- Abstract: Fine-tuning language models has become increasingly popular following the proliferation of open models and improvements in cost-effective parameter-efficient fine-tuning. However, fine-tuning can influence model properties such as safety. We assess how fine-tuning can impact different open models' propensity to output toxic content, examining the effects of fine-tuning Gemma, Llama, and Phi models on toxicity through three experiments. First, we compare how toxicity is reduced by model developers during instruction-tuning. Second, we show that small amounts of parameter-efficient fine-tuning on developer-tuned models via low-rank adaptation on a non-adversarial dataset can significantly alter these results across models. Finally, we highlight the impact of this in the wild, demonstrating how toxicity rates of models fine-tuned by community contributors can deviate in hard-to-predict ways.
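To make the abstract concrete, the sketch below illustrates the kind of workflow it describes: attaching a low-rank adaptation (LoRA) adapter to an instruction-tuned open model with the Hugging Face PEFT library, then scoring sampled completions with an off-the-shelf toxicity classifier. This is a minimal illustration under stated assumptions, not the authors' experimental setup: the checkpoint (google/gemma-2b-it), target module names, hyperparameters, classifier (unitary/toxic-bert), and example prompt are all assumed, and the fine-tuning loop on a non-adversarial dataset is omitted.

```python
# Minimal sketch (assumed setup, not the paper's): LoRA adapter on an
# instruction-tuned open model, plus a simple toxicity probe of its outputs.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from peft import LoraConfig, TaskType, get_peft_model

BASE_ID = "google/gemma-2b-it"  # assumed checkpoint; any open instruction-tuned model works
tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(BASE_ID)

# Low-rank adaptation: train only small rank-r update matrices on the attention
# projections; the base weights stay frozen.
lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # rank of the update matrices
    lora_alpha=16,                         # scaling applied to the update
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # assumed names; they vary by architecture
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()  # typically a small fraction of the full model

# ... fine-tune here on a non-adversarial instruction dataset (loop omitted) ...

# Off-the-shelf toxicity classifier (assumed choice of scorer).
toxicity_clf = pipeline("text-classification", model="unitary/toxic-bert", top_k=None)

def toxicity_score(prompt: str, max_new_tokens: int = 64) -> float:
    """Generate a completion for `prompt` and return the classifier's 'toxic' score."""
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=True)
    completion = tokenizer.decode(output[0], skip_special_tokens=True)
    raw = toxicity_clf(completion)
    # Output nesting differs slightly across transformers versions; flatten if needed.
    scores = raw[0] if isinstance(raw[0], list) else raw
    return next(s["score"] for s in scores if s["label"] == "toxic")  # assumed label name

print(toxicity_score("Finish the sentence: people from that neighbourhood are"))
```

Running such a probe on the same prompts before and after adapter training, and again on community-published fine-tunes from the Hub, is the kind of comparison the paper's three experiments formalize.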
Related papers
- Causal Fine-Tuning and Effect Calibration of Non-Causal Predictive Models [1.3124513975412255]
This paper proposes techniques to enhance the performance of non-causal models for causal inference using data from randomized experiments.
In domains like advertising, customer retention, and precision medicine, non-causal models that predict outcomes under no intervention are often used to score individuals and rank them according to the expected effectiveness of an intervention.
arXiv Detail & Related papers (2024-06-13T20:18:16Z) - Low-rank finetuning for LLMs: A fairness perspective [54.13240282850982]
Low-rank approximation techniques have become the de facto standard for fine-tuning Large Language Models.
This paper investigates the effectiveness of these methods in capturing the shift of fine-tuning datasets from the initial pre-trained data distribution.
We show that low-rank fine-tuning inadvertently preserves undesirable biases and toxic behaviors.
arXiv Detail & Related papers (2024-05-28T20:43:53Z) - Unfamiliar Finetuning Examples Control How Language Models Hallucinate [75.03210107477157]
Large language models are known to hallucinate when faced with unfamiliar queries.
We find that unfamiliar examples in the models' finetuning data are crucial in shaping these errors.
Our work further investigates RL finetuning strategies for improving the factuality of long-form model generations.
arXiv Detail & Related papers (2024-03-08T18:28:13Z) - Let the Models Respond: Interpreting Language Model Detoxification
Through the Lens of Prompt Dependence [15.084940396969]
We apply popular detoxification approaches to several language models and quantify their impact on the resulting models' prompt dependence.
We evaluate the effectiveness of counter-narrative fine-tuning and compare it with reinforcement learning-driven detoxification.
arXiv Detail & Related papers (2023-09-01T22:26:06Z) - E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning [55.50908600818483]
Fine-tuning large-scale pretrained vision models for new tasks has become increasingly parameter-intensive.
We propose an Effective and Efficient Visual Prompt Tuning (E2VPT) approach for large-scale transformer-based model adaptation.
Our approach outperforms several state-of-the-art baselines on two benchmarks.
arXiv Detail & Related papers (2023-07-25T19:03:21Z) - A Three-regime Model of Network Pruning [47.92525418773768]
We use temperature-like and load-like parameters to model the impact of neural network (NN) training hyperparameters on pruning performance.
A key empirical result we identify is a sharp transition phenomenon: depending on the value of a load-like parameter in the pruned model, increasing the value of a temperature-like parameter in the pre-pruned model may either enhance or impair subsequent pruning performance.
Our model reveals that the dichotomous effect of high temperature is associated with transitions between distinct types of global structures in the post-pruned model.
arXiv Detail & Related papers (2023-05-28T08:09:25Z) - Your Autoregressive Generative Model Can be Better If You Treat It as an Energy-Based One [83.5162421521224]
We propose a unique method termed E-ARM for training autoregressive generative models.
E-ARM takes advantage of a well-designed energy-based learning objective.
We show that E-ARM can be trained efficiently and is capable of alleviating the exposure bias problem.
arXiv Detail & Related papers (2022-06-26T10:58:41Z) - Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best by learning both the content and the structure of the task, but it suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods such as prefix-tuning achieve comparable accuracy but generalize better to unseen answers and are more robust to adversarial splits; a minimal prefix-tuning sketch follows this list.
arXiv Detail & Related papers (2021-09-07T03:13:06Z) - Model Compression for Domain Adaptation through Causal Effect Estimation [20.842938440720303]
The ATE-guided Model Compression scheme (AMoC) generates many model candidates, each differing in which model components were removed.
We then select the best candidate with a stepwise regression model that uses the ATE (average treatment effect) to predict the expected performance on the target domain.
AMoC outperforms strong baselines on 46 of 60 domain pairs across two text classification tasks, with an average improvement of more than 3% in F1 above the strongest baseline.
arXiv Detail & Related papers (2021-01-18T14:18:02Z)
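Picking up the prefix-tuning comparison from the 2021-09-07 entry above, the following is a minimal sketch of how such an adapter is attached with the Hugging Face PEFT library; the checkpoint and hyperparameters are assumptions for illustration, not details from the cited paper.

```python
# Minimal prefix-tuning sketch (assumed setup): instead of low-rank weight updates,
# train a short sequence of virtual "prefix" vectors that are prepended to each
# layer's key/value states while all base weights stay frozen.
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

BASE_ID = "google/gemma-2b-it"  # assumed checkpoint, as in the earlier sketch
model = AutoModelForCausalLM.from_pretrained(BASE_ID)

prefix_cfg = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # length of the learned prefix
)
model = get_peft_model(model, prefix_cfg)
model.print_trainable_parameters()  # only the prefix vectors are trainable

# Training proceeds as for any causal LM (loop omitted); because the base weights are
# untouched, the toxicity probe from the earlier sketch can be reused to compare
# adaptation methods on identical prompts.
```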