ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data
- URL: http://arxiv.org/abs/2504.14452v1
- Date: Sun, 20 Apr 2025 01:59:46 GMT
- Title: ParaPO: Aligning Language Models to Reduce Verbatim Reproduction of Pre-training Data
- Authors: Tong Chen, Faeze Brahman, Jiacheng Liu, Niloofar Mireshghallah, Weijia Shi, Pang Wei Koh, Luke Zettlemoyer, Hannaneh Hajishirzi
- Abstract summary: Language models (LMs) can memorize and reproduce segments verbatim even in non-adversarial settings. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation. We develop a variant of ParaPO that uses system prompts to control regurgitation behavior.
- Score: 95.69966871257381
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language models (LMs) can memorize and reproduce segments from their pretraining data verbatim even in non-adversarial settings, raising concerns about copyright, plagiarism, privacy, and creativity. We introduce Paraphrase Preference Optimization (ParaPO), a post-training method that fine-tunes LMs to reduce unintentional regurgitation while preserving their overall utility. ParaPO trains LMs to prefer paraphrased versions of memorized segments over the original verbatim content from the pretraining data. To maintain the ability to recall famous quotations when appropriate, we develop a variant of ParaPO that uses system prompts to control regurgitation behavior. In our evaluation on Llama3.1-8B, ParaPO consistently reduces regurgitation across all tested datasets (e.g., reducing the regurgitation metric from 17.3 to 12.9 in creative writing), whereas unlearning methods used in prior work to mitigate regurgitation are less effective outside their targeted unlearned domain (from 17.3 to 16.9). When applied to the instruction-tuned Tulu3-8B model, ParaPO with system prompting successfully preserves famous quotation recall while reducing unintentional regurgitation (from 8.7 to 6.3 in creative writing) when prompted not to regurgitate. In contrast, without ParaPO tuning, prompting the model not to regurgitate produces only a marginal reduction (8.7 to 8.4).
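The abstract says that ParaPO trains the model to prefer a paraphrase over the memorized original, but it does not spell out the training objective. As a minimal sketch, assuming a DPO-style pairwise loss (an assumption, not confirmed by the abstract), a preference pair and its loss could look like the following. The names `PreferencePair`, `make_pair`, and `dpo_loss`, the value of `beta`, and the example log-probabilities are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str    # context that precedes the memorized segment
    chosen: str    # paraphrase of the segment (preferred response)
    rejected: str  # original verbatim segment (dispreferred response)


def make_pair(prefix: str, verbatim: str, paraphrase: str) -> PreferencePair:
    """Build one pair in which the paraphrase is preferred over the verbatim text."""
    return PreferencePair(prompt=prefix, chosen=paraphrase, rejected=verbatim)


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_lp: float, policy_rejected_lp: float,
             ref_chosen_lp: float, ref_rejected_lp: float,
             beta: float = 0.1) -> float:
    """DPO-style loss for one pair, given sequence log-probabilities.

    Assumption: the log-probabilities come from the model being tuned (policy)
    and from a frozen copy of the starting model (reference).
    """
    margin = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # -log(sigmoid(margin)) is the same as softplus(-margin)
    return softplus(-margin)


# Illustrative pair built from a public-domain sentence.
pair = make_pair(
    prefix="It was the best of times,",
    verbatim=" it was the worst of times",
    paraphrase=" yet it was also the most miserable of eras",
)

# Made-up log-probabilities: the policy has shifted toward the paraphrase.
loss = dpo_loss(policy_chosen_lp=-8.0, policy_rejected_lp=-12.0,
                ref_chosen_lp=-10.0, ref_rejected_lp=-10.0)
print(pair.chosen, round(loss, 3))
```

Under this assumed objective, the loss falls below log 2 once the policy assigns relatively more probability to the paraphrase than the reference model does, and rises above it when the policy favors the verbatim continuation.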
Related papers
- Iterative Length-Regularized Direct Preference Optimization: A Case Study on Improving 7B Language Models to GPT-4 Level [50.897438358317686]
We show that iLR-DPO can enhance a 7B model to perform on par with GPT-4 without increasing verbosity.
Specifically, our 7B model achieves a 50.5% length-controlled win rate against GPT-4 Preview on AlpacaEval 2.0.
arXiv Detail & Related papers (2024-06-17T17:55:38Z) - Self-Play Preference Optimization for Language Model Alignment [75.83359213697854]
Recent advancements suggest that directly working with preference probabilities can yield a more accurate reflection of human preferences.
We propose a self-play-based method for language model alignment, which treats the problem as a constant-sum two-player game.
Our approach, dubbed Self-Play Preference Optimization (SPPO), utilizes iterative policy updates to provably approximate the Nash equilibrium.
arXiv Detail & Related papers (2024-05-01T17:59:20Z) - Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning [28.059563581973432]
Large Language Models (LLMs) often memorize sensitive, private, or copyrighted data during pre-training.
LLM unlearning aims to eliminate the influence of undesirable data from the pre-trained model.
We propose Negative Preference Optimization (NPO) as a simple alignment-inspired method that could efficiently unlearn a target dataset.
arXiv Detail & Related papers (2024-04-08T21:05:42Z) - Disentangling Length from Quality in Direct Preference Optimization [93.74831404396174]
Reinforcement Learning from Human Feedback (RLHF) has been a crucial component in the recent success of Large Language Models.
RLHF is known to exploit biases in human preferences, such as verbosity.
We develop a principled but simple regularization strategy that prevents length exploitation, while still maintaining improvements in model quality.
arXiv Detail & Related papers (2024-03-28T06:03:47Z) - RLVF: Learning from Verbal Feedback without Overgeneralization [94.19501420241188]
We study the problem of incorporating verbal feedback without overgeneralizing it to contexts where it does not apply.
We develop a new method, Contextualized Critiques with Constrained Preference Optimization (C3PO).
Our approach effectively applies verbal feedback to relevant scenarios while preserving existing behaviors for other contexts.
arXiv Detail & Related papers (2024-02-16T18:50:24Z) - Setting the Trap: Capturing and Defeating Backdoors in Pretrained Language Models through Honeypots [68.84056762301329]
Recent research has exposed the susceptibility of pretrained language models (PLMs) to backdoor attacks.
We propose and integrate a honeypot module into the original PLM to absorb backdoor information exclusively.
Our design is motivated by the observation that lower-layer representations in PLMs carry sufficient backdoor features.
arXiv Detail & Related papers (2023-10-28T08:21:16Z) - Low-rank Adaptation of Large Language Model Rescoring for Parameter-Efficient Speech Recognition [32.24656612803592]
We propose a neural language modeling system based on low-rank adaptation (LoRA) for speech recognition output rescoring.
We present a method based on low-rank decomposition to train a rescoring BERT model and adapt it to new domains using only a fraction of the pretrained parameters.
The proposed low-rank adaptation Rescore-BERT (LoRB) architecture is evaluated on LibriSpeech and internal datasets, with training times decreased by factors between 3.6 and 5.4.
arXiv Detail & Related papers (2023-09-26T19:41:34Z) - RAIN: Your Language Models Can Align Themselves without Finetuning [25.703729145091483]
Large language models (LLMs) often demonstrate inconsistencies with human preferences.
We show that unaligned LLMs can directly produce responses consistent with human preferences via self-boosting.
We introduce a novel inference method, Rewindable Auto-regressive INference (RAIN), that allows pre-trained LLMs to evaluate their own generation.
arXiv Detail & Related papers (2023-09-13T17:59:09Z) - Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning [14.228909822681373]
Large Language Models (LLMs) are known to memorize significant portions of their training data.
We present a novel approach which uses prompt-tuning to control the extraction rates of memorized content in LLMs.
arXiv Detail & Related papers (2023-05-19T15:45:29Z)