Related papers: Does Unlearning Truly Unlearn? A Black Box Evaluation of LLM Unlearning Methods

Related papers

PULSE: Practical Evaluation Scenarios for Large Multimodal Model Unlearning [27.16106173526184]
We introduce PULSE protocol for realistic unlearning scenarios for LMMs.<n>We then evaluate existing unlearning methods along these dimensions.<n>Our results reveal that, although some techniques can successfully unlearn knowledge acquired through fine-tuning, they struggle to eliminate information learned during pre-training.
arXiv Detail & Related papers (2025-07-02T01:13:08Z)
Large Language Model Unlearning for Source Code [65.42425213605114]
PROD is a novel unlearning approach that enables LLMs to forget undesired code content while preserving their code generation capabilities.<n>Our evaluation demonstrates that PROD achieves superior balance between forget quality and model utility compared to existing unlearning approaches.
arXiv Detail & Related papers (2025-06-20T16:27:59Z)
BLUR: A Benchmark for LLM Unlearning Robust to Forget-Retain Overlap [18.68387394444096]
Machine unlearning has the potential to improve the safety of large language models (LLMs) by removing sensitive or harmful information post hoc.<n>A key challenge in unlearning involves balancing between forget quality (effectively unlearning undesirable information) and retain quality (maintaining good performance on other, general tasks)<n>We present $textttBLUR$: a benchmark for LLM unlearning that provides more realistic scenarios of forget-retain overlap.
arXiv Detail & Related papers (2025-05-28T22:09:04Z)
Effective LLM Knowledge Learning via Model Generalization [73.16975077770765]
Large language models (LLMs) are trained on enormous documents that contain extensive world knowledge. It is still not well-understood how knowledge is acquired via autoregressive pre-training. In this paper, we focus on understanding and improving LLM knowledge learning.
arXiv Detail & Related papers (2025-03-05T17:56:20Z)
S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference. Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z)
Agents Are All You Need for LLM Unlearning [9.934258340998047]
textttALU is a multi-agent, retrain-free, model-agnostic approach to LLM unlearning.<n>textttALU consistently stands out as the most robust inference-time LLM unlearning framework.<n>textttALU is assessed on up to 1000 unlearning targets, exceeding the evaluation scope of all previously proposed LLM unlearning methods.
arXiv Detail & Related papers (2025-02-01T11:45:44Z)
Multi-Objective Large Language Model Unlearning [3.372396620898397]
Gradient Ascent (GA) is a proactive way to decrease the prediction probability of the model on the target data. We propose Multi-Objective Large Language Model Unlearning (MOLLM) algorithm to overcome gradient explosion and catastrophic forgetting. Our empirical results verify that MoLLM outperforms the SOTA GA-based LLM unlearning methods in terms of unlearning effect and model utility preservation.
arXiv Detail & Related papers (2024-12-29T09:35:56Z)
Catastrophic Failure of LLM Unlearning via Quantization [36.524827594501495]
We show that applying quantization to models that have undergone unlearning can restore the "forgotten" information. We find that for unlearning methods with utility constraints, the unlearned model retains an average of 21% of the intended forgotten knowledge in full precision.
arXiv Detail & Related papers (2024-10-21T19:28:37Z)
A Closer Look at Machine Unlearning for Large Language Models [46.245404272612795]
Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. We discuss several issues in machine unlearning for LLMs and provide our insights on possible approaches.
arXiv Detail & Related papers (2024-10-10T16:56:05Z)
Position: LLM Unlearning Benchmarks are Weak Measures of Progress [31.957968729934745]
We find that existing benchmarks provide an overly optimistic and potentially misleading view on the effectiveness of candidate unlearning methods. We identify that existing benchmarks are particularly vulnerable to modifications that introduce even loose dependencies between the forget and retain information.
arXiv Detail & Related papers (2024-10-03T18:07:25Z)
MEOW: MEMOry Supervised LLM Unlearning Via Inverted Facts [29.593170782882563]
Large Language Models (LLMs) can memorize sensitive information, raising concerns about potential misuse. Previous practices face three key challenges: Utility, efficiency, and robustness. We propose MEOW, a gradient descent-based unlearning method.
arXiv Detail & Related papers (2024-09-18T09:55:48Z)
LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement [93.38736019287224]
"LLMs-as-Instructors" framework autonomously enhances the training of smaller target models. Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model. Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z)
Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs [18.629717934007513]
"SPlit, UNlearn, MerGE" (SPUNGE) is a framework that can be used with any unlearning method to amplify its effectiveness. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs.
arXiv Detail & Related papers (2024-06-17T17:35:52Z)
Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning [97.2995389188179]
Recent research has begun to approach large language models (LLMs) unlearning via gradient ascent (GA) Despite their simplicity and efficiency, we suggest that GA-based methods face the propensity towards excessive unlearning. We propose several controlling methods that can regulate the extent of excessive unlearning.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
Offset Unlearning for Large Language Models [49.851093293780615]
Unlearning has emerged as a potential remedy for Large Language Models affected by problematic training data. We propose $delta$-unlearning, an offset unlearning framework for black-box LLMs. Experiments demonstrate that $delta$-unlearning can effectively unlearn target data while maintaining similar or even stronger performance on general out-of-forget-scope tasks.
arXiv Detail & Related papers (2024-04-17T03:39:51Z)
Alpaca against Vicuna: Using LLMs to Uncover Memorization of LLMs [61.04246774006429]
We introduce a black-box prompt optimization method that uses an attacker LLM agent to uncover higher levels of memorization in a victim agent. We observe that our instruction-based prompts generate outputs with 23.7% higher overlap with training data compared to the baseline prefix-suffix measurements. Our findings show that instruction-tuned models can expose pre-training data as much as their base-models, if not more so, and using instructions proposed by other LLMs can open a new avenue of automated attacks.
arXiv Detail & Related papers (2024-03-05T19:32:01Z)
Querying Easily Flip-flopped Samples for Deep Active Learning [63.62397322172216]
Active learning is a machine learning paradigm that aims to improve the performance of a model by strategically selecting and querying unlabeled data. One effective selection strategy is to base it on the model's predictive uncertainty, which can be interpreted as a measure of how informative a sample is. This paper proposes the it least disagree metric (LDM) as the smallest probability of disagreement of the predicted label.
arXiv Detail & Related papers (2024-01-18T08:12:23Z)
Supervised Knowledge Makes Large Language Models Better In-context Learners [94.89301696512776]
Large Language Models (LLMs) exhibit emerging in-context learning abilities through prompt engineering. The challenge of improving the generalizability and factuality of LLMs in natural language understanding and question answering remains under-explored. We propose a framework that enhances the reliability of LLMs as it: 1) generalizes out-of-distribution data, 2) elucidates how LLMs benefit from discriminative models, and 3) minimizes hallucinations in generative tasks.
arXiv Detail & Related papers (2023-12-26T07:24:46Z)
Language models are weak learners [71.33837923104808]
We show that prompt-based large language models can operate effectively as weak learners. We incorporate these models into a boosting approach, which can leverage the knowledge within the model to outperform traditional tree-based boosting. Results illustrate the potential for prompt-based LLMs to function not just as few-shot learners themselves, but as components of larger machine learning pipelines.
arXiv Detail & Related papers (2023-06-25T02:39:19Z)

This list is automatically generated from the titles and abstracts of the papers in this site.