Related papers: Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

URL: http://arxiv.org/abs/2406.11614v2
Date: Fri, 04 Oct 2024 11:46:20 GMT
Title: Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
Authors: Yihuai Hong, Lei Yu, Haiqin Yang, Shauli Ravfogel, Mor Geva,
Abstract summary: "Unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently. Current protocols to evaluate unlearning methods rely on behavioral tests, without monitoring the presence of associated knowledge. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts.
Score: 34.00971641141313
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance in mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize "concept vectors" - parameter vectors that encode concrete concepts - and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors and mostly suppress them during inference, while directly ablating these vectors demonstrably removes the associated knowledge and significantly reduces the model's susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.

Related papers

Efficient Machine Unlearning via Influence Approximation [75.31015485113993]
Influence-based unlearning has emerged as a prominent approach to estimate the impact of individual training samples on model parameters without retraining.<n>This paper establishes a theoretical link between memorizing (incremental learning) and forgetting (unlearning)<n>We introduce the Influence Approximation Unlearning algorithm for efficient machine unlearning from the incremental perspective.
arXiv Detail & Related papers (2025-07-31T05:34:27Z)
Existing Large Language Model Unlearning Evaluations Are Inconclusive [105.55899615056573]
We show that some evaluations introduce substantial new information into the model, potentially masking true unlearning performance.<n>We demonstrate that evaluation outcomes vary significantly across tasks, undermining the generalizability of current evaluation routines.<n>We propose two principles for future unlearning evaluations: minimal information injection and downstream task awareness.
arXiv Detail & Related papers (2025-05-31T19:43:00Z)
Are We Truly Forgetting? A Critical Re-examination of Machine Unlearning Evaluation Protocols [14.961054239793356]
We conduct a new comprehensive evaluation that employs representation-based evaluations of unlearned model under large-scale scenarios. Our analysis reveals that current state-of-the-art unlearning approaches either completely degrade the representational quality of the unlearned model. We introduce a novel unlearning evaluation setup from a transfer learning perspective, in which the forget set classes exhibit semantic similarity to downstream task classes.
arXiv Detail & Related papers (2025-03-10T07:11:34Z)
Erasing Without Remembering: Implicit Knowledge Forgetting in Large Language Models [70.78205685001168]
We investigate knowledge forgetting in large language models with a focus on its generalisation.<n> UGBench is the first benchmark specifically designed to assess the unlearning of in-scope implicit knowledge.<n>We propose PerMU, a novel probability-based unlearning paradigm.
arXiv Detail & Related papers (2025-02-27T11:03:33Z)
ReLearn: Unlearning via Learning for Large Language Models [64.2802606302194]
We propose ReLearn, a data augmentation and fine-tuning pipeline for effective unlearning. This framework introduces Knowledge Forgetting Rate (KFR) and Knowledge Retention Rate (KRR) to measure knowledge-level preservation. Our experiments show that ReLearn successfully achieves targeted forgetting while preserving high-quality output.
arXiv Detail & Related papers (2025-02-16T16:31:00Z)
Sample-efficient Learning of Concepts with Theoretical Guarantees: from Data to Concepts without Interventions [7.3784937557132855]
Concept-based models (CBM) learn interpretable concepts from high-dimensional data, e.g. images, which are used to predict labels. An important issue in CBMs is concept leakage, i.e., spurious information in the learned concepts, which effectively leads to learning "wrong" concepts. We describe a framework that provides theoretical guarantees on the correctness of the learned concepts and on the number of required labels.
arXiv Detail & Related papers (2025-02-10T15:01:56Z)
Interpret the Internal States of Recommendation Model with Sparse Autoencoder [26.021277330699963]
RecSAE is an automatic, generalizable probing method for interpreting the internal states of Recommendation models. We train an autoencoder with sparsity constraints to reconstruct internal activations of recommendation models. We automated the construction of concept dictionaries based on the relationship between latent activations and input item sequences.
arXiv Detail & Related papers (2024-11-09T08:22:31Z)
RESTOR: Knowledge Recovery through Machine Unlearning [71.75834077528305]
Large language models trained on web-scale corpora can memorize undesirable datapoints. Many machine unlearning methods have been proposed that aim to 'erase' these datapoints from trained models. We propose the RESTOR framework for machine unlearning based on the following dimensions.
arXiv Detail & Related papers (2024-10-31T20:54:35Z)
Erasing Conceptual Knowledge from Language Models [24.63143961814566]
Erasure of Language Memory (ELM) is an approach for concept-level unlearning built on the principle of matching the distribution defined by an introspective classifier. ELM applies this framework to create targeted low-rank updates that reduce generation probabilities for concept-specific content. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks.
arXiv Detail & Related papers (2024-10-03T17:59:30Z)
Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models [7.9993879763024065]
We show that the objective functions used for unlearning in the existing methods lead to decoupling of the targeted concepts for the corresponding prompts. The ineffectiveness of current methods stems primarily from their narrow focus on reducing generation probabilities for specific prompt sets. We introduce two new evaluation metrics: Concept Retrieval Score (CRS) and Concept Confidence Score (CCS)
arXiv Detail & Related papers (2024-09-09T14:38:31Z)
Learn while Unlearn: An Iterative Unlearning Framework for Generative Language Models [52.03511469562013]
We introduce the Iterative Contrastive Unlearning (ICU) framework, which consists of three core components. A Knowledge Unlearning Induction module targets specific knowledge for removal using an unlearning loss. A Contrastive Learning Enhancement module preserves the model's expressive capabilities against the pure unlearning goal. An Iterative Unlearning Refinement module dynamically adjusts the unlearning process through ongoing evaluation and updates.
arXiv Detail & Related papers (2024-07-25T07:09:35Z)
Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning [97.2995389188179]
Recent research has begun to approach large language models (LLMs) unlearning via gradient ascent (GA) Despite their simplicity and efficiency, we suggest that GA-based methods face the propensity towards excessive unlearning. We propose several controlling methods that can regulate the extent of excessive unlearning.
arXiv Detail & Related papers (2024-06-13T14:41:00Z)
Interpretable Prognostics with Concept Bottleneck Models [5.939858158928473]
Concept Bottleneck Models (CBMs) are inherently interpretable neural network architectures based on concept explanations. CBMs enable domain experts to intervene on the concept activations at test-time. Our case studies demonstrate that the performance of CBMs can be on par or superior to black-box models.
arXiv Detail & Related papers (2024-05-27T18:15:40Z)
Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective [106.92016199403042]
We empirically investigate knowledge transfer from larger to smaller models through a parametric perspective. We employ sensitivity-based techniques to extract and align knowledge-specific parameters between different large language models. Our findings highlight the critical factors contributing to the process of parametric knowledge transfer.
arXiv Detail & Related papers (2023-10-17T17:58:34Z)
Towards Robust Metrics for Concept Representation Evaluation [25.549961337814523]
Concept learning models have been shown to be prone to encoding impurities in their representations. We propose novel metrics for evaluating the purity of concept representations in both approaches.
arXiv Detail & Related papers (2023-01-25T00:40:19Z)
Evaluating Machine Unlearning via Epistemic Uncertainty [78.27542864367821]
This work presents an evaluation of Machine Unlearning algorithms based on uncertainty. This is the first definition of a general evaluation of our best knowledge.
arXiv Detail & Related papers (2022-08-23T09:37:31Z)
Translational Concept Embedding for Generalized Compositional Zero-shot Learning [73.60639796305415]
Generalized compositional zero-shot learning means to learn composed concepts of attribute-object pairs in a zero-shot fashion. This paper introduces a new approach, termed translational concept embedding, to solve these two difficulties in a unified framework.
arXiv Detail & Related papers (2021-12-20T21:27:51Z)
Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models. Not much has been studied about the type or extent of information encoded in the pre-trained representations themselves.
arXiv Detail & Related papers (2021-07-10T02:13:25Z)
Prototypical Contrastive Learning of Unsupervised Representations [171.3046900127166]
Prototypical Contrastive Learning (PCL) is an unsupervised representation learning method. PCL implicitly encodes semantic structures of the data into the learned embedding space. PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks.
arXiv Detail & Related papers (2020-05-11T09:53:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.