Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
- URL: http://arxiv.org/abs/2406.11614v2
- Date: Fri, 04 Oct 2024 11:46:20 GMT
- Title: Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces
- Authors: Yihuai Hong, Lei Yu, Haiqin Yang, Shauli Ravfogel, Mor Geva
- Abstract summary: "Unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently.
Current protocols to evaluate unlearning methods rely on behavioral tests, without monitoring the presence of associated knowledge.
We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts.
- Score: 34.00971641141313
- Abstract: The task of "unlearning" certain concepts in large language models (LLMs) has attracted immense attention recently, due to its importance in mitigating undesirable model behaviours, such as the generation of harmful, private, or incorrect information. Current protocols to evaluate unlearning methods largely rely on behavioral tests, without monitoring the presence of unlearned knowledge within the model's parameters. This residual knowledge can be adversarially exploited to recover the erased information post-unlearning. We argue that unlearning should also be evaluated internally, by considering changes in the parametric knowledge traces of the unlearned concepts. To this end, we propose a general evaluation methodology that leverages vocabulary projections to inspect concepts encoded in model parameters. We use this approach to localize "concept vectors" - parameter vectors that encode concrete concepts - and construct ConceptVectors, a benchmark dataset containing hundreds of common concepts and their parametric knowledge traces within two open-source LLMs. Evaluation on ConceptVectors shows that existing unlearning methods minimally impact concept vectors and mostly suppress them during inference, while directly ablating these vectors demonstrably removes the associated knowledge and significantly reduces the model's susceptibility to adversarial manipulation. Our results highlight limitations in behavioral-based unlearning evaluations and call for future work to include parameter-based evaluations. To support this, we release our code and benchmark at https://github.com/yihuaihong/ConceptVectors.
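As a rough illustration of the vocabulary-projection idea described in the abstract, the following Python sketch scores a parameter vector against every row of an unembedding matrix and lists the top-scoring tokens. It is a toy under stated assumptions, not the authors' implementation: the five-token vocabulary, the matrix values, and the helper names project_to_vocab, cosine, and ablate are all hypothetical, and zeroing the vector is just one simple way to ablate it.

```python
import math

# Toy vocabulary and unembedding matrix (made-up values, for illustration only).
# One row per vocabulary token, one column per hidden dimension.
VOCAB = ["wizard", "wand", "spell", "bank", "loan"]
UNEMBED = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1],
    [0.7, 0.0, 0.2],
    [0.0, 0.9, 0.1],
    [0.1, 0.8, 0.0],
]


def project_to_vocab(vector, unembed=UNEMBED, vocab=VOCAB, top_k=3):
    """Return the top_k (token, score) pairs, where each score is the dot
    product of the vector with that token's unembedding row."""
    scores = [sum(w * v for w, v in zip(row, vector)) for row in unembed]
    ranked = sorted(zip(vocab, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]


def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def ablate(vector):
    """Zero out a parameter vector (an assumed, very simple form of ablation)."""
    return [0.0 for _ in vector]


# A hypothetical "concept vector" that mostly points along the first dimension.
concept_vector = [1.0, 0.0, 0.1]
top_tokens = project_to_vocab(concept_vector)
print([token for token, _ in top_tokens])

# A parameter-level check: how much did an edit change the vector itself?
slightly_edited = [0.98, 0.02, 0.1]
print(cosine(concept_vector, slightly_edited))
print(project_to_vocab(ablate(concept_vector)))
```

In this toy setup the concept vector projects most strongly onto "wizard", "wand", and "spell", and after zeroing it every projection score is 0. Comparing a vector before and after unlearning, for example with cosine similarity, is one simple way to check whether a method changed the parameters themselves or only suppressed the behavior.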
Related papers
- Interpret the Internal States of Recommendation Model with Sparse Autoencoder [26.021277330699963]
RecSAE is an automatic, generalizable probing method for interpreting the internal states of Recommendation models.
We train an autoencoder with sparsity constraints to reconstruct internal activations of recommendation models.
We automate the construction of concept dictionaries based on the relationship between latent activations and input item sequences.
(2024-11-09T08:22:31Z)
- RESTOR: Knowledge Recovery through Machine Unlearning [71.75834077528305]
Large language models trained on web-scale corpora can memorize undesirable datapoints.
Many machine unlearning methods have been proposed that aim to 'erase' these datapoints from trained models.
We propose the RESTOR framework for machine unlearning based on the following dimensions.
(2024-10-31T20:54:35Z)
- Unlearning or Concealment? A Critical Analysis and Evaluation Metrics for Unlearning in Diffusion Models [7.9993879763024065]
We show that the objective functions used for unlearning in the existing methods lead to decoupling of the targeted concepts for the corresponding prompts.
The ineffectiveness of current methods stems primarily from their narrow focus on reducing generation probabilities for specific prompt sets.
We introduce two new evaluation metrics: Concept Retrieval Score (CRS) and Concept Confidence Score (CCS).
(2024-09-09T14:38:31Z)
- Unlearning with Control: Assessing Real-world Utility for Large Language Model Unlearning [97.2995389188179]
Recent research has begun to approach large language model (LLM) unlearning via gradient ascent (GA).
Despite their simplicity and efficiency, we suggest that GA-based methods are prone to excessive unlearning.
We propose several controlling methods that can regulate the extent of excessive unlearning.
(2024-06-13T14:41:00Z)
- Interpretable Prognostics with Concept Bottleneck Models [5.939858158928473]
Concept Bottleneck Models (CBMs) are inherently interpretable neural network architectures based on concept explanations.
CBMs enable domain experts to intervene on the concept activations at test-time.
Our case studies demonstrate that the performance of CBMs can be on par or superior to black-box models.
(2024-05-27T18:15:40Z)
- Seeking Neural Nuggets: Knowledge Transfer in Large Language Models from a Parametric Perspective [106.92016199403042]
We empirically investigate knowledge transfer from larger to smaller models through a parametric perspective.
We employ sensitivity-based techniques to extract and align knowledge-specific parameters between different large language models.
Our findings highlight the critical factors contributing to the process of parametric knowledge transfer.
(2023-10-17T17:58:34Z)
- Towards Robust Metrics for Concept Representation Evaluation [25.549961337814523]
Concept learning models have been shown to be prone to encoding impurities in their representations.
We propose novel metrics for evaluating the purity of concept representations in both approaches.
(2023-01-25T00:40:19Z)
- Evaluating Machine Unlearning via Epistemic Uncertainty [78.27542864367821]
This work presents an evaluation of Machine Unlearning algorithms based on uncertainty.
To the best of our knowledge, this is the first definition of a general evaluation.
(2022-08-23T09:37:31Z)
- Translational Concept Embedding for Generalized Compositional Zero-shot Learning [73.60639796305415]
Generalized compositional zero-shot learning means to learn composed concepts of attribute-object pairs in a zero-shot fashion.
This paper introduces a new approach, termed translational concept embedding, to solve these two difficulties in a unified framework.
(2021-12-20T21:27:51Z)
- Layer-wise Analysis of a Self-supervised Speech Representation Model [26.727775920272205]
Self-supervised learning approaches have been successful for pre-training speech representation models.
Not much has been studied about the type or extent of information encoded in the pre-trained representations themselves.
(2021-07-10T02:13:25Z)
- Prototypical Contrastive Learning of Unsupervised Representations [171.3046900127166]
Prototypical Contrastive Learning (PCL) is an unsupervised representation learning method.
PCL implicitly encodes semantic structures of the data into the learned embedding space.
PCL outperforms state-of-the-art instance-wise contrastive learning methods on multiple benchmarks.
(2020-05-11T09:53:36Z)