Large Language Models Relearn Removed Concepts
- URL: http://arxiv.org/abs/2401.01814v1
- Date: Wed, 3 Jan 2024 16:15:57 GMT
- Title: Large Language Models Relearn Removed Concepts
- Authors: Michelle Lo, Shay B. Cohen, Fazl Barez
- Abstract summary: We evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining.
Our findings reveal that models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics.
- Score: 21.733308901113137
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Advances in model editing through neuron pruning hold promise for removing
undesirable concepts from large language models. However, it remains unclear
whether models have the capacity to reacquire pruned concepts after editing. To
investigate this, we evaluate concept relearning in models by tracking concept
saliency and similarity in pruned neurons during retraining. Our findings
reveal that models can quickly regain performance post-pruning by relocating
advanced concepts to earlier layers and reallocating pruned concepts to primed
neurons with similar semantics. This demonstrates that models exhibit
polysemantic capacities and can blend old and new concepts in individual
neurons. While neuron pruning provides interpretability into model concepts,
our results highlight the challenges of permanent concept removal for improved
model safety. Monitoring concept reemergence and developing techniques
to mitigate relearning of unsafe concepts will be important directions for more
robust model editing. Overall, our work strongly demonstrates the resilience
and fluidity of concept representations in LLMs after concept removal.
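
The tracking procedure described in the abstract lends itself to a compact illustration. Below is a minimal sketch, assuming a simple correlation-based saliency metric (the paper's exact saliency and similarity measures may differ): score each neuron's saliency for a concept, prune the most salient neurons, then re-measure saliency after retraining to detect whether the concept has been relocated.

```python
import numpy as np

def concept_saliency(acts, concept_labels):
    """Per-neuron saliency for a concept: absolute correlation between each
    neuron's activation and a binary concept label over a probe dataset.
    (Illustrative metric; the paper's saliency definition may differ.)

    acts:           (n_samples, n_neurons) activations from one layer
    concept_labels: (n_samples,) 1 if the concept is present, else 0
    """
    acts = (acts - acts.mean(0)) / (acts.std(0) + 1e-8)
    labels = (concept_labels - concept_labels.mean()) / (concept_labels.std() + 1e-8)
    return np.abs(acts.T @ labels) / len(labels)

def prune_top_neurons(weights, saliency, k):
    """Zero the outgoing weights of the k most concept-salient neurons."""
    idx = np.argsort(saliency)[-k:]
    weights[idx, :] = 0.0
    return idx

# Usage: after retraining, recompute saliency and check whether the pruned
# concept has reappeared in other (possibly earlier-layer) neurons:
# sal_before = concept_saliency(acts_layer, labels)
# pruned = prune_top_neurons(W_out, sal_before, k=50)
# ... retrain the model ...
# sal_after = concept_saliency(acts_layer_retrained, labels)
```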
Related papers
- How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization?
We propose a novel Concept-Incremental text-to-image Diffusion Model (CIDM).
It resolves catastrophic forgetting and concept neglect so that new customization tasks can be learned in a concept-incremental manner.
Experiments validate that our CIDM surpasses existing custom diffusion models.
arXiv Detail & Related papers (2024-10-23T06:47:29Z)
- Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
We introduce Reliable and Efficient Concept Erasure (RECE), a novel approach that modifies the model in 3 seconds without necessitating additional fine-tuning.
To mitigate inappropriate content potentially represented by derived embeddings, RECE aligns them with harmless concepts in cross-attention layers (a sketch of such a closed-form edit follows this entry).
The derivation and erasure of new representation embeddings are conducted iteratively to achieve a thorough erasure of inappropriate concepts.
arXiv Detail & Related papers (2024-07-17T08:04:28Z)
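
A hedged sketch of a closed-form cross-attention edit in the spirit of the entry above, not RECE's exact derivation: the unsafe concept embedding is remapped to where a harmless anchor currently maps, by solving a regularized least-squares problem (the names, objective, and regularizer are illustrative assumptions).

```python
import numpy as np

def closed_form_concept_edit(W, c_src, c_tgt, lam=0.1):
    """Edit a cross-attention projection so the unsafe embedding c_src is
    mapped to where the harmless anchor c_tgt used to be mapped.

    Solves  min_W' ||W' c_src - W c_tgt||^2 + lam * ||W' - W||_F^2,
    which has the closed-form solution below (no fine-tuning needed).

    W:     (d_out, d_in) key/value projection weight
    c_src: (d_in,) embedding of the concept to erase
    c_tgt: (d_in,) embedding of the harmless anchor concept
    """
    d_in = W.shape[1]
    target = W @ c_tgt                            # where the anchor maps now
    A = np.outer(c_src, c_src) + lam * np.eye(d_in)
    B = np.outer(target, c_src) + lam * W
    return B @ np.linalg.inv(A)
```

RECE additionally derives new embeddings that can still regenerate the erased concept and repeats this kind of edit on them; that iterative derivation is omitted here.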
- Concept Bottleneck Models Without Predefined Concepts
We introduce an input-dependent concept selection mechanism that ensures only a small subset of concepts is used across all classes (a minimal sketch follows this entry).
We show that our approach improves downstream performance and narrows the performance gap to black-box models.
arXiv Detail & Related papers (2024-07-04T13:34:50Z)
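
A minimal sketch of input-dependent concept selection as described in the entry above; the linear scorer and hard top-k gate are assumptions for illustration, and the paper's learned selection mechanism may differ.

```python
import numpy as np

def select_concepts(features, concept_scorer_W, k=8):
    """Score all candidate concepts from the input's features, then keep
    only the top-k per input (hypothetical gating mechanism).

    features:         (d,) image embedding
    concept_scorer_W: (n_concepts, d) linear scorer over concepts
    """
    scores = concept_scorer_W @ features          # relevance of each concept
    mask = np.zeros_like(scores)
    mask[np.argsort(scores)[-k:]] = 1.0           # hard top-k gate
    return scores * mask                          # sparse concept activations

# The downstream class predictor only ever sees k active concepts per input,
# which keeps explanations short without fixing a concept set in advance.
```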
- Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient
This paper proposes a concept domain correction framework for unlearning concepts in diffusion models.
By aligning the output domains of sensitive concepts and anchor concepts through adversarial training, we enhance the generalizability of the unlearning results (see the sketch after this entry).
arXiv Detail & Related papers (2024-05-24T07:47:36Z)
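
A toy sketch of the adversarial domain-alignment idea from the entry above, with hypothetical stand-in modules: the real method operates on a diffusion denoiser and also uses a concept-preserving gradient, both omitted here.

```python
import torch
import torch.nn as nn

# All modules and embeddings below are illustrative stand-ins: `model` plays
# the role of the diffusion model's output, reduced to a linear map.
model = nn.Linear(512, 512)
D = nn.Sequential(nn.Linear(512, 128), nn.ReLU(), nn.Linear(128, 1))

emb_sensitive = torch.randn(512)   # embedding of the concept to unlearn
emb_anchor = torch.randn(512)      # embedding of a harmless anchor concept
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(model.parameters(), lr=1e-5)
bce = nn.BCEWithLogitsLoss()
batch = 16

for step in range(100):
    noise = torch.randn(batch, 512)
    out_sens = model(noise + emb_sensitive)  # outputs under the sensitive concept
    out_anch = model(noise + emb_anchor)     # outputs under the anchor concept

    # 1) The discriminator learns to tell the two output domains apart.
    d_loss = (bce(D(out_sens.detach()), torch.ones(batch, 1))
              + bce(D(out_anch.detach()), torch.zeros(batch, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2) The model is updated so sensitive-concept outputs fool the
    #    discriminator, collapsing the sensitive domain onto the anchor domain.
    g_loss = bce(D(model(noise + emb_sensitive)), torch.zeros(batch, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```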
- Improving Intervention Efficacy via Concept Realignment in Concept Bottleneck Models
Concept Bottleneck Models (CBMs) ground image classification in human-understandable concepts to allow for interpretable model decisions.
Existing approaches often require numerous human interventions per image to achieve strong performance.
We introduce a trainable concept realignment intervention module, which leverages concept relations to realign concept assignments post-intervention (sketched below).
arXiv Detail & Related papers (2024-05-02T17:59:01Z)
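
A hypothetical sketch of a concept realignment module of the kind described in the entry above; the architecture and masking scheme are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn

class ConceptRealigner(nn.Module):
    """After a human corrects a few concept predictions, propagate that
    correction to related concepts (illustrative module)."""
    def __init__(self, n_concepts, hidden=64):
        super().__init__()
        # Input: current concept scores + binary mask of intervened concepts.
        self.net = nn.Sequential(
            nn.Linear(2 * n_concepts, hidden), nn.ReLU(),
            nn.Linear(hidden, n_concepts))

    def forward(self, concepts, intervened_mask):
        delta = self.net(torch.cat([concepts, intervened_mask], dim=-1))
        realigned = concepts + delta
        # Keep the human-provided values fixed; realign only the rest.
        return torch.where(intervened_mask.bool(), concepts, realigned)

# One intervention on a correlated concept can now fix several other concept
# predictions at once, reducing the interventions needed per image.
```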
- Separable Multi-Concept Erasure from Diffusion Models
We propose a Separable Multi-concept Eraser (SepME) to eliminate unsafe concepts from large-scale diffusion models.
SepME separates optimizable model weights so that each weight increment corresponds to the erasure of a specific concept (see the sketch after this entry).
Extensive experiments indicate the efficacy of our approach in eliminating concepts, preserving model performance, and offering flexibility in the erasure or recovery of various concepts.
arXiv Detail & Related papers (2024-02-03T11:10:57Z)
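
One way to read "separable" erasure, sketched under the assumption that each erased concept corresponds to its own additive weight increment; the optimization that produces each increment is omitted.

```python
import numpy as np

class SeparableEraser:
    """Keep one additive weight increment per erased concept (illustrative
    reading of separable erasure). Because increments are stored separately,
    any single concept can be erased or restored without touching the rest."""
    def __init__(self, base_weights):
        self.base = base_weights
        self.increments = {}            # concept name -> weight delta

    def erase(self, concept, delta):
        self.increments[concept] = delta

    def restore(self, concept):
        self.increments.pop(concept, None)

    def effective_weights(self):
        w = self.base.copy()
        for delta in self.increments.values():
            w += delta
        return w

# eraser = SeparableEraser(np.zeros((4, 4)))
# eraser.erase("nudity", delta_nudity); eraser.restore("nudity")
```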
- Knowledge-Aware Neuron Interpretation for Scene Classification
We propose a knowledge-aware neuron interpretation framework to explain model predictions for image scene classification.
For concept completeness, we derive core concepts of a scene from the ConceptNet knowledge graph to gauge how completely a neuron's concepts cover the scene.
For concept fusion, we introduce a knowledge graph-based method called Concept Filtering, which yields a gain of over 23 percentage points on neuron-behavior metrics for neuron interpretation (sketched below).
arXiv Detail & Related papers (2024-01-29T01:00:17Z)
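
A toy sketch of knowledge-graph-based concept filtering as in the entry above; the graph below is a stand-in for ConceptNet queries, and the relatedness test is deliberately simplistic.

```python
# Keep only the neuron-associated concepts that are related (here: direct
# neighbors in a toy graph) to the scene's core concepts.
toy_conceptnet = {
    "kitchen": {"stove", "sink", "refrigerator", "cabinet"},
    "beach": {"sand", "sea", "umbrella", "wave"},
}

def filter_concepts(neuron_concepts, scene):
    """Drop concepts with no relation to the scene's core concepts."""
    core = toy_conceptnet.get(scene, set())
    return [c for c in neuron_concepts if c in core]

print(filter_concepts(["stove", "wave", "sink"], "kitchen"))
# -> ['stove', 'sink']  (the unrelated 'wave' is filtered out)
```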
- Concept Distillation: Leveraging Human-Centered Explanations for Model Improvement
Concept Activation Vectors (CAVs) estimate a model's sensitivity and possible biases to a given concept.
We extend CAVs from post-hoc analysis to ante-hoc training in order to reduce model bias through fine-tuning (a CAV sketch follows this entry).
We show applications of concept-sensitive training to debias several classification problems.
arXiv Detail & Related papers (2023-11-26T14:00:14Z)
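
For context, here is the standard TCAV-style CAV construction; the paper's ante-hoc, concept-sensitive training loss built on top of such vectors is omitted, and the inputs are assumed to be activations from a chosen layer of the model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def compute_cav(acts_concept, acts_random):
    """A CAV is the normal to a linear boundary separating activations on
    concept examples from activations on random examples."""
    X = np.vstack([acts_concept, acts_random])
    y = np.concatenate([np.ones(len(acts_concept)), np.zeros(len(acts_random))])
    clf = LogisticRegression(max_iter=1000).fit(X, y)
    v = clf.coef_[0]
    return v / np.linalg.norm(v)

def concept_sensitivity(grad_logit_wrt_acts, cav):
    """Directional derivative of a class logit along the CAV: positive values
    mean the concept pushes the prediction toward that class."""
    return grad_logit_wrt_acts @ cav
```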
- Interpreting Pretrained Language Models via Concept Bottlenecks
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks.
The lack of interpretability due to their "black-box" nature poses challenges for responsible implementation.
We propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans.
arXiv Detail & Related papers (2023-11-08T20:41:18Z)
- A Recursive Bateson-Inspired Model for the Generation of Semantic Formal Concepts from Spatial Sensory Data
This paper presents a new symbolic-only method for the generation of hierarchical concept structures from complex sensory data.
The approach is based on Bateson's notion of difference as the key to the genesis of an idea or a concept.
The model is able to produce fairly rich yet human-readable conceptual representations without training.
arXiv Detail & Related papers (2023-07-16T15:59:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.