How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models
- URL: http://arxiv.org/abs/2506.18428v1
- Date: Mon, 23 Jun 2025 09:10:29 GMT
- Title: How Robust is Model Editing after Fine-Tuning? An Empirical Study on Text-to-Image Diffusion Models
- Authors: Feng He, Zhenyang Liu, Marco Valentino, Zhixue Zhao
- Abstract summary: We investigate the interaction between model editing and fine-tuning in the context of T2I diffusion models. Our findings reveal a trend: edits generally fail to persist through fine-tuning, even when fine-tuning is tangential or unrelated to the edits. These findings highlight the need for more robust techniques to ensure reliable long-term control and alignment of deployed AI systems.
- Score: 7.342540592387184
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Model editing offers a low-cost technique to inject or correct a particular behavior in a pre-trained model without extensive retraining, supporting applications such as factual correction and bias mitigation. Despite this common practice, it remains unknown whether edits persist after fine-tuning or whether they are inadvertently reversed. This question has fundamental practical implications. For example, if fine-tuning removes prior edits, it could serve as a defence mechanism against hidden malicious edits. Conversely, the unintended removal of edits related to bias mitigation could pose serious safety concerns. We systematically investigate the interaction between model editing and fine-tuning in the context of T2I diffusion models, which are known to exhibit biases and generate inappropriate content. Our study spans two T2I model families (Stable Diffusion and FLUX), two state-of-the-art editing techniques, and three fine-tuning methods (DreamBooth, LoRA, and DoRA). Through an extensive empirical analysis across diverse editing tasks and evaluation metrics, our findings reveal a trend: edits generally fail to persist through fine-tuning, even when fine-tuning is tangential or unrelated to the edits. Notably, we observe that DoRA exhibits the strongest edit reversal effect. At the same time, among editing methods, UCE demonstrates greater robustness, retaining significantly higher efficacy post-fine-tuning compared to ReFACT. These findings highlight a crucial limitation in current editing methodologies, emphasizing the need for more robust techniques to ensure reliable long-term control and alignment of deployed AI systems. The results have dual implications for AI safety: they suggest that fine-tuning could serve as a remediation mechanism for malicious edits, while simultaneously highlighting the need for re-editing after fine-tuning to maintain beneficial safety and alignment properties.
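The abstract does not spell out how edit persistence is measured, so the following is only a rough sketch of one plausible way to quantify it: score images generated from the same prompt by the edited model before and after fine-tuning against the edit's target prompt with CLIP. The CLIP checkpoint, the helper function, and the persistence-gap idea are illustrative assumptions, not the paper's protocol.

```python
# Illustrative sketch (not the paper's exact protocol): quantify edit persistence
# by scoring generated images against the edit's target prompt with CLIP.
# Assumes two image sets generated from the same prompt: one from the edited
# model, one from the edited model after LoRA/DoRA/DreamBooth fine-tuning.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_alignment(images: list[Image.Image], target_prompt: str) -> float:
    """Mean cosine similarity between the image embeddings and the target prompt."""
    inputs = processor(text=[target_prompt], images=images,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()

# edited_images / finetuned_images would come from the edited pipeline before and
# after fine-tuning; a large drop in alignment suggests the edit was reversed.
# persistence_gap = clip_alignment(edited_images, t) - clip_alignment(finetuned_images, t)
```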
Related papers
- Tracing and Reversing Rank-One Model Edits [5.260519479124422]
This work investigates the traceability and reversibility of knowledge edits, focusing on the widely used Rank-One Model Editing (ROME) method.
We show that ROME introduces distinctive distributional patterns in the edited weight matrices, which can serve as effective signals for locating the edited weights.
We propose a method to infer the edited object entity directly from the modified weights, without access to the editing prompt, achieving over 95% accuracy.
arXiv Detail & Related papers (2025-05-27T07:27:01Z)
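As a toy illustration of why the rank-one (ROME-style) edits discussed above are traceable, the sketch below applies a synthetic rank-one update to a random weight matrix and shows that the resulting weight difference is dominated by a single singular component. The dimensions and matrices are made up; this is not the detection method proposed in that paper.

```python
# Toy illustration (not ROME itself): a rank-one edit W' = W + u v^T leaves a
# weight difference whose singular-value spectrum is dominated by one component,
# which is what makes such edits easy to locate in the edited matrix.
import torch

torch.manual_seed(0)
d_out, d_in = 1024, 4096             # shape of a hypothetical MLP projection
W = torch.randn(d_out, d_in) * 0.02  # stand-in for a pre-trained weight matrix

u = torch.randn(d_out, 1)            # "value" direction written by the edit
v = torch.randn(d_in, 1)             # "key" direction selecting the edited subject
W_edited = W + u @ v.T               # rank-one update

delta = W_edited - W
s = torch.linalg.svdvals(delta)
print(f"top singular value: {s[0].item():.3f}, second: {s[1].item():.3e}")  # second is ~0
print(f"rank-one energy share: {(s[0]**2 / s.pow(2).sum()).item():.6f}")
```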
- Tuning-Free Image Editing with Fidelity and Editability via Unified Latent Diffusion Model [60.82962950960996]
We introduce UnifyEdit, a tuning-free method that performs diffusion latent optimization.
We develop two attention-based constraints: a self-attention (SA) preservation constraint for structural fidelity, and a cross-attention (CA) alignment constraint to enhance text alignment.
Our approach achieves a robust balance between structure preservation and text alignment across various editing tasks, outperforming other state-of-the-art methods.
arXiv Detail & Related papers (2025-04-08T01:02:50Z)
- The Mirage of Model Editing: Revisiting Evaluation in the Wild [70.17413507444704]
We introduce QAEdit, a new benchmark aligned with widely used question answering (QA) datasets, and WILD, a task-agnostic evaluation framework.
Our single editing experiments show that current editing methods perform substantially worse than previously reported.
arXiv Detail & Related papers (2025-02-16T15:57:55Z)
- Better Call SAUL: Fluent and Consistent Language Model Editing with Generation Regularization [48.07144492109635]
Large language models need to be updated regularly.
Model editing is challenging as it might also affect knowledge that is unrelated to the new data.
We propose SAUL, a streamlined model editing method that uses sentence concatenation with augmented random facts for generation regularization.
arXiv Detail & Related papers (2024-10-03T12:28:13Z)
- Potential and Challenges of Model Editing for Social Debiasing [20.186721346693577]
Large language models (LLMs) trained on vast corpora suffer from inevitable stereotype biases.
Mitigating these biases with fine-tuning could be both costly and data-hungry.
Model editing methods, which focus on modifying LLMs in a post-hoc manner, are of great potential to address debiasing.
arXiv Detail & Related papers (2024-02-21T01:35:26Z)
- Model Editing by Standard Fine-Tuning [9.344592764040964]
We show that standard fine-tuning alone can yield competitive model editing performance with two minor modifications.
First, we optimize the conditional likelihood rather than the full likelihood.
Second, in addition to the typical practice of training on randomly paraphrased edit prompts to encourage generalization, we also train on random or similar unedited facts to encourage locality.
arXiv Detail & Related papers (2024-02-16T21:10:33Z)
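A minimal sketch of the "conditional likelihood" modification described in the entry above: fine-tune on an edit while masking the prompt tokens out of the loss, so only log p(target | prompt) is optimized rather than the full sequence likelihood. The GPT-2 model and the example edit are placeholders, not the paper's setup.

```python
# Sketch of conditional-likelihood fine-tuning for an edit ("prompt" -> "target"):
# prompt positions are excluded from the loss via the standard -100 label mask.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt, target = "The Eiffel Tower is located in", " Rome."  # illustrative edit request
prompt_ids = tok(prompt, return_tensors="pt").input_ids
full_ids = tok(prompt + target, return_tensors="pt").input_ids

labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100   # ignore prompt positions in the loss

out = model(input_ids=full_ids, labels=labels)
out.loss.backward()                        # one conditional-likelihood edit step
# The locality trick mentioned above would additionally mix in unedited facts
# as extra training examples so that unrelated knowledge is preserved.
```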
- The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse [58.0132400208411]
Even a single edit can trigger model collapse, manifesting as significant performance degradation in various benchmark tasks.
Benchmarking large language models after each edit is impractically time-consuming and resource-intensive.
We use GPT-3.5 to construct a new dataset, HardEdit, based on hard editing cases.
arXiv Detail & Related papers (2024-02-15T01:50:38Z)
- Model Editing Harms General Abilities of Large Language Models: Regularization to the Rescue [122.20016030723043]
We evaluate the side effects of model editing on large language models (LLMs).
Our analysis reveals that the side effects are caused by model editing altering the original model weights excessively.
To mitigate this, a method named RECT is proposed to regularize the edit update weights.
arXiv Detail & Related papers (2024-01-09T18:03:15Z)
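As a generic illustration of regularizing an edit update in the spirit of the entry above (not RECT's exact formulation), the sketch below keeps only the largest-magnitude entries of the update and reverts the rest to the original weights, limiting how far editing drags the model from its pre-trained state.

```python
# Generic illustration (not RECT's exact method): constrain an edit update
# delta = W_edited - W_orig by keeping only its top-magnitude entries.
import torch

def regularize_update(W_orig: torch.Tensor, W_edited: torch.Tensor,
                      keep_ratio: float = 0.02) -> torch.Tensor:
    """Apply only the top `keep_ratio` fraction of the update (by absolute
    magnitude); all other entries revert to the original weights."""
    delta = W_edited - W_orig
    k = max(1, int(keep_ratio * delta.numel()))
    threshold = delta.abs().flatten().kthvalue(delta.numel() - k + 1).values
    mask = delta.abs() >= threshold
    return W_orig + delta * mask

# Example with hypothetical weights edited by some editing method.
W_orig = torch.randn(1024, 4096) * 0.02
W_edited = W_orig + torch.randn(1024, 4096) * 0.001
W_reg = regularize_update(W_orig, W_edited)
print(f"fraction of update kept: {(W_reg != W_orig).float().mean().item():.4f}")
```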
- Edit at your own risk: evaluating the robustness of edited models to distribution shifts [0.0]
We investigate how model editing affects the general robustness of a model, as well as the robustness of the specific behavior targeted by the edit.
We find that edits tend to reduce general robustness, but that the degree of degradation depends on the editing algorithm and layers chosen.
Motivated by these observations we introduce a new model editing algorithm, 1-layer interpolation (1-LI), which uses weight-space interpolation to navigate the trade-off between editing task accuracy and general robustness.
arXiv Detail & Related papers (2023-02-28T19:41:37Z)
- Memory-Based Model Editing at Scale [102.28475739907498]
Existing model editors struggle to accurately model an edit's intended scope.
We propose Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model (SERAC).
SERAC stores edits in an explicit memory and learns to reason over them to modulate the base model's predictions as needed.
arXiv Detail & Related papers (2022-06-13T23:40:34Z)
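The memory-based routing idea behind SERAC can be sketched as follows. The scope check here is a crude token-overlap heuristic standing in for SERAC's learned scope classifier and counterfactual model, and the model, edit, and names are purely illustrative.

```python
# Toy sketch of memory-based editing: edits live in an explicit memory; a scope
# check decides whether a query is covered by an edit, and only in-scope queries
# are answered from the memory instead of the base model.
from dataclasses import dataclass

@dataclass
class Edit:
    prompt: str
    new_answer: str

class MemoryEditedModel:
    def __init__(self, base_model, edits: list[Edit], overlap_threshold: float = 0.6):
        self.base_model = base_model          # any callable: str -> str
        self.edits = edits
        self.overlap_threshold = overlap_threshold

    def _in_scope(self, query: str, edit: Edit) -> bool:
        q, p = set(query.lower().split()), set(edit.prompt.lower().split())
        return len(q & p) / max(len(p), 1) >= self.overlap_threshold

    def __call__(self, query: str) -> str:
        for edit in self.edits:
            if self._in_scope(query, edit):
                return edit.new_answer        # answer modulated by the edit memory
        return self.base_model(query)         # out-of-scope: defer to the base model

edited = MemoryEditedModel(base_model=lambda q: "<base model answer>",
                           edits=[Edit("Who is the CEO of OpenContoso?", "Jane Doe")])
print(edited("Who is the CEO of OpenContoso?"))   # -> "Jane Doe"
print(edited("What is the capital of France?"))   # -> "<base model answer>"
```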