Long-form evaluation of model editing
- URL: http://arxiv.org/abs/2402.09394v2
- Date: Fri, 29 Mar 2024 21:17:23 GMT
- Title: Long-form evaluation of model editing
- Authors: Domenic Rosati, Robie Gonzales, Jinkun Chen, Xuemin Yu, Melis Erkan, Yahya Kayani, Satya Deepika Chavatapalli, Frank Rudzicz, Hassan Sajjad
- Abstract summary: We introduce long-form evaluation of model editing (LEME), a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings.
We benchmark a number of model editing techniques and present several findings, including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods.
- Score: 21.554925686287735
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Evaluations of model editing currently only use the "next few token" completions after a prompt. As a result, the impact of these methods on longer natural language generation is largely unknown. We introduce long-form evaluation of model editing (LEME), a novel evaluation protocol that measures the efficacy and impact of model editing in long-form generative settings. Our protocol consists of a machine-rated survey and a classifier which correlates well with human ratings. Importantly, we find that our protocol has very little relationship with previous short-form metrics (despite being designed to extend efficacy, generalization, locality, and portability into a long-form setting), indicating that our method introduces a novel set of dimensions for understanding model editing methods. Using this protocol, we benchmark a number of model editing techniques and present several findings, including that, while some methods (ROME and MEMIT) perform well in making consistent edits within a limited scope, they suffer much more from factual drift than other methods. Finally, we present a qualitative analysis that illustrates common failure modes in long-form generative settings, including internal consistency, lexical cohesion, and locality issues.
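As a rough illustration of what such a protocol involves (a sketch under stated assumptions, not the authors' released code), the snippet below generates an extended continuation from an already-edited model and hands it to a rating component; `survey_rater` is a hypothetical stand-in for the paper's machine-rated survey and classifier.

```python
# Hypothetical sketch of a LEME-style long-form evaluation step.
# Assumes `model_name` points at an already-edited checkpoint
# (e.g., a model modified with ROME or MEMIT).
from transformers import AutoModelForCausalLM, AutoTokenizer

def long_form_eval(model_name: str, prompt: str, survey_rater) -> dict:
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    text = tok.decode(out[0], skip_special_tokens=True)
    # `survey_rater` stands in for the machine-rated survey / classifier;
    # any callable mapping text to rubric scores would fit this sketch.
    return survey_rater(text)
```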
Related papers
- LEMoE: Advanced Mixture of Experts Adaptor for Lifelong Model Editing of Large Language Models [30.831866499812925]
Large language models (LLMs) require continual knowledge updates to stay abreast of ever-changing world facts.
We introduce LEMoE, an advanced Mixture of Experts (MoE) adaptor for lifelong model editing.
arXiv Detail & Related papers (2024-06-28T16:17:41Z)
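For readers new to Mixture of Experts (MoE) adaptors, below is a minimal, generic top-1 MoE layer in PyTorch; it illustrates only the routing idea and is not LEMoE's actual architecture (for clarity it computes every expert densely, where a real implementation dispatches tokens sparsely).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEAdaptor(nn.Module):
    """Generic top-1 mixture-of-experts adaptor (illustrative only)."""
    def __init__(self, d_model: int, n_experts: int = 4, d_hidden: int = 64):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)  # scores experts per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts))

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d_model); route each token to its top-1 expert.
        idx = self.router(h).argmax(dim=-1)                     # (batch, seq)
        out = torch.stack([e(h) for e in self.experts], dim=-1)
        gate = F.one_hot(idx, len(self.experts)).to(h.dtype).unsqueeze(-2)
        return h + (out * gate).sum(dim=-1)  # residual update
```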
- OLMES: A Standard for Language Model Evaluations [64.85905119836818]
We propose OLMES, a practical, open standard for reproducible language model evaluations.
We identify and review the varying factors in evaluation practices adopted by the community.
OLMES supports meaningful comparisons between smaller base models, which require the unnatural "cloze" formulation of multiple-choice questions, and larger models that can use the original format.
arXiv Detail & Related papers (2024-06-12T17:37:09Z)
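The difference between the unnatural "cloze" formulation and the original multiple-choice format can be made concrete; the templates below are illustrative assumptions, not OLMES's exact prompts.

```python
question = "What is the boiling point of water at sea level?"
choices = ["90 C", "100 C", "110 C", "120 C"]

# Cloze formulation: score each answer as a free-text continuation and pick
# the choice the model assigns the highest likelihood.
cloze_inputs = [f"{question} {c}" for c in choices]

# Original MCQ formulation: show lettered options and ask for one letter.
options = "\n".join(f"{l}. {c}" for l, c in zip("ABCD", choices))
mcq_input = f"{question}\n{options}\nAnswer:"
```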
- The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse [58.0132400208411]
Even a single edit can trigger model collapse, manifesting as significant performance degradation in various benchmark tasks.
However, benchmarking large language models after each edit is impractically time-consuming and resource-intensive.
We use GPT-3.5 to develop a new dataset, HardEdit, based on hard editing cases.
arXiv Detail & Related papers (2024-02-15T01:50:38Z)
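One simple way to operationalize collapse detection (the paper's exact protocol may differ) is to track perplexity on a fixed reference corpus after each edit and flag edits that blow it up past a tolerance factor; `tol` and the corpus are illustrative choices.

```python
import math
import torch

def perplexity(model, tok, texts) -> float:
    # Mean language-modeling loss over a small reference corpus, exponentiated.
    losses = []
    for t in texts:
        ids = tok(t, return_tensors="pt").input_ids
        with torch.no_grad():
            losses.append(model(ids, labels=ids).loss.item())
    return math.exp(sum(losses) / len(losses))

def collapsed(model, tok, texts, baseline_ppl: float, tol: float = 1.5) -> bool:
    # Flag collapse when perplexity exceeds the pre-edit baseline by `tol`.
    return perplexity(model, tok, texts) > tol * baseline_ppl
```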
- Propagation and Pitfalls: Reasoning-based Assessment of Knowledge Editing through Counterfactual Tasks [36.292901021210575]
We introduce a novel reasoning-based benchmark, ReCoE (Reasoning-based Counterfactual Editing dataset).
We conduct a thorough analysis of existing knowledge editing techniques, including input augmentation, finetuning, and locate-and-edit.
All model editing methods show notably low performance on this dataset, especially in certain reasoning schemes.
arXiv Detail & Related papers (2024-01-31T04:12:59Z)
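The reasoning-based idea can be illustrated with a hypothetical counterfactual probe: a successful edit should change not only the direct completion but also answers entailed through composition. The triple and probes below are invented examples, not ReCoE data.

```python
# Counterfactual edit: pretend the capital of France is Rome.
edit = {"subject": "France", "relation": "capital", "target": "Rome"}

direct_probe = "The capital of France is"  # should now complete with "Rome"
# Composed probe: the answer must propagate through one reasoning hop.
composed_probe = "The river that flows through the capital of France is"
expected_after_edit = "the Tiber"  # entailed by the edit, not stated in it
```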
- Model Editing at Scale leads to Gradual and Catastrophic Forgetting [2.569159339315845]
We evaluate current model editing methods at scale, focusing on two state-of-the-art methods: ROME and MEMIT.
We find that as the model is edited sequentially with multiple facts, it progressively forgets previously edited facts and loses the ability to perform downstream tasks.
arXiv Detail & Related papers (2024-01-15T03:57:15Z)
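A minimal sketch of how such sequential forgetting can be measured: after the k-th edit, re-test every earlier edit. `apply_edit` and `is_correct` are stand-ins for an editing method (e.g., ROME or MEMIT) and an answer checker.

```python
def sequential_edit_retention(model, edits, apply_edit, is_correct):
    """After each edit, return the fraction of all edits so far that
    still take effect (1.0 = no forgetting)."""
    retention = []
    for k, edit in enumerate(edits, start=1):
        model = apply_edit(model, edit)
        kept = sum(is_correct(model, e) for e in edits[:k])
        retention.append(kept / k)
    return retention
```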
- Fast and Accurate Factual Inconsistency Detection Over Long Documents [19.86348214462828]
We introduce SCALE, a task-agnostic model for detecting factual inconsistencies using a novel chunking strategy.
This approach achieves state-of-the-art performance in factual inconsistency detection for diverse tasks and long inputs.
Our code and data are publicly available on GitHub.
arXiv Detail & Related papers (2023-10-19T22:55:39Z)
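In the spirit of (though not identical to) SCALE's chunking strategy, a long source can be split into overlapping chunks and a claim scored against each, taking the best chunk as evidence; `nli_score` is a placeholder for any sentence-pair entailment scorer.

```python
def max_chunk_entailment(nli_score, source: str, claim: str,
                         chunk_size: int = 512, stride: int = 256) -> float:
    # Overlapping character-level chunks; real systems would chunk by tokens.
    chunks = [source[i:i + chunk_size]
              for i in range(0, max(len(source) - chunk_size, 0) + 1, stride)]
    # A claim counts as consistent if at least one chunk strongly entails it.
    return max(nli_score(chunk, claim) for chunk in chunks)
```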
- EvalCrafter: Benchmarking and Evaluating Large Video Generation Models [70.19437817951673]
We argue that it is hard to judge large conditional generative models with simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities.
Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation.
Then, we evaluate state-of-the-art video generative models on our carefully designed benchmark in terms of visual quality, content quality, motion quality, and text-video alignment, using 17 well-selected objective metrics.
arXiv Detail & Related papers (2023-10-17T17:50:46Z)
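A trivial way to combine per-aspect scores into one benchmark number is a weighted average over normalized metrics; note that EvalCrafter's actual aggregation is fit against human opinions, so the aspect names, weights, and values below are placeholders.

```python
def aggregate(scores: dict, weights: dict) -> float:
    # scores: aspect -> normalized value in [0, 1]
    return sum(weights[a] * scores[a] for a in weights) / sum(weights.values())

final_score = aggregate(
    {"visual_quality": 0.71, "content_quality": 0.64,
     "motion_quality": 0.58, "text_video_alignment": 0.66},
    {"visual_quality": 1.0, "content_quality": 1.0,
     "motion_quality": 1.0, "text_video_alignment": 1.0})
```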
- Edit at your own risk: evaluating the robustness of edited models to distribution shifts [0.0]
We investigate how model editing affects the general robustness of a model, as well as the robustness of the specific behavior targeted by the edit.
We find that edits tend to reduce general robustness, but that the degree of degradation depends on the editing algorithm and layers chosen.
Motivated by these observations, we introduce a new model editing algorithm, 1-layer interpolation (1-LI), which uses weight-space interpolation to navigate the trade-off between editing-task accuracy and general robustness.
arXiv Detail & Related papers (2023-02-28T19:41:37Z)
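The weight-space idea behind 1-LI can be sketched as linear interpolation between the base and edited parameters; the function below is an assumption-laden illustration of that trade-off knob, not the paper's exact procedure.

```python
def interpolate_weights(theta_base: dict, theta_edited: dict, alpha: float):
    # alpha = 0.0 recovers the base model; alpha = 1.0 the fully edited one.
    return {name: (1 - alpha) * theta_base[name] + alpha * theta_edited[name]
            for name in theta_base}

# Usage with PyTorch state dicts (hypothetical):
# model.load_state_dict(interpolate_weights(base_sd, edited_sd, alpha=0.5))
```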
- Memory-Based Model Editing at Scale [102.28475739907498]
Existing model editors struggle to accurately model an edit's intended scope.
We propose Semi-Parametric Editing with a Retrieval-Augmented Counterfactual Model (SERAC).
SERAC stores edits in an explicit memory and learns to reason over them to modulate the base model's predictions as needed.
arXiv Detail & Related papers (2022-06-13T23:40:34Z)
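A schematic of the semi-parametric design, with a naive linear scan standing in for SERAC's learned scope classifier and retrieval:

```python
class SemiParametricEditor:
    """Toy illustration of the SERAC idea, not the trained system."""
    def __init__(self, base_model, counterfactual_model, in_scope):
        self.memory = []                 # explicit store of (prompt, answer) edits
        self.base = base_model           # frozen original model
        self.cf = counterfactual_model   # reasons over a retrieved edit
        self.in_scope = in_scope         # (query, edit) -> bool scope classifier

    def add_edit(self, prompt: str, answer: str) -> None:
        self.memory.append((prompt, answer))

    def __call__(self, query: str) -> str:
        for edit in self.memory:
            if self.in_scope(query, edit):
                return self.cf(query, edit)  # edit applies: override base
        return self.base(query)              # out of scope: leave unchanged
```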
- Towards a Unified View of Parameter-Efficient Transfer Learning [108.94786930869473]
Fine-tuning large pre-trained language models on downstream tasks has become the de facto learning paradigm in NLP.
Recent work has proposed a variety of parameter-efficient transfer learning methods that only fine-tune a small number of (extra) parameters to attain strong performance.
We break down the design of state-of-the-art parameter-efficient transfer learning methods and present a unified framework that establishes connections between them.
arXiv Detail & Related papers (2021-10-08T20:22:26Z)
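The unified framework can be read as: each method adds a learned delta of the form h <- h + s * f(x W_down) W_up, with methods differing in where the delta attaches and how it is gated. A generic sketch (not the paper's code):

```python
import torch
import torch.nn as nn

class LowRankDelta(nn.Module):
    """Generic h <- h + s * up(f(down(x))) modification (illustrative)."""
    def __init__(self, d_model: int, r: int = 8, scale: float = 1.0):
        super().__init__()
        self.down = nn.Linear(d_model, r, bias=False)  # W_down
        self.up = nn.Linear(r, d_model, bias=False)    # W_up
        self.scale = scale                             # s

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # Parallel insertion, LoRA-style: the delta is computed from the
        # sublayer input x and added to the sublayer output h.
        return h + self.scale * self.up(self.down(x))
```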
- Pros and Cons of GAN Evaluation Measures: New Developments [53.10151901863263]
This work is an update of a previous paper on the same topic published a few years ago.
I describe new dimensions that are becoming important in assessing models, and discuss the connection between GAN evaluation and deepfakes.
arXiv Detail & Related papers (2021-03-17T01:48:34Z)
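As one concrete example of a widely used GAN evaluation measure covered by this line of work, Frechet Inception Distance (FID) compares Gaussian fits of real and generated feature distributions; a minimal NumPy/SciPy sketch (Inception feature extraction omitted):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_fake: np.ndarray) -> float:
    # Fit a Gaussian (mean, covariance) to each set of feature vectors.
    mu1, mu2 = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    c1 = np.cov(feats_real, rowvar=False)
    c2 = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2).real  # matrix square root of the product
    return float(((mu1 - mu2) ** 2).sum() + np.trace(c1 + c2 - 2 * covmean))
```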