ScEdit: Script-based Assessment of Knowledge Editing
- URL: http://arxiv.org/abs/2505.23291v2
- Date: Mon, 02 Jun 2025 14:05:59 GMT
- Title: ScEdit: Script-based Assessment of Knowledge Editing
- Authors: Xinye Li, Zunwen Zheng, Qian Zhang, Dekai Zhuang, Jiabao Kang, Liyan Xu, Qingbin Liu, Xi Chen, Zhiying Tu, Dianhui Chu, Dianbo Sui
- Abstract summary: Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. We introduce a novel script-based benchmark -- ScEdit (Script-based Knowledge Editing Benchmark) -- which encompasses both counterfactual and temporal edits. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task.
- Score: 13.628279976661934
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Editing (KE) has gained increasing attention, yet current KE tasks remain relatively simple. Under current evaluation frameworks, many editing methods achieve exceptionally high scores, sometimes nearing perfection. However, few studies integrate KE into real-world application scenarios (e.g., recent interest in LLM-as-agent). To support our analysis, we introduce a novel script-based benchmark -- ScEdit (Script-based Knowledge Editing Benchmark) -- which encompasses both counterfactual and temporal edits. We integrate token-level and text-level evaluation methods, comprehensively analyzing existing KE techniques. The benchmark extends traditional fact-based ("What"-type question) evaluation to action-based ("How"-type question) evaluation. We observe that all KE methods exhibit a drop in performance on established metrics and face challenges on text-level metrics, indicating a challenging task. Our benchmark is available at https://github.com/asdfo123/ScEdit.
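To make the task concrete, below is a minimal sketch of what one script-based editing item and its two evaluation views could look like. The field names, the example edit, and both checks are illustrative assumptions made for exposition; they are not the actual ScEdit data schema or evaluation code (see the linked repository for the real benchmark).

```python
# Hypothetical sketch of a script-based KE evaluation item; not the ScEdit schema.
from dataclasses import dataclass

@dataclass
class ScriptEditItem:
    subject: str          # e.g., "Eiffel Tower"
    relation: str         # e.g., "located in"
    old_object: str       # pre-edit fact (e.g., "Paris")
    new_object: str       # counterfactual or temporal target (e.g., "Rome")
    what_question: str    # fact-based probe ("What"-type)
    how_prompt: str       # action-based script probe ("How"-type)

def token_level_success(logprob_new: float, logprob_old: float) -> bool:
    """Token-level view: after editing, the model should assign the new object
    a higher log-probability than the old one when answering the "What" probe."""
    return logprob_new > logprob_old

def text_level_success(generated_script: str, item: ScriptEditItem) -> bool:
    """Crude text-level view: the script generated for the "How" probe should
    reflect the edit (mention the new object, avoid the old one)."""
    text = generated_script.lower()
    return item.new_object.lower() in text and item.old_object.lower() not in text

item = ScriptEditItem(
    subject="Eiffel Tower", relation="located in",
    old_object="Paris", new_object="Rome",
    what_question="In which city is the Eiffel Tower located?",
    how_prompt="Give step-by-step instructions for visiting the Eiffel Tower.",
)
print(token_level_success(-1.2, -3.5))                          # True
print(text_level_success("1. Fly to Rome. 2. Take a taxi.", item))  # True
```

Even in this toy version, the point made in the abstract is visible: passing a token-level check says nothing about whether a multi-step script (the "How"-type output) stays consistent with the edit, which is what the text-level metrics probe.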
Related papers
- Context Robust Knowledge Editing for Language Models [10.634048842551662]
We develop CHED, a benchmark designed to evaluate the context robustness of knowledge editing methods. Evaluations on CHED show that existing KE methods often fail when preceding contexts are present. We introduce CoRE, a KE method designed to strengthen context robustness.
arXiv Detail & Related papers (2025-05-29T03:11:53Z)
- Benchmarking and Rethinking Knowledge Editing for Large Language Models [34.80161437154527]
Knowledge editing aims to update embedded knowledge within Large Language Models (LLMs). Existing approaches, whether through parameter modification or external memory integration, often suffer from inconsistent evaluation objectives and experimental setups. This study offers new insights into the limitations of current knowledge editing methods and highlights the potential of context-based reasoning as a more robust alternative.
arXiv Detail & Related papers (2025-05-24T13:32:03Z)
- The Mirage of Model Editing: Revisiting Evaluation in the Wild [70.17413507444704]
We study the effectiveness of model editing in question answering applications. Our single editing experiments indicate that current editing methods perform substantially worse than previously reported. Our analysis provides a fundamental reexamination of both the real-world applicability of existing model editing methods and their evaluation practices.
arXiv Detail & Related papers (2025-02-16T15:57:55Z)
- ComprehendEdit: A Comprehensive Dataset and Evaluation Framework for Multimodal Knowledge Editing [27.034072044001736]
Large multimodal language models (MLLMs) have revolutionized natural language processing and visual understanding. Current knowledge editing evaluations are limited in scope and potentially biased. We introduce ComprehendEdit, a comprehensive benchmark comprising eight diverse tasks from multiple datasets.
arXiv Detail & Related papers (2024-12-17T11:41:49Z)
- ETHIC: Evaluating Large Language Models on Long-Context Tasks with High Information Coverage [21.036912648701264]
We introduce a new metric called information coverage (IC), which quantifies the proportion of the input context necessary for answering queries. We present ETHIC, a novel benchmark designed to assess LLMs' ability to leverage the entire context.
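As a rough illustration of the IC idea in this entry, the sketch below computes the fraction of context tokens that fall inside the spans needed to answer a query; the whitespace tokenization and the notion of "needed spans" are simplifying assumptions, not ETHIC's actual definition.

```python
# Hypothetical illustration of information coverage (IC); not ETHIC's implementation.
def information_coverage(context: str, needed_spans: list[str]) -> float:
    """Fraction of context tokens that are required to answer the query."""
    total = len(context.split())
    needed = sum(len(span.split()) for span in needed_spans)
    return needed / total if total else 0.0

# A long context where only a 5-token span matters -> low IC.
context = "filler " * 1000 + "the answer is forty two"
print(information_coverage(context, ["the answer is forty two"]))  # ~0.005
```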
arXiv Detail & Related papers (2024-10-22T09:35:42Z)
- Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph [83.90988015005934]
Uncertainty quantification is a key element of machine learning applications. We introduce a novel benchmark that implements a collection of state-of-the-art UQ baselines. We conduct a large-scale empirical investigation of UQ and normalization techniques across eleven tasks, identifying the most effective approaches.
arXiv Detail & Related papers (2024-06-21T20:06:31Z)
- VLKEB: A Large Vision-Language Model Knowledge Editing Benchmark [53.091690659399234]
Knowledge editing on large language models (LLMs) has received considerable attention.
The existing LVLM editing benchmark, which comprises three metrics (Reliability, Locality, and Generality), falls short in the quality of synthesized evaluation images.
We employ more reliable data collection methods to construct a new Large Vision-Language Model Knowledge Editing Benchmark, VLKEB.
arXiv Detail & Related papers (2024-03-12T06:16:33Z)
- DocTER: Evaluating Document-based Knowledge Editing [53.14000724633775]
We explore knowledge editing using easily accessible documents instead of manually labeled factual triples. A comprehensive four-perspective evaluation is introduced: Edit Success, Locality, Reasoning, and Cross-lingual Transfer. Experiments on popular knowledge editing methods demonstrate that editing with documents presents significantly greater challenges than using triples.
arXiv Detail & Related papers (2023-08-19T09:17:19Z)
- EditEval: An Instruction-Based Benchmark for Text Improvements [73.5918084416016]
This work presents EditEval: an instruction-based benchmark and evaluation suite for the automatic evaluation of editing capabilities.
We evaluate several pre-trained models, which shows that InstructGPT and PEER perform the best, but that most baselines fall below the supervised SOTA.
Our analysis shows that commonly used metrics for editing tasks do not always correlate well, and that optimization for prompts with the highest performance does not necessarily entail the strongest robustness to different models.
arXiv Detail & Related papers (2022-09-27T12:26:05Z)
- GO FIGURE: A Meta Evaluation of Factuality in Summarization [131.1087461486504]
We introduce GO FIGURE, a meta-evaluation framework for evaluating factuality evaluation metrics.
Our benchmark analysis on ten factuality metrics reveals that our framework provides a robust and efficient evaluation.
It also reveals that while QA metrics generally improve over standard metrics that measure factuality across domains, performance is highly dependent on the way in which questions are generated.
arXiv Detail & Related papers (2020-10-24T08:30:20Z)
- Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary [65.37544133256499]
We propose a metric to evaluate the content quality of a summary using question-answering (QA).
We demonstrate the experimental benefits of QA-based metrics through an analysis of our proposed metric, QAEval.
arXiv Detail & Related papers (2020-10-01T15:33:09Z)
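The last entry describes QA-based evaluation of summary content. The sketch below shows the general shape of such a metric: answer questions derived from the reference summary using only the candidate summary, then average an answer-overlap score. The toy token-F1 matcher and the answer_fn callback are illustrative assumptions, not QAEval's actual question-generation or answering models.

```python
# Hypothetical shape of a QA-based summary metric; not QAEval's implementation.
from typing import Callable

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted answer and a gold answer."""
    p, g = pred.lower().split(), gold.lower().split()
    common = sum(min(p.count(t), g.count(t)) for t in set(p))
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(g)
    return 2 * precision * recall / (precision + recall)

def qa_based_score(qa_pairs: list[tuple[str, str]], answer_fn: Callable[[str], str]) -> float:
    """Average answer F1 over (question, gold_answer) pairs drawn from the reference."""
    scores = [token_f1(answer_fn(question), gold) for question, gold in qa_pairs]
    return sum(scores) / len(scores) if scores else 0.0

# Toy usage: the "answerer" just returns the candidate summary verbatim.
pairs = [("Who won the race?", "the red team"), ("Where was it held?", "Berlin")]
print(qa_based_score(pairs, lambda q: "the red team won in Berlin"))  # ~0.48
```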