COMPKE: Complex Question Answering under Knowledge Editing
- URL: http://arxiv.org/abs/2506.00829v2
- Date: Tue, 03 Jun 2025 16:03:55 GMT
- Title: COMPKE: Complex Question Answering under Knowledge Editing
- Authors: Keyuan Cheng, Zijian Kan, Zhixian He, Zhuoran Zhang, Muhammad Asif Ali, Ke Xu, Lijie Hu, Di Wang
- Abstract summary: Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. We introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models.
- Score: 10.447078471142044
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge editing, which efficiently modifies the knowledge in large language models, has attracted great attention. Current benchmarks primarily use multi-hop question answering to assess and analyze newly injected or updated knowledge. However, we argue that these benchmarks fail to effectively evaluate how well the updated models apply this knowledge in real-life scenarios, particularly when questions require complex reasoning involving one-to-many relationships or multi-step logical intersections. To fill this gap, we introduce a new benchmark, COMPKE: Complex Question Answering under Knowledge Editing, which includes 11,924 complex questions that reflect real-life situations. We conduct an extensive evaluation of four knowledge editing methods on COMPKE, revealing that their effectiveness varies notably across different models. For instance, MeLLo attains an accuracy of 39.47 on GPT-4O-MINI, but this drops sharply to 3.83 on QWEN2.5-3B. We further investigate the underlying causes of these disparities from both methodological and model-specific perspectives. The datasets are available at https://github.com/kzjkzj666/CompKE.
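To make the evaluation setting concrete, below is a minimal sketch of how an edited model might be scored on complex questions whose gold answers are sets (one-to-many relations) or intersections of constraints. The field names (`question`, `answers`) and the set-level F1 metric are illustrative assumptions, not COMPKE's actual schema or metric; see the linked repository for the real format.

```python
# Hypothetical evaluation loop for complex questions whose gold answers are
# sets (one-to-many relations) or intersections of constraints. Field names
# are illustrative; see https://github.com/kzjkzj666/CompKE for the schema.

def set_f1(pred: set, gold: set) -> float:
    """Set-level F1: exact match is too strict for one-to-many answers."""
    if not pred or not gold:
        return float(pred == gold)
    p = len(pred & gold) / len(pred)
    r = len(pred & gold) / len(gold)
    return 2 * p * r / (p + r) if p + r else 0.0

def evaluate(edited_model, questions) -> float:
    """Average score of an edited model over a list of benchmark questions."""
    scores = [set_f1(set(edited_model(q["question"])), set(q["answers"]))
              for q in questions]
    return sum(scores) / len(scores)

# Toy usage: a "model" that returns an answer set for an intersection question.
toy = lambda q: ["Alice", "Bob"]
print(evaluate(toy, [{"question": "Who works at X and lives in Y?",
                      "answers": ["Alice", "Bob", "Carol"]}]))  # 0.8
```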
Related papers
- Knowledge Editing for Multi-Hop Question Answering Using Semantic Analysis [4.926795473283984]
Large Language Models (LLMs) require lightweight avenues for updating stored information that has fallen out of date. We propose CHECK, a knowledge editor for multi-hop question answering (MQA) based on semantic analysis.
arXiv Detail & Related papers (2025-07-29T19:58:22Z)
- PropMEND: Hypernetworks for Knowledge Propagation in LLMs [82.99849359892112]
We present PropMEND, a hypernetwork-based approach for knowledge propagation. It nearly doubles accuracy on challenging multi-hop questions whose answers are not explicitly stated in the injected fact. We also introduce a new dataset, Controlled RippleEdit, to evaluate the generalization of our hypernetwork.
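As a rough illustration of the mechanism (not the paper's implementation), a MEND-style hypernetwork can be sketched as a small network that transforms the gradient of an edit loss into a parameter update. The toy model, shapes, and `hyper` network below are invented; in practice the hypernetwork is meta-trained so that the predicted update also propagates the injected fact to downstream questions.

```python
import torch
import torch.nn as nn

# Toy "LM": a linear map from one-hot subject ids to object logits.
lm = nn.Linear(8, 8, bias=False)

# Hypernetwork: maps the flattened edit-loss gradient to a parameter update.
hyper = nn.Sequential(nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 64))

def edit(subject_id: int, new_object_id: int) -> None:
    x = torch.eye(8)[subject_id].unsqueeze(0)
    loss = nn.functional.cross_entropy(lm(x), torch.tensor([new_object_id]))
    (grad,) = torch.autograd.grad(loss, lm.weight)
    delta = hyper(grad.flatten()).view(8, 8)
    with torch.no_grad():
        lm.weight -= delta  # apply the predicted, gradient-shaped update

edit(subject_id=3, new_object_id=5)  # inject the fact "subject 3 -> object 5"
```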
arXiv Detail & Related papers (2025-06-10T15:44:19Z)
- MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge [24.66666826440994]
MINTQA is a benchmark to evaluate large language models' capabilities in multi-hop reasoning. MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge. Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries.
arXiv Detail & Related papers (2024-12-22T14:17:12Z)
- Konstruktor: A Strong Baseline for Simple Knowledge Graph Question Answering [60.6042489577575]
We introduce Konstruktor - an efficient and robust approach that breaks down the problem into three steps.
Our approach integrates language models and knowledge graphs, exploiting the power of the former and the interpretability of the latter.
We show that for relation detection, the most challenging step of the workflow, a combination of relation classification/generation and ranking outperforms other methods.
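As a concrete reading of that pipeline, the sketch below wires three toy stand-ins together: entity linking, relation detection by ranking candidate relations, and a knowledge-graph lookup. The step decomposition is paraphrased from the abstract; every function here is a simplistic placeholder, not the paper's components.

```python
# Hypothetical three-step pipeline in the spirit of Konstruktor: (1) entity
# extraction/linking, (2) relation detection via ranking, (3) KG query.

KG = {("douglas adams", "author_of"): "the hitchhiker's guide to the galaxy"}

def link_entity(question: str) -> str:
    # Step 1: naive entity linking - longest KG entity mentioned in the question.
    entities = {e for e, _ in KG}
    return max((e for e in entities if e in question.lower()), key=len, default="")

def detect_relation(question: str, entity: str) -> str:
    # Step 2: rank candidate relations by keyword overlap with the question.
    candidates = [r for e, r in KG if e == entity]
    return max(candidates, default="",
               key=lambda r: sum(w in question.lower() for w in r.split("_")))

def answer(question: str) -> str:
    entity = link_entity(question)
    relation = detect_relation(question, entity)
    # Step 3: look up the (entity, relation) pair in the knowledge graph.
    return KG.get((entity, relation), "unknown")

print(answer("Which book is Douglas Adams the author of?"))
```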
arXiv Detail & Related papers (2024-09-24T09:19:11Z)
- LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration in Evolving Environments [35.3938477255058]
This paper introduces Graph Memory-based Editing for Large Language Models (GMeLLo), a straightforward and effective method that merges the explicit knowledge representation of knowledge graphs with the linguistic flexibility of large language models. Our results show that GMeLLo significantly surpasses current state-of-the-art knowledge editing methods on the multi-hop question answering benchmark MQuAKE.
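The core idea can be illustrated with a small sketch: edited facts live in an explicit triple store, and an LLM (stubbed here as a lookup table) translates the question into a relation chain that is resolved hop by hop against the graph. The fact triples and the `question_to_chain` stub are invented examples, not GMeLLo's actual prompts or data.

```python
# Illustrative sketch: keep edited facts in an explicit triple store and let
# an LLM (stubbed) parse the question into a chain of relations to follow.

triples = {}  # (subject, relation) -> object

def apply_edit(subject, relation, obj):
    triples[(subject, relation)] = obj  # an edit overwrites the stored fact

def question_to_chain(question):
    # Stand-in for the LLM's semantic parse; a real system would prompt a model.
    return {"Who leads the country where the Eiffel Tower is?":
            ("Eiffel Tower", ["located_in", "head_of_state"])}[question]

def multi_hop_answer(question):
    entity, relations = question_to_chain(question)
    for rel in relations:               # follow the chain through the graph
        entity = triples[(entity, rel)]
    return entity

apply_edit("Eiffel Tower", "located_in", "France")
apply_edit("France", "head_of_state", "Emmanuel Macron")
apply_edit("France", "head_of_state", "Jane Doe")  # a knowledge edit
print(multi_hop_answer("Who leads the country where the Eiffel Tower is?"))
```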
arXiv Detail & Related papers (2024-08-28T16:15:45Z)
- Establishing Knowledge Preference in Language Models [80.70632813935644]
Language models are known to encode a great amount of factual knowledge through pretraining.
However, such knowledge alone may be insufficient to satisfy user requests.
When answering questions about ongoing events, the model should use recent news articles to update its response.
When some facts are edited in the model, the updated facts should override all prior knowledge learned by the model.
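A minimal sketch of that preference order, with toy stand-ins for all three knowledge sources: explicit edits take priority, retrieved context comes second, and parametric memory is the fallback. The resolver below is illustrative only, not the paper's method.

```python
# Toy resolver for the preference order: edits > retrieved context > pretraining.

edited_facts = {"capital of X": "New City"}        # explicit knowledge edits
retrieved_context = {"latest GDP of X": "2.1T"}    # e.g., recent news articles

def parametric(q):                                  # frozen pretraining knowledge
    return {"capital of X": "Old City"}.get(q, "unknown")

def answer(q):
    if q in edited_facts:        # 1. edits override everything
        return edited_facts[q]
    if q in retrieved_context:   # 2. fresh context overrides pretraining
        return retrieved_context[q]
    return parametric(q)         # 3. fall back to parametric memory

print(answer("capital of X"))  # "New City", not the stale parametric answer
```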
arXiv Detail & Related papers (2024-07-17T23:16:11Z)
- Leveraging Logical Rules in Knowledge Editing: A Cherry on the Top [12.982138813457812]
Multi-hop question answering (MQA) under knowledge editing (KE) is a key challenge for Large Language Models (LLMs).
We propose a novel framework named RULE-KE, i.e., RULE-based Knowledge Editing, a "cherry on the top" that augments the performance of all existing MQA methods under KE.
Experimental evaluation using existing and newly curated datasets shows that RULE-KE improves the performance of parameter-based and memory-based solutions by up to 92% and 112.9%, respectively.
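As a hypothetical illustration of how logical rules can propagate an edit, the sketch below applies rules such as spouse(X, Y) -> spouse(Y, X) to derive facts correlated with an edited triple, keeping the fact store consistent. The rules and facts are toy examples, not the paper's rule discovery or editing machinery.

```python
# Toy rule-based propagation of a knowledge edit through correlated facts.

facts = {("alice", "spouse"): "bob"}

# Each rule maps an edited triple to the extra triples it entails.
rules = [
    lambda s, r, o: [(o, "spouse", s)] if r == "spouse" else [],      # symmetry
    lambda s, r, o: [(s, "partner_of", o)] if r == "spouse" else [],  # implication
]

def edit_with_rules(subject, relation, obj):
    queue = [(subject, relation, obj)]
    while queue:
        s, r, o = queue.pop()
        if facts.get((s, r)) == o:
            continue                      # already consistent; avoids loops
        facts[(s, r)] = o
        for rule in rules:
            queue.extend(rule(s, r, o))   # propagate entailed updates

edit_with_rules("alice", "spouse", "carol")
print(facts)  # alice's spouse, carol's spouse, and partner_of all updated
```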
arXiv Detail & Related papers (2024-05-24T11:30:00Z)
- Retrieval-enhanced Knowledge Editing in Language Models for Multi-Hop Question Answering [47.199078631274745]
Large Language Models (LLMs) have shown proficiency in question-answering tasks but often struggle to integrate real-time knowledge.
We propose the Retrieval-Augmented model Editing (RAE) framework for multi-hop question answering.
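A rough sketch of the retrieve-then-prompt loop: edited facts are stored as text, the most relevant facts for a question are retrieved and pruned, and the survivors are prepended to the prompt of a frozen model. Plain word overlap below stands in for the paper's mutual-information-based retrieval, and all facts are invented.

```python
# Simplified retrieval-augmented editing: retrieve and prune edited facts,
# then build a prompt for a frozen LLM. Scoring is a word-overlap stand-in.

edited_facts = [
    "The CEO of Acme is Dana Lee.",
    "Dana Lee was born in Lisbon.",
    "The Acme headquarters is in Toronto.",
]

def score(question: str, fact: str) -> int:
    q = set(question.lower().replace("?", "").split())
    return len(q & set(fact.lower().rstrip(".").split()))

def retrieve(question: str, k: int = 2) -> list:
    # Keep only the top-k facts; pruning avoids misleading the frozen model.
    return sorted(edited_facts, key=lambda f: score(question, f), reverse=True)[:k]

def build_prompt(question: str) -> str:
    facts = "\n".join(retrieve(question))
    return f"Facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

print(build_prompt("Where was the CEO of Acme born?"))
```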
arXiv Detail & Related papers (2024-03-28T17:47:19Z)
- Robust and Scalable Model Editing for Large Language Models [75.95623066605259]
We propose EREN (Edit models by REading Notes) to improve the scalability and robustness of LLM editing.
Unlike existing techniques, it can integrate knowledge from multiple edits and correctly respond to syntactically similar but semantically unrelated inputs.
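The notes idea can be sketched as follows: every edit is stored as a natural-language note, a relevance check gates which notes apply, and unrelated questions deliberately fall through to the base model. The overlap heuristic stands in for EREN's learned relevance judgment; the notes and questions are invented.

```python
# Toy note-based editing: relevant notes answer the question; unrelated
# questions (even syntactically similar ones) fall back to the base model.

notes = ["The prime minister of Atlantis is Mira Chen."]

STOPWORDS = {"who", "what", "is", "the", "of", "a", "an", "in"}

def content_words(text: str) -> set:
    return {w.strip("?.,").lower() for w in text.split()} - STOPWORDS

def relevant_notes(question: str) -> list:
    # EREN would ask the model whether a note applies; overlap is a stand-in.
    q = content_words(question)
    return [n for n in notes if len(q & content_words(n)) >= 2]

def answer(question: str, base_llm) -> str:
    hits = relevant_notes(question)
    if hits:
        return f"(from notes) {hits[0]}"
    return base_llm(question)  # unrelated inputs reach the unedited model

base = lambda q: "(from base model)"
print(answer("Who is the prime minister of Atlantis?", base))  # uses the note
print(answer("Who is the king of Freedonia?", base))           # falls through
```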
arXiv Detail & Related papers (2024-03-26T06:57:23Z)
- R-Tuning: Instructing Large Language Models to Say 'I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance, but they still face challenges.
Previous instruction tuning methods force the model to complete a sentence regardless of whether it possesses the relevant knowledge.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning).
Experimental results demonstrate that R-Tuning effectively improves a model's ability to answer known questions and to refrain from answering unknown ones.
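A rough sketch of refusal-aware data construction under one common reading of the idea: questions the base model already answers correctly keep their gold labels, while the rest are relabeled with a refusal. The exact refusal template and partition procedure in the paper may differ.

```python
# Toy refusal-aware data construction: partition questions into known and
# unknown by comparing the model's answers against the gold labels.

REFUSAL = "I don't know."

def build_r_tuning_data(model, qa_pairs):
    known, unknown = [], []
    for question, gold in qa_pairs:
        if model(question).strip().lower() == gold.strip().lower():
            known.append((question, gold))        # model knows this fact
        else:
            unknown.append((question, REFUSAL))   # teach refusal instead
    return known + unknown

# Toy model that only "knows" one fact.
toy_model = lambda q: {"capital of France?": "Paris"}.get(q, "")
data = build_r_tuning_data(toy_model, [
    ("capital of France?", "Paris"),
    ("capital of Wakanda?", "Birnin Zana"),
])
print(data)  # the second question is mapped to the refusal answer
```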
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
- RECKONING: Reasoning through Dynamic Knowledge Encoding [51.076603338764706]
We show that language models can answer questions by reasoning over knowledge provided as part of the context.
However, when the provided context also contains irrelevant information, the model fails to distinguish the knowledge that is necessary to answer the question.
We propose teaching the model to reason more robustly by folding the provided contextual knowledge into the model's parameters.
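A minimal sketch of the knowledge-folding step, assuming a toy next-token model: a short inner loop of gradient updates memorizes the provided facts in the parameters, after which the model answers with no facts in its input. The vocabulary, model, and hyperparameters are all invented; the paper trains this procedure end to end with bi-level optimization.

```python
import torch
import torch.nn as nn

# Toy next-token model over a tiny vocabulary (a stand-in for a real LM).
vocab = ["alice", "bob", "paris", "london"]
idx = {w: i for i, w in enumerate(vocab)}
model = nn.Sequential(nn.Embedding(len(vocab), 16),
                      nn.Flatten(), nn.Linear(16, len(vocab)))

def fold_knowledge(facts, steps=25, lr=0.5):
    """Inner loop: gradient steps that write (subject -> object) facts
    into the model's parameters instead of its input context."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        for subj, obj in facts:
            logits = model(torch.tensor([[idx[subj]]]))
            loss = nn.functional.cross_entropy(logits, torch.tensor([idx[obj]]))
            opt.zero_grad()
            loss.backward()
            opt.step()

fold_knowledge([("alice", "paris"), ("bob", "london")])
# The question is now answered from parameters, with no facts in the input.
pred = model(torch.tensor([[idx["alice"]]])).argmax().item()
print(vocab[pred])  # expected: "paris" once the inner loop has converged
```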
arXiv Detail & Related papers (2023-05-10T17:54:51Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that chain-of-thought prompting with lectures and explanations improves question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- What Does My QA Model Know? Devising Controlled Probes using Expert Knowledge [36.13528043657398]
We investigate whether state-of-the-art QA models have general knowledge about word definitions and general taxonomic reasoning.
We use a methodology for automatically building datasets from various types of expert knowledge.
Our evaluation confirms that transformer-based QA models are already predisposed to recognize certain types of structural lexical knowledge.
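An illustrative version of that recipe, using a toy glossary as a stand-in for expert resources such as WordNet: definitions are turned into controlled multiple-choice probes of definitional knowledge. The glossary and probe format are invented examples, not the paper's construction.

```python
# Toy dataset construction: turn expert definitions into controlled probes.

glossary = {
    "thermometer": "an instrument for measuring temperature",
    "barometer": "an instrument for measuring atmospheric pressure",
}

def build_probes(glossary):
    probes = []
    words = list(glossary)
    for word, definition in glossary.items():
        distractors = [w for w in words if w != word]
        probes.append({
            "question": f"Which word means '{definition}'?",
            "choices": sorted([word] + distractors),
            "answer": word,
        })
    return probes

for p in build_probes(glossary):
    print(p["question"], p["choices"], "->", p["answer"])
```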
arXiv Detail & Related papers (2019-12-31T15:05:54Z)