MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge
- URL: http://arxiv.org/abs/2412.17032v2
- Date: Tue, 28 Jan 2025 16:28:10 GMT
- Title: MINTQA: A Multi-Hop Question Answering Benchmark for Evaluating LLMs on New and Tail Knowledge
- Authors: Jie He, Nan Hu, Wanqiu Long, Jiaoyan Chen, Jeff Z. Pan
- Abstract summary: MINTQA is a benchmark to evaluate large language models' capabilities in multi-hop reasoning.
MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge.
Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries.
- Score: 24.66666826440994
- Abstract: Large language models (LLMs) have demonstrated impressive capabilities in various reasoning tasks but face significant challenges with complex, knowledge-intensive multi-hop queries, particularly those involving new or long-tail knowledge. Existing benchmarks often fail to fully address these challenges. To bridge this gap, we introduce MINTQA (Multi-hop Question Answering on New and Tail Knowledge), a comprehensive benchmark to evaluate LLMs' capabilities in multi-hop reasoning across four critical dimensions: question handling strategy, sub-question generation, retrieval-augmented generation, and iterative or dynamic decomposition and retrieval. MINTQA comprises 10,479 question-answer pairs for evaluating new knowledge and 17,887 pairs for assessing long-tail knowledge, with each question equipped with corresponding sub-questions and answers. Our systematic evaluation of 22 state-of-the-art LLMs on MINTQA reveals significant limitations in their ability to handle complex knowledge base queries, particularly in handling new or unpopular knowledge. Our findings highlight critical challenges and offer insights for advancing multi-hop reasoning capabilities. The MINTQA benchmark is available at https://github.com/probe2/multi-hop/.
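Below is a minimal sketch of how a benchmark with this structure might be consumed. The JSONL layout, the "question"/"answer" field names, the file name, and the exact-match metric are all illustrative assumptions; the authoritative data format is defined in the repository linked above.

```python
# Hedged evaluation-loop sketch for MINTQA-style data. The JSONL layout and
# the "question"/"answer" field names are assumptions for illustration; the
# authoritative schema is in the repository linked above.
import json

def load_pairs(path: str) -> list[dict]:
    """Load question-answer records from a JSONL file, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def exact_match(prediction: str, gold: str) -> bool:
    """Case-insensitive exact match after stripping surrounding whitespace."""
    return prediction.strip().lower() == gold.strip().lower()

def evaluate(pairs: list[dict], answer_fn) -> float:
    """Score a callable (question string -> answer string) by exact match."""
    correct = sum(
        exact_match(answer_fn(p["question"]), p["answer"]) for p in pairs
    )
    return correct / len(pairs)

# Usage: plug in any model wrapper, e.g. an API-backed LLM.
# pairs = load_pairs("mintqa_new_knowledge.jsonl")   # hypothetical file name
# print(f"EM: {evaluate(pairs, my_llm):.3f}")
```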
Related papers
- An Entailment Tree Generation Approach for Multimodal Multi-Hop Question Answering with Mixture-of-Experts and Iterative Feedback Mechanism [14.479060028732803]
We argue that current methods for multi-modal multi-hop question answering still face two main challenges.
First, retrieved evidence containing a large amount of redundant information leads to a significant drop in performance.
Second, a reasoning process without interpretable reasoning steps makes it difficult for the model to discover the logical errors involved in handling complex questions.
arXiv Detail & Related papers (2024-12-08T05:47:55Z)
- LLM-Based Multi-Hop Question Answering with Knowledge Graph Integration in Evolving Environments [35.3938477255058]
This paper introduces Graph Memory-based Editing for Large Language Models (GMeLLo), a straightforward and effective method that merges the explicit knowledge representation of Knowledge Graphs with the linguistic flexibility of Large Language Models.
Our results show that GMeLLo significantly surpasses current state-of-the-art knowledge editing methods in the multi-hop question answering benchmark, MQuAKE.
arXiv Detail & Related papers (2024-08-28T16:15:45Z)
- Knowledge Tagging System on Math Questions via LLMs with Flexible Demonstration Retriever [48.5585921817745]
Large Language Models (LLMs) are used to automate the knowledge tagging task.
We show strong zero- and few-shot performance on knowledge tagging tasks for math questions.
By proposing a reinforcement learning-based demonstration retriever, we successfully exploit the great potential of different-sized LLMs.
arXiv Detail & Related papers (2024-06-19T23:30:01Z)
- Prompting Large Language Models with Knowledge Graphs for Question Answering Involving Long-tail Facts [50.06633829833144]
Large Language Models (LLMs) are effective in performing various NLP tasks, but struggle to handle tasks that require extensive, real-world knowledge.
We propose a benchmark that requires knowledge of long-tail facts for answering the involved questions.
Our experiments show that LLMs alone struggle with answering these questions, especially when the long-tail level is high or rich knowledge is required.
arXiv Detail & Related papers (2024-05-10T15:10:20Z)
- Retrieval-enhanced Knowledge Editing in Language Models for Multi-Hop Question Answering [47.199078631274745]
Large Language Models (LLMs) have shown proficiency in question-answering tasks but often struggle to integrate real-time knowledge.
We propose the Retrieval-Augmented model Editing (RAE) framework for multi-hop question answering.
arXiv Detail & Related papers (2024-03-28T17:47:19Z)
- Direct Evaluation of Chain-of-Thought in Multi-hop Reasoning with Knowledge Graphs [52.42505579545893]
Large language models (LLMs) demonstrate strong reasoning abilities when prompted to generate chain-of-thought explanations alongside answers.
We propose a novel discriminative and generative CoT evaluation paradigm to assess LLMs' knowledge of reasoning and the accuracy of the generated CoT.
arXiv Detail & Related papers (2024-02-17T05:22:56Z)
- GenDec: A robust generative Question-decomposition method for Multi-hop reasoning [32.12904215053187]
Multi-hop QA involves step-by-step reasoning to answer complex questions.
The reasoning ability of existing large language models (LLMs) in multi-hop question answering remains underexplored.
It is unclear whether LLMs follow a desired reasoning chain to reach the right final answer.
arXiv Detail & Related papers (2024-02-17T02:21:44Z)
- PokeMQA: Programmable knowledge editing for Multi-hop Question Answering [46.80110170981976]
Multi-hop question answering (MQA) is a challenging task for evaluating a machine's comprehension and reasoning abilities.
We propose PokeMQA, a framework for programmable knowledge editing in multi-hop question answering.
Specifically, we prompt LLMs to decompose knowledge-augmented multi-hop questions while interacting with a detached, trainable scope detector that modulates LLM behavior depending on external conflict signals.
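A hedged sketch of the decompose-then-answer pattern this summary describes: the prompt wording, the ask callable, and the dictionary standing in for the trained scope detector are illustrative assumptions, not PokeMQA's actual components.

```python
# Hedged sketch of the decompose-then-answer pattern described above. The
# prompt wording, the ask() callable, and the dict standing in for the
# trained scope detector are illustrative assumptions, not PokeMQA's
# actual components.
def decompose(question: str, ask) -> list[str]:
    """Prompt an LLM (via the ask callable) to emit numbered sub-questions."""
    prompt = (
        "Decompose the following multi-hop question into a numbered list "
        f"of single-hop sub-questions:\n{question}"
    )
    lines = ask(prompt).splitlines()
    return [line.split(".", 1)[1].strip() for line in lines if "." in line]

def answer_multihop(question: str, ask, edited_facts: dict) -> str:
    """Answer sub-questions in sequence, preferring edited knowledge when a
    sub-question falls in scope (dict lookup stands in for the detector)."""
    answer = ""
    for sub in decompose(question, ask):
        sub = sub.replace("#prev", answer)  # illustrative placeholder for the prior hop
        edited = edited_facts.get(sub)      # scope check: is this sub-question edited?
        answer = edited if edited is not None else ask(sub)
    return answer
```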
arXiv Detail & Related papers (2023-12-23T08:32:13Z)
- Knowledge Crosswords: Geometric Knowledge Reasoning with Large Language Models [49.23348672822087]
We propose Knowledge Crosswords, a benchmark consisting of incomplete knowledge networks bounded by structured factual constraints.
The novel setting of geometric knowledge reasoning necessitates new LM abilities beyond existing atomic/linear multi-hop QA.
We conduct extensive experiments to evaluate existing LLMs and approaches on Knowledge Crosswords.
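As a hedged illustration of what an incomplete knowledge network bounded by structured factual constraints could look like, the toy representation below uses triples with blank variables checked against a set of known facts; the encoding is an assumption for exposition, not the benchmark's actual format.

```python
# Toy illustration only: an incomplete knowledge network as triples with
# blank variables, plus a fact set acting as the structured constraints.
# The encoding is an assumption for exposition, not the benchmark's format.
network = [("?x", "capital_of", "France"), ("?x", "located_on", "?y")]
facts = {("Paris", "capital_of", "France"), ("Paris", "located_on", "Seine")}

def satisfies(assignment: dict) -> bool:
    """Check that every triple, after filling its blanks, is a known fact."""
    filled = [tuple(assignment.get(t, t) for t in triple) for triple in network]
    return all(triple in facts for triple in filled)

print(satisfies({"?x": "Paris", "?y": "Seine"}))  # True: all constraints hold
print(satisfies({"?x": "Lyon", "?y": "Seine"}))   # False: Lyon fails both triples
```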
arXiv Detail & Related papers (2023-10-02T15:43:53Z)
- Rethinking Label Smoothing on Multi-hop Question Answering [87.68071401870283]
Multi-Hop Question Answering (MHQA) is a significant area of question answering research.
In this work, we analyze the primary factors limiting the performance of multi-hop reasoning.
We propose a novel label smoothing technique, F1 Smoothing, which incorporates uncertainty into the learning process.
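The summary does not give F1 Smoothing's exact form, so below is a minimal sketch of conventional label smoothing over answer-span position targets for contrast; per the summary, the paper's variant replaces the uniform smoothing mass with uncertainty-aware weights, a refinement not reproduced here.

```python
# Conventional label smoothing over answer-span start positions. This is
# the standard baseline form; F1 Smoothing (per the summary) incorporates
# uncertainty into the smoothing distribution, which is not reproduced here.
import numpy as np

def smooth_targets(gold_idx: int, n_positions: int, eps: float = 0.1) -> np.ndarray:
    """Soften a one-hot target by spreading eps uniformly over all positions."""
    target = np.full(n_positions, eps / n_positions)
    target[gold_idx] += 1.0 - eps
    return target

# Train with cross-entropy against these soft targets instead of a hard one-hot.
print(smooth_targets(gold_idx=2, n_positions=5))  # [0.02 0.02 0.92 0.02 0.02]
```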
arXiv Detail & Related papers (2022-12-19T14:48:08Z)