FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models
- URL: http://arxiv.org/abs/2402.14116v2
- Date: Thu, 6 Jun 2024 16:41:21 GMT
- Title: FanOutQA: A Multi-Hop, Multi-Document Question Answering Benchmark for Large Language Models
- Authors: Andrew Zhu, Alyssa Hwang, Liam Dugan, Chris Callison-Burch
- Abstract summary: FanOutQA is a high-quality dataset of fan-out question-answer pairs and human-annotated decompositions, with English Wikipedia as the knowledge base.
We formulate three benchmark settings across our dataset and benchmark 7 LLMs, including GPT-4, LLaMA 2, Claude-2.1, and Mixtral-8x7B.
- Score: 37.34801677290571
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: One type of question that is commonly found in day-to-day scenarios is "fan-out" questions, complex multi-hop, multi-document reasoning questions that require finding information about a large number of entities. However, there exist few resources to evaluate this type of question-answering capability among large language models. To evaluate complex reasoning in LLMs more fully, we present FanOutQA, a high-quality dataset of fan-out question-answer pairs and human-annotated decompositions with English Wikipedia as the knowledge base. We formulate three benchmark settings across our dataset and benchmark 7 LLMs, including GPT-4, LLaMA 2, Claude-2.1, and Mixtral-8x7B, finding that contemporary models still have room to improve reasoning over inter-document dependencies in a long context. We provide our dataset and open-source tools to run models to encourage evaluation at https://fanoutqa.com
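As a rough illustration of the fan-out pattern the paper targets, the sketch below answers a fan-out question by working through a human-annotated decomposition: each sub-question is answered against retrieved Wikipedia context, and the sub-answers are then aggregated ("fanned in") to produce the final answer. The ask_llm and retrieve helpers and the toy data are hypothetical placeholders for illustration, not the official fanoutqa tooling.

```python
# Minimal sketch of answering a fan-out question via its decomposition.
# ask_llm and retrieve are hypothetical stand-ins for a real model and retriever.
from typing import Callable


def answer_fanout_question(
    question: str,
    decomposition: list[str],
    ask_llm: Callable[[str], str],
    retrieve: Callable[[str], str],
) -> str:
    # Fan-out step: answer each sub-question against its own retrieved context.
    sub_answers = []
    for sub_q in decomposition:
        context = retrieve(sub_q)  # e.g. the relevant Wikipedia passage for this entity
        sub_answers.append(ask_llm(f"Context:\n{context}\n\nQuestion: {sub_q}"))

    # Fan-in step: aggregate the per-entity answers into a single final answer.
    notes = "\n".join(f"- {q} -> {a}" for q, a in zip(decomposition, sub_answers))
    return ask_llm(f"Sub-question answers:\n{notes}\n\nOriginal question: {question}")


if __name__ == "__main__":
    # Toy run with stub callables; swap in real LLM and retrieval calls to use.
    final = answer_fanout_question(
        question="What is the combined population of the headquarters cities of these companies?",
        decomposition=[
            "What is the population of Cupertino, California?",
            "What is the population of Redmond, Washington?",
        ],
        ask_llm=lambda prompt: "stub answer",
        retrieve=lambda query: "stub Wikipedia passage",
    )
    print(final)
```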
Related papers
- RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content [13.187520657952263]
Large Language Models (LLMs) are trained on vast amounts of data, most of which is automatically scraped from the internet.
As a result, evaluating models on test splits that might have leaked into the training set is prone to misleading conclusions.
We introduce a new test dataset named RepLiQA, suited for question-answering and topic retrieval tasks.
arXiv Detail & Related papers (2024-06-17T17:52:54Z)
- Multi-LLM QA with Embodied Exploration [55.581423861790945]
We investigate the use of Multi-Embodied LLM Explorers (MELE) for question-answering in an unknown environment.
Multiple LLM-based agents independently explore and then answer queries about a household environment.
We analyze different aggregation methods to generate a single, final answer for each query.
arXiv Detail & Related papers (2024-06-16T12:46:40Z)
- Read and Think: An Efficient Step-wise Multimodal Language Model for Document Understanding and Reasoning [0.0]
Existing document understanding models tend to generate answers directly as a single word or phrase.
We use Multi-modal Large Language Models (MLLMs) to generate step-wise question-and-answer pairs for document images.
We then use the generated high-quality data to train a humanized document understanding and reasoning model, dubbed DocAssistant.
arXiv Detail & Related papers (2024-02-26T01:17:50Z)
- Towards Robust Temporal Reasoning of Large Language Models via a Multi-Hop QA Dataset and Pseudo-Instruction Tuning [73.51314109184197]
It is crucial for large language models (LLMs) to understand the concept of temporal knowledge.
We propose a complex temporal question-answering dataset Complex-TR that focuses on multi-answer and multi-hop temporal reasoning.
arXiv Detail & Related papers (2023-11-16T11:49:29Z)
- Getting MoRE out of Mixture of Language Model Reasoning Experts [71.61176122960464]
We propose a Mixture-of-Reasoning-Experts (MoRE) framework that ensembles diverse specialized language models.
We specialize the backbone language model with prompts optimized for different reasoning categories, including factual, multihop, mathematical, and commonsense reasoning.
Our human study confirms that presenting expert predictions and the answer selection process helps annotators more accurately calibrate when to trust the system's output.
arXiv Detail & Related papers (2023-05-24T02:00:51Z)
- Few-Shot Data Synthesis for Open Domain Multi-Hop Question Answering [40.86455734818704]
Few-shot learning for open domain multi-hop question answering typically relies on the in-context learning capability of large language models.
We propose a data synthesis framework for multi-hop question answering that requires less than 10 human-annotated question-answer pairs.
arXiv Detail & Related papers (2023-05-23T04:57:31Z)
- Attributed Question Answering: Evaluation and Modeling for Attributed Large Language Models [68.37431984231338]
Large language models (LLMs) have shown impressive results across a variety of tasks while requiring little or no direct supervision.
We believe the ability of an LLM to attribute the text that it generates is likely to be crucial for both system developers and users in this setting.
arXiv Detail & Related papers (2022-12-15T18:45:29Z)
- How Well Do Multi-hop Reading Comprehension Models Understand Date Information? [31.243088887839257]
The ability of multi-hop models to perform step-by-step reasoning when finding an answer to a comparison question remains unclear.
It is also unclear how questions about the internal reasoning process are useful for training and evaluating question-answering (QA) systems.
arXiv Detail & Related papers (2022-10-11T07:24:07Z)
- MultiModalQA: Complex Question Answering over Text, Tables and Images [52.25399438133274]
We present MultiModalQA: a dataset that requires joint reasoning over text, tables and images.
We create MMQA using a new framework for generating complex multi-modal questions at scale.
We then define a formal language that allows us to take questions that can be answered from a single modality, and combine them to generate cross-modal questions.
arXiv Detail & Related papers (2021-04-13T09:14:28Z)