Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents
- URL: http://arxiv.org/abs/2410.00526v1
- Date: Tue, 1 Oct 2024 09:10:00 GMT
- Title: Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents
- Authors: Shiwei Wu, Chen Zhang, Yan Gao, Qimeng Wang, Tong Xu, Yao Hu, Enhong Chen
- Abstract summary: We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA) with instructional documents.
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
- Score: 61.41316121093604
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Instructional documents are rich sources of knowledge for completing various tasks, yet their unique challenges in conversational question answering (CQA) have not been thoroughly explored. Existing benchmarks have primarily focused on basic factual question-answering from single narrative documents, making them inadequate for assessing a model's ability to comprehend complex real-world instructional documents and provide accurate step-by-step guidance in daily life. To bridge this gap, we present InsCoQA, a novel benchmark tailored for evaluating large language models (LLMs) in the context of CQA with instructional documents. Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents, reflecting the intricate and multi-faceted nature of real-world instructional tasks. Additionally, to comprehensively assess state-of-the-art LLMs on the InsCoQA benchmark, we propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
Related papers
- Beyond Relevant Documents: A Knowledge-Intensive Approach for Query-Focused Summarization using Large Language Models [27.90653125902507]
We propose a knowledge-intensive approach that reframes query-focused summarization as a knowledge-intensive task setup.
The retrieval module efficiently retrieves potentially relevant documents from a large-scale knowledge corpus.
The summarization controller seamlessly integrates a powerful large language model (LLM)-based summarizer with a carefully tailored prompt.
arXiv Detail & Related papers (2024-08-19T18:54:20Z)
- SEAM: A Stochastic Benchmark for Multi-Document Tasks [30.153949809172605]
There is currently no benchmark which measures abilities of large language models (LLMs) on multi-document tasks.
We present SEAM (a Stochastic Evaluation Approach for Multi-document tasks), a conglomerate benchmark over a diverse set of multi-document datasets.
We find that multi-document tasks pose a significant challenge for LLMs, even for state-of-the-art models with 70B parameters.
arXiv Detail & Related papers (2024-06-23T11:57:53Z)
- KIWI: A Dataset of Knowledge-Intensive Writing Instructions for Answering Research Questions [63.307317584926146]
Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents.
In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer.
We construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain.
arXiv Detail & Related papers (2024-03-06T17:16:44Z)
- PROXYQA: An Alternative Framework for Evaluating Long-Form Text Generation with Large Language Models [72.57329554067195]
ProxyQA is an innovative framework dedicated to assessing long-text generation.
It comprises in-depth human-curated meta-questions spanning various domains, each accompanied by specific proxy-questions with pre-annotated answers.
It assesses the generated content's quality through the evaluator's accuracy in addressing the proxy-questions.
arXiv Detail & Related papers (2024-01-26T18:12:25Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- Benchmarking Foundation Models with Language-Model-as-an-Examiner [47.345760054595246]
We propose a novel benchmarking framework, Language-Model-as-an-Examiner.
The LM serves as a knowledgeable examiner that formulates questions based on its knowledge and evaluates responses in a reference-free manner.
arXiv Detail & Related papers (2023-06-07T06:29:58Z)
- Recitation-Augmented Language Models [85.30591349383849]
We show that RECITE is a powerful paradigm for knowledge-intensive NLP tasks.
Specifically, we show that by utilizing recitation as the intermediate step, a recite-and-answer scheme can achieve new state-of-the-art performance.
arXiv Detail & Related papers (2022-10-04T00:49:20Z)