KIWI: A Dataset of Knowledge-Intensive Writing Instructions for
Answering Research Questions
- URL: http://arxiv.org/abs/2403.03866v1
- Date: Wed, 6 Mar 2024 17:16:44 GMT
- Title: KIWI: A Dataset of Knowledge-Intensive Writing Instructions for
Answering Research Questions
- Authors: Fangyuan Xu, Kyle Lo, Luca Soldaini, Bailey Kuehl, Eunsol Choi, David
Wadden
- Abstract summary: Large language models (LLMs) adapted to follow user instructions are now widely deployed as conversational agents.
In this work, we examine one increasingly common instruction-following task: providing writing assistance to compose a long-form answer.
We construct KIWI, a dataset of knowledge-intensive writing instructions in the scientific domain.
- Score: 63.307317584926146
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) adapted to follow user instructions are now
widely deployed as conversational agents. In this work, we examine one
increasingly common instruction-following task: providing writing assistance to
compose a long-form answer. To evaluate the capabilities of current LLMs on
this task, we construct KIWI, a dataset of knowledge-intensive writing
instructions in the scientific domain. Given a research question, an initial
model-generated answer and a set of relevant papers, an expert annotator
iteratively issues instructions for the model to revise and improve its answer.
We collect 1,260 interaction turns from 234 interaction sessions with three
state-of-the-art LLMs. Each turn includes a user instruction, a model response,
and a human evaluation of the model response. Through a detailed analysis of
the collected responses, we find that all models struggle to incorporate new
information into an existing answer, and to perform precise and unambiguous
edits. Further, we find that models struggle to judge whether their outputs
successfully followed user instructions, with accuracy at least 10 points short
of human agreement. Our findings indicate that KIWI will be a valuable resource
to measure progress and improve LLMs' instruction-following capabilities for
knowledge-intensive writing tasks.
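For concreteness, the sketch below shows one way the interaction data described in the abstract might be represented in Python. The class and field names (InteractionSession, InteractionTurn, and their attributes) are illustrative assumptions based only on the abstract, not KIWI's actual schema or release format.
```python
# Illustrative sketch only: names and fields are assumptions inferred from the
# abstract (a session has a research question, an initial model answer, and a
# set of relevant papers; each turn has a user instruction, a model response,
# and a human evaluation). This is not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List


@dataclass
class InteractionTurn:
    instruction: str       # expert annotator's revision instruction
    model_response: str    # the LLM's revised long-form answer
    human_evaluation: str  # annotator judgment of whether the instruction was followed


@dataclass
class InteractionSession:
    research_question: str      # the question the long-form answer should address
    initial_answer: str         # the initial model-generated answer
    relevant_papers: List[str]  # identifiers of the provided papers
    turns: List[InteractionTurn] = field(default_factory=list)
```
Under this reading, the dataset would comprise 234 InteractionSession records containing 1,260 InteractionTurn records in total, as reported in the abstract.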
Related papers
- Rewriting Conversational Utterances with Instructed Large Language Models [9.38751103209178]
Large language models (LLMs) can achieve state-of-the-art performance on many NLP tasks.
We study which prompts yield the most informative rewritten utterances, leading to the best retrieval performance.
The results show that rewriting conversational utterances with instructed LLMs achieves significant improvements of up to 25.2% in MRR, 31.7% in Precision@1, 27% in NDCG@3, and 11.5% in Recall@500 over state-of-the-art techniques.
arXiv Detail & Related papers (2024-10-10T10:30:28Z)
- Benchmarking Large Language Models for Conversational Question Answering in Multi-instructional Documents [61.41316121093604]
We present InsCoQA, a novel benchmark for evaluating large language models (LLMs) in the context of conversational question answering (CQA).
Sourced from extensive, encyclopedia-style instructional content, InsCoQA assesses models on their ability to retrieve, interpret, and accurately summarize procedural guidance from multiple documents.
We also propose InsEval, an LLM-assisted evaluator that measures the integrity and accuracy of generated responses and procedural instructions.
arXiv Detail & Related papers (2024-10-01T09:10:00Z)
- FollowIR: Evaluating and Teaching Information Retrieval Models to Follow Instructions [71.5977045423177]
We study the use of instructions in Information Retrieval systems.
We introduce FollowIR, a dataset containing a rigorous instruction-evaluation benchmark.
We show that it is possible for IR models to learn to follow complex instructions.
arXiv Detail & Related papers (2024-03-22T14:42:29Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- From Language Modeling to Instruction Following: Understanding the Behavior Shift in LLMs after Instruction Tuning [63.63840740526497]
We investigate how instruction tuning adjusts pre-trained models with a focus on intrinsic changes.
The impact of instruction tuning is then studied by comparing the explanations derived from the pre-trained and instruction-tuned models.
Our findings reveal three significant impacts of instruction tuning.
arXiv Detail & Related papers (2023-09-30T21:16:05Z)
- Enabling Large Language Models to Generate Text with Citations [37.64884969997378]
Large language models (LLMs) have emerged as a widely-used tool for information seeking.
Our aim is to allow LLMs to generate text with citations, improving their factual correctness and verifiability.
We propose ALCE, the first benchmark for Automatic LLMs' Citation Evaluation.
arXiv Detail & Related papers (2023-05-24T01:53:49Z)
- Read and Reap the Rewards: Learning to Play Atari with the Help of Instruction Manuals [69.76245723797368]
Read and Reward speeds up RL algorithms on Atari games by reading manuals released by the Atari game developers.
Various RL algorithms obtain significant improvement in performance and training speed when assisted by our design.
arXiv Detail & Related papers (2023-02-09T05:47:03Z)