Alexpaca: Learning Factual Clarification Question Generation Without Examples
- URL: http://arxiv.org/abs/2310.11571v3
- Date: Fri, 11 Oct 2024 22:37:24 GMT
- Title: Alexpaca: Learning Factual Clarification Question Generation Without Examples
- Authors: Matthew Toles, Yukun Huang, Zhou Yu, Luis Gravano
- Abstract summary: We present a new task that focuses on the ability to elicit missing information in multi-hop reasoning tasks.
Humans outperform GPT-4 by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline in some metrics.
- Score: 19.663171923249283
- License:
- Abstract: Real-life tasks such as giving legal or technical advice often lack complete context at the outset and can have disparate answers depending on that context. The ability to derive missing factual information by asking clarifying questions (ACQ) is an important element of real-life collaboration on such reasoning tasks. Existing factual clarification question challenges evaluate generations based on word overlap or human evaluations. Recent work explores generating a response to the clarifying question and then evaluating its utility directly. So far, these tasks have been limited to disambiguating the user's intent rather than eliciting concrete facts about the situation. The factual domain presents unique challenges, since responses to clarification questions must be factually true for accurate evaluation. To enable evaluation of clarification question generation in the factual domain, we present a new task that focuses on the ability to elicit missing information in multi-hop reasoning tasks. The task, HotpotQA-FLM, can be evaluated automatically, making it convenient for benchmarking language models. We observe that humans outperform GPT-4 by a large margin, while Llama 3 8B Instruct does not even beat the dummy baseline on some metrics. Finally, we find that by fine-tuning Llama 3 8B Instruct on its own generations, filtered via rejection sampling, we can improve information recovery by 27.6 percent.
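As a rough illustration of the rejection-sampling step described above, the sketch below filters a model's own clarification questions, keeping only those whose simulated replies recover the hidden fact; the function names, data layout, and acceptance criterion are assumptions made for illustration, not the paper's actual pipeline.

```python
# Hypothetical sketch: rejection-sample a model's own clarification questions
# to build a fine-tuning set (names and scoring are illustrative only).
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    context: str      # question plus the partial supporting facts shown to the model
    hidden_fact: str  # the missing fact a good clarification question should elicit


def rejection_sample(
    examples: List[Example],
    generate_question: Callable[[str], str],     # model being improved
    answer_question: Callable[[str, str], str],  # oracle that answers from the hidden fact
    recovers_info: Callable[[str, str], bool],   # does the reply recover the fact?
    samples_per_example: int = 8,
) -> List[dict]:
    """Keep only (context, question) pairs whose simulated reply recovers the hidden fact."""
    kept = []
    for ex in examples:
        for _ in range(samples_per_example):
            q = generate_question(ex.context)
            reply = answer_question(q, ex.hidden_fact)
            if recovers_info(reply, ex.hidden_fact):
                kept.append({"prompt": ex.context, "completion": q})
                break  # one accepted sample per example is enough for SFT
    return kept
```

The accepted (context, question) pairs would then serve as the supervised fine-tuning set for the same model.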
Related papers
- Accurate and Nuanced Open-QA Evaluation Through Textual Entailment [4.762213968673381]
We propose to study the entailment relations of answers to identify more informative and more general system answers.
The entailment-based evaluation we propose allows the assignment of bonus or partial marks by quantifying the inference gap between answers.
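A minimal sketch of how such entailment-based scoring could be implemented with an off-the-shelf NLI model is shown below; the bidirectional full/partial-credit rule is an assumption for illustration, not the paper's exact metric.

```python
# Hedged sketch: score a candidate answer against a reference with an off-the-shelf NLI model.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL)


def entails(premise: str, hypothesis: str) -> bool:
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[int(logits.argmax())] == "ENTAILMENT"


def entailment_score(candidate: str, reference: str) -> float:
    forward = entails(candidate, reference)   # candidate covers the reference answer
    backward = entails(reference, candidate)  # candidate adds nothing beyond the reference
    return 1.0 if forward and backward else 0.5 if forward or backward else 0.0
```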
arXiv Detail & Related papers (2024-05-26T21:33:27Z)
- R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face challenges.
Previous instruction tuning methods force the model to complete a sentence regardless of whether it actually possesses the relevant knowledge.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning).
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
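A simplified sketch of refusal-aware data construction in this spirit appears below; the refusal template and correctness check are placeholders, and the paper's exact templates and uncertainty handling are not reproduced here.

```python
# Hedged sketch of refusal-aware training-data construction in the spirit of R-Tuning.
from typing import Callable, List, Tuple

REFUSAL = "I am not sure I know the answer to this question."


def build_refusal_aware_data(
    qa_pairs: List[Tuple[str, str]],
    model_answer: Callable[[str], str],      # the pretrained model's own attempt
    is_correct: Callable[[str, str], bool],  # compares an attempt to the gold answer
) -> List[dict]:
    data = []
    for question, gold in qa_pairs:
        if is_correct(model_answer(question), gold):
            # The model already knows this fact: keep the gold answer as the target.
            data.append({"prompt": question, "completion": gold})
        else:
            # The model does not know it: teach it to refuse instead of guessing.
            data.append({"prompt": question, "completion": REFUSAL})
    return data
```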
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
- Clarify When Necessary: Resolving Ambiguity Through Interaction with LMs [58.620269228776294]
We propose a task-agnostic framework for resolving ambiguity by asking users clarifying questions.
We evaluate systems across three NLP applications: question answering, machine translation and natural language inference.
We find that intent-sim is robust, demonstrating improvements across a wide range of NLP tasks and LMs.
arXiv Detail & Related papers (2023-11-16T00:18:50Z)
- FreshLLMs: Refreshing Large Language Models with Search Engine Augmentation [92.43001160060376]
We study the factuality of large language models (LLMs) in the context of answering questions that test current world knowledge.
We introduce FreshQA, a novel dynamic QA benchmark encompassing a diverse range of question and answer types.
We benchmark a diverse array of both closed and open-source LLMs under a two-mode evaluation procedure that allows us to measure both correctness and hallucination.
Motivated by these results, we present FreshPrompt, a simple few-shot prompting method that substantially boosts the performance of an LLM on FreshQA.
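The sketch below illustrates the general idea of a FreshPrompt-style prompt, with retrieved, date-stamped evidence placed before the question; the template, evidence ordering, and field names here are assumptions for illustration, not the paper's released prompt.

```python
# Rough sketch of a search-augmented prompt: date-stamped evidence precedes the question,
# with the freshest snippets nearest to it (details simplified and assumed).
from typing import List, Tuple


def build_fresh_prompt(question: str, snippets: List[Tuple[str, str, str]]) -> str:
    """snippets: (iso_date, source, text), assumed already retrieved from a search engine."""
    ordered = sorted(snippets, key=lambda s: s[0])  # oldest -> newest
    evidence = "\n".join(f"[{date}] {source}: {text}" for date, source, text in ordered)
    return (
        f"{evidence}\n\n"
        f"question: {question}\n"
        "Answer using the most up-to-date evidence above.\n"
        "answer: "
    )
```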
arXiv Detail & Related papers (2023-10-05T00:04:12Z)
- Answering Subjective Induction Questions on Products by Summarizing Multi-sources Multi-viewpoints Knowledge [0.04791377777154766]
This paper proposes a new task, Answering Subjective Induction Questions on Products.
The answer to this kind of question is not unique and can be interpreted from many perspectives.
A satisfactory answer should summarize these subjective opinions from multiple sources and also provide objective knowledge.
arXiv Detail & Related papers (2023-09-12T03:27:08Z)
- A Critical Evaluation of Evaluations for Long-form Question Answering [48.51361567469683]
Long-form question answering (LFQA) enables answering a wide range of questions, but its flexibility poses enormous challenges for evaluation.
We perform the first targeted study of the evaluation of long-form answers, covering both human and automatic evaluation practices.
arXiv Detail & Related papers (2023-05-29T16:54:24Z)
- Mastering the ABCDs of Complex Questions: Answer-Based Claim Decomposition for Fine-grained Self-Evaluation [9.776667356119352]
We propose answer-based claim decomposition (ABCD), a prompting strategy that decomposes questions into true/false claims.
Using the decomposed ABCD claims, we perform fine-grained self-evaluation.
We find that GPT-3.5 has some ability to determine to what extent its answer satisfies the criteria of the input question.
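A hedged sketch of this decompose-then-verify loop is shown below; the prompts and the pass-rate aggregation are illustrative placeholders rather than the paper's exact prompts.

```python
# Illustrative sketch of answer-based claim decomposition (ABCD) style self-evaluation.
from typing import Callable, List


def abcd_self_evaluate(
    question: str,
    answer: str,
    llm: Callable[[str], str],  # any text-in/text-out LLM call
) -> float:
    # Step 1: rewrite the question as atomic true/false claims a correct answer must satisfy.
    claims_text = llm(
        "Decompose the following question into a numbered list of true/false claims "
        f"that a correct answer must satisfy:\n{question}"
    )
    claims: List[str] = [c.strip() for c in claims_text.splitlines() if c.strip()]

    # Step 2: check each claim against the produced answer.
    verdicts = [
        llm(f"Claim: {claim}\nAnswer: {answer}\nIs the claim satisfied? Reply True or False.")
        for claim in claims
    ]
    satisfied = sum(v.strip().lower().startswith("true") for v in verdicts)
    return satisfied / max(len(claims), 1)  # fraction of criteria the answer meets
```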
arXiv Detail & Related papers (2023-05-24T05:53:11Z)
- WikiWhy: Answering and Explaining Cause-and-Effect Questions [62.60993594814305]
We introduce WikiWhy, a QA dataset built around explaining why an answer is true in natural language.
WikiWhy contains over 9,000 "why" question-answer-rationale triples, grounded on Wikipedia facts across a diverse set of topics.
GPT-3 baselines achieve only 38.7% human-evaluated correctness in the end-to-end answer & explain condition.
arXiv Detail & Related papers (2022-10-21T17:59:03Z)
- Measuring and Narrowing the Compositionality Gap in Language Models [116.5228850227024]
We measure how often models can correctly answer all sub-problems but not generate the overall solution.
We present a new method, self-ask, that further improves on chain of thought.
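Below is a minimal sketch of the self-ask scaffold, in which the model poses and answers its own follow-up sub-questions before committing to a final answer; the exemplar is abridged and the exact wording is illustrative rather than the paper's released prompt.

```python
# Minimal sketch of a self-ask prompt: one abridged exemplar followed by the new question.
SELF_ASK_EXEMPLAR = """Question: Who lived longer, Theodor Haecker or Harry Vaughan Watkins?
Are follow up questions needed here: Yes.
Follow up: How old was Theodor Haecker when he died?
Intermediate answer: Theodor Haecker was 65 years old when he died.
Follow up: How old was Harry Vaughan Watkins when he died?
Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.
So the final answer is: Harry Vaughan Watkins.
"""


def self_ask_prompt(question: str) -> str:
    # The model continues from "Are follow up questions needed here:", generating its own
    # follow-ups and intermediate answers before the final answer.
    return f"{SELF_ASK_EXEMPLAR}\nQuestion: {question}\nAre follow up questions needed here:"
```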
arXiv Detail & Related papers (2022-10-07T06:50:23Z)
- ASQA: Factoid Questions Meet Long-Form Answers [35.11889930792675]
This work focuses on factoid questions that are ambiguous, that is, have different correct answers depending on interpretation.
Answers to ambiguous questions should synthesize factual information from multiple sources into a long-form summary.
We use this notion of correctness to define an automated metric of performance for ASQA.
arXiv Detail & Related papers (2022-04-12T21:58:44Z)
- Review-guided Helpful Answer Identification in E-commerce [38.276241153439955]
Product-specific community question answering platforms can greatly help address the concerns of potential customers.
The user-provided answers on such platforms often vary widely in quality.
Helpfulness votes from the community can indicate the overall quality of the answer, but they are often missing.
arXiv Detail & Related papers (2020-03-13T11:34:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.