Pragmatic Evaluation of Clarifying Questions with Fact-Level Masking
- URL: http://arxiv.org/abs/2310.11571v2
- Date: Sun, 7 Jan 2024 21:01:55 GMT
- Title: Pragmatic Evaluation of Clarifying Questions with Fact-Level Masking
- Authors: Matthew Toles, Yukun Huang, Zhou Yu, Luis Gravano
- Abstract summary: We present a definition and framework for natural language pragmatic asking of clarifying questions (PACQ).
We also present fact-level masking (FLM), a procedure for converting natural language datasets into self-supervised PACQ datasets.
Our experiments show that current zero-shot models struggle to ask questions that retrieve useful information, as compared to human annotators.
- Score: 21.480602733510256
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The ability to derive useful information by asking clarifying questions (ACQ)
is an important element of real life collaboration on reasoning tasks, such as
question answering (QA). Existing natural language ACQ challenges, however,
evaluate generations based on word overlap rather than the value of the
information itself. Word overlap is often an inappropriate metric for question
generation since many different questions could be useful in a given situation,
and a single question can be phrased many different ways. Instead, we propose
evaluating questions pragmatically based on the value of the information they
retrieve. Here we present a definition and framework for natural language
pragmatic asking of clarifying questions (PACQ), the problem of generating
questions that result in answers useful for a reasoning task. We also present
fact-level masking (FLM), a procedure for converting natural language datasets
into self-supervised PACQ datasets by omitting particular critical facts.
Finally, we generate a PACQ dataset from the HotpotQA dataset using FLM and
evaluate several zero-shot language models on it. Our experiments show that
current zero-shot models struggle to ask questions that retrieve useful
information, as compared to human annotators. These results demonstrate an
opportunity to use FLM datasets and the PACQ framework to objectively evaluate
and improve question generation and other language models.
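The fact-level masking procedure described above can be illustrated with a minimal sketch: given a multi-fact QA example (HotpotQA-style), one critical supporting fact is held out so that the remaining context is insufficient, and a model must ask a clarifying question whose answer recovers the masked fact. The field names and function below are illustrative assumptions, not the authors' actual schema or code.

```python
def fact_level_mask(example, masked_index):
    """Split a multi-fact QA example into visible context plus one held-out fact."""
    facts = example["supporting_facts"]
    masked_fact = facts[masked_index]
    visible = [f for i, f in enumerate(facts) if i != masked_index]
    return {
        "question": example["question"],
        "answer": example["answer"],
        "context": visible,         # insufficient to answer on its own
        "masked_fact": masked_fact,  # recoverable only by asking about it
    }

# Toy two-hop example in the spirit of HotpotQA (contents are invented).
example = {
    "question": "In which country was the director of Film X born?",
    "answer": "France",
    "supporting_facts": [
        "Film X was directed by Jane Doe.",
        "Jane Doe was born in France.",
    ],
}

masked = fact_level_mask(example, masked_index=1)
print(masked["context"])      # ['Film X was directed by Jane Doe.']
print(masked["masked_fact"])  # 'Jane Doe was born in France.'
```

Under the PACQ framing, a model shown only `masked["context"]` would be rewarded for asking something like "Where was Jane Doe born?", since the answer to that question supplies the missing fact.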
Related papers
- Can a Multichoice Dataset be Repurposed for Extractive Question Answering? [52.28197971066953]
We repurposed the Belebele dataset (Bandarkar et al., 2023), which was designed for multiple-choice question answering (MCQA).
We present annotation guidelines and a parallel EQA dataset for English and Modern Standard Arabic (MSA).
Our aim is to enable others to adapt our approach for the 120+ other language variants in Belebele, many of which are deemed under-resourced.
arXiv Detail & Related papers (2024-04-26T11:46:05Z)
- Diversity Enhanced Narrative Question Generation for Storybooks [4.043005183192124]
We introduce a multi-question generation model (mQG) capable of generating multiple, diverse, and answerable questions.
To validate the answerability of the generated questions, we employ a SQuAD2.0 fine-tuned question answering model.
mQG shows promising results across various evaluation metrics, among strong baselines.
arXiv Detail & Related papers (2023-10-25T08:10:04Z)
- Weakly Supervised Visual Question Answer Generation [2.7605547688813172]
We present a weakly supervised method that synthetically generates question-answer pairs procedurally from visual information and captions.
We perform an exhaustive experimental analysis on the VQA dataset and find that our model significantly outperforms SOTA methods on BLEU scores.
arXiv Detail & Related papers (2023-06-11T08:46:42Z)
- An Empirical Comparison of LM-based Question and Answer Generation Methods [79.31199020420827]
Question and answer generation (QAG) consists of generating a set of question-answer pairs given a context.
In this paper, we establish baselines with three different QAG methodologies that leverage sequence-to-sequence language model (LM) fine-tuning.
Experiments show that an end-to-end QAG model, which is computationally light at both training and inference times, is generally robust and outperforms other more convoluted approaches.
arXiv Detail & Related papers (2023-05-26T14:59:53Z)
- PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale [53.92008514395125]
PAXQA (Projecting annotations for cross-lingual (x) QA) decomposes cross-lingual QA into two stages.
We propose a novel use of lexically-constrained machine translation, in which constrained entities are extracted from the parallel bitexts.
We show that models fine-tuned on these datasets outperform prior synthetic data generation models over several extractive QA datasets.
arXiv Detail & Related papers (2023-04-24T15:46:26Z)
- Modern Question Answering Datasets and Benchmarks: A Survey [5.026863544662493]
Question Answering (QA) is one of the most important natural language processing (NLP) tasks.
It aims to use NLP technologies to generate a corresponding answer to a given question based on a massive unstructured corpus.
In this paper, we investigate influential QA datasets that have been released in the era of deep learning.
arXiv Detail & Related papers (2022-06-30T05:53:56Z)
- Would You Ask it that Way? Measuring and Improving Question Naturalness for Knowledge Graph Question Answering [20.779777536841493]
Knowledge graph question answering (KGQA) facilitates information access by leveraging structured data without requiring formal query language expertise from the user.
We create the IQN-KGQA test collection by sampling questions from existing KGQA datasets and evaluating them with regard to five different aspects of naturalness.
We find that some KGQA systems fare worse when presented with more realistic formulations of NL questions.
arXiv Detail & Related papers (2022-05-25T13:32:27Z)
- Understanding Unnatural Questions Improves Reasoning over Text [54.235828149899625]
Complex question answering (CQA) over raw text is a challenging task.
Learning an effective CQA model requires large amounts of human-annotated data.
We address the challenge of learning a high-quality programmer (parser) by projecting natural human-generated questions into unnatural machine-generated questions.
arXiv Detail & Related papers (2020-10-19T10:22:16Z)
- Inquisitive Question Generation for High Level Text Comprehension [60.21497846332531]
We introduce INQUISITIVE, a dataset of 19K questions that are elicited while a person is reading through a document.
We show that readers engage in a series of pragmatic strategies to seek information.
We evaluate question generation models based on GPT-2 and show that our model is able to generate reasonable questions.
arXiv Detail & Related papers (2020-10-04T19:03:39Z)
- ClarQ: A large-scale and diverse dataset for Clarification Question Generation [67.1162903046619]
We devise a novel bootstrapping framework that assists in the creation of a diverse, large-scale dataset of clarification questions based on post comments extracted from Stack Exchange.
We quantitatively demonstrate the utility of the newly created dataset by applying it to the downstream task of question-answering.
We release this dataset in order to foster research into the field of clarification question generation with the larger goal of enhancing dialog and question answering systems.
arXiv Detail & Related papers (2020-06-10T17:56:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.