Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that
Don't have a Definitive Answer?
- URL: http://arxiv.org/abs/2309.04635v1
- Date: Fri, 8 Sep 2023 23:12:03 GMT
- Title: Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that
Don't have a Definitive Answer?
- Authors: Ayushi Agarwal, Nisarg Patel, Neeraj Varshney, Mihir Parmar, Pavan
Mallina, Aryan Bhavin Shah, Srihari Raju Sangaraju, Tirth Patel, Nihar
Thakkar, Chitta Baral
- Abstract summary: In real-world applications, users often ask questions that don't have a definitive answer.
We introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers.
With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions.
We show that even SOTA models including GPT-3 and Flan T5 do not fare well on these tasks and lag considerably behind the human performance baseline.
- Score: 43.03399918557937
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Though state-of-the-art (SOTA) NLP systems have achieved remarkable
performance on a variety of language understanding tasks, they primarily focus
on questions that have a correct and definitive answer. However, in
real-world applications, users often ask questions that don't have a definitive
answer. Incorrectly answering such questions certainly hampers a system's
reliability and trustworthiness. Can SOTA models accurately identify such
questions and provide a reasonable response?
To investigate the above question, we introduce QnotA, a dataset consisting
of five different categories of questions that don't have definitive answers.
Furthermore, for each QnotA instance, we also provide a corresponding QA
instance, i.e., an alternate question that 'can be' answered. With this data,
we formulate three evaluation tasks that test a system's ability to 'identify',
'distinguish', and 'justify' QnotA questions. Through comprehensive
experiments, we show that even SOTA models including GPT-3 and Flan T5 do not
fare well on these tasks and lag considerably behind the human performance
baseline. We conduct a thorough analysis which further leads to several
interesting findings. Overall, we believe our work and findings will encourage
and facilitate further research in this important area and help develop more
robust models.
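The abstract does not spell out how the three tasks are posed to a model, so the following is a minimal, hypothetical Python sketch of how 'identify', 'distinguish', and 'justify' prompts could be framed for a QnotA-style question paired with an answerable alternative. The example questions, the prompt wording, and the generate callback are assumptions made for illustration, not the paper's actual protocol or data.

from typing import Callable

# Hypothetical example pair in the spirit of QnotA: a question without a
# definitive answer and an alternate, answerable counterpart.
# (Illustrative only; not taken from the actual dataset.)
QNOTA_QUESTION = "What is the best programming language?"
PAIRED_QA_QUESTION = "Which programming language was created by Guido van Rossum?"

def identify_prompt(question: str) -> str:
    """Task 1 ('identify'): does the question have a definitive answer?"""
    return (
        f"Question: {question}\n"
        "Does this question have a single definitive answer? Answer Yes or No."
    )

def distinguish_prompt(qnota: str, qa: str) -> str:
    """Task 2 ('distinguish'): pick which of the two questions is answerable."""
    return (
        f"Question A: {qnota}\nQuestion B: {qa}\n"
        "Exactly one of these questions has a definitive answer. Which one? "
        "Answer A or B."
    )

def justify_prompt(question: str) -> str:
    """Task 3 ('justify'): explain why the question lacks a definitive answer."""
    return (
        f"Question: {question}\n"
        "Explain briefly why this question does not have a definitive answer."
    )

def run_tasks(generate: Callable[[str], str]) -> dict:
    """Run all three tasks with any text-generation backend (e.g. GPT-3 or Flan-T5)."""
    return {
        "identify": generate(identify_prompt(QNOTA_QUESTION)),
        "distinguish": generate(distinguish_prompt(QNOTA_QUESTION, PAIRED_QA_QUESTION)),
        "justify": generate(justify_prompt(QNOTA_QUESTION)),
    }

if __name__ == "__main__":
    # Stub backend so the sketch runs without model access; swap in a real model call.
    echo = lambda prompt: f"<model output for: {prompt[:40]}...>"
    for task, output in run_tasks(echo).items():
        print(task, "->", output)

In this framing, 'identify' is a binary judgment on a single question, 'distinguish' is a forced choice over the QnotA/QA pair, and 'justify' is open-ended generation that a human (or the paper's evaluation) would grade for reasonableness.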
Related papers
- Which questions should I answer? Salience Prediction of Inquisitive Questions [118.097974193544]
We show that highly salient questions are empirically more likely to be answered in the same article.
We further validate our findings by showing that answering salient questions is an indicator of summarization quality in news.
arXiv Detail & Related papers (2024-04-16T21:33:05Z)
- Don't Just Say "I don't know"! Self-aligning Large Language Models for Responding to Unknown Questions with Explanations [70.6395572287422]
The proposed self-alignment method is capable of not only refusing to answer but also providing an explanation for why unknown questions are unanswerable.
We conduct disparity-driven self-curation to select qualified data for fine-tuning the LLM itself, aligning its responses to unknown questions as desired.
arXiv Detail & Related papers (2024-02-23T02:24:36Z)
- Model Analysis & Evaluation for Ambiguous Question Answering [0.0]
Question Answering models are required to generate long-form answers that often combine conflicting pieces of information.
Recent advances in the field have shown strong capabilities in generating fluent responses, but certain research questions remain unanswered.
We aim to thoroughly investigate these aspects, and provide valuable insights into the limitations of the current approaches.
arXiv Detail & Related papers (2023-05-21T15:20:20Z)
- "John is 50 years old, can his son be 65?" Evaluating NLP Models' Understanding of Feasibility [19.47954905054217]
This work focuses on a simple commonsense ability, reasoning about when an action (or its effect) is feasible.
We show that even state-of-the-art models such as GPT-3 struggle to answer the feasibility questions correctly.
arXiv Detail & Related papers (2022-10-14T02:46:06Z)
- RealTime QA: What's the Answer Right Now? [137.04039209995932]
We introduce REALTIME QA, a dynamic question answering (QA) platform that announces questions and evaluates systems on a regular basis.
We build strong baseline models upon large pretrained language models, including GPT-3 and T5.
GPT-3 tends to return outdated answers when retrieved documents do not provide sufficient information to find an answer.
arXiv Detail & Related papers (2022-07-27T07:26:01Z)
- ASQA: Factoid Questions Meet Long-Form Answers [35.11889930792675]
This work focuses on factoid questions that are ambiguous, that is, have different correct answers depending on interpretation.
Answers to ambiguous questions should synthesize factual information from multiple sources into a long-form summary.
We use this notion of correctness to define an automated metric of performance for ASQA.
arXiv Detail & Related papers (2022-04-12T21:58:44Z)
- A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers [66.11048565324468]
We present a dataset of 5,049 questions over 1,585 Natural Language Processing papers.
Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text.
We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers.
arXiv Detail & Related papers (2021-05-07T00:12:34Z)
- ProtoQA: A Question Answering Dataset for Prototypical Common-Sense Reasoning [35.6375880208001]
This paper introduces a new question answering dataset for training and evaluating common sense reasoning capabilities of artificial intelligence systems.
The training set is gathered from an existing set of questions played on the long-running international game show FAMILY-FEUD.
We also propose a generative evaluation task where a model has to output a ranked list of answers, ideally covering prototypical answers for a question.
arXiv Detail & Related papers (2020-05-02T09:40:05Z)
- SQuINTing at VQA Models: Introspecting VQA Models with Sub-Questions [66.86887670416193]
We show that state-of-the-art VQA models have comparable performance in answering perception and reasoning questions, but suffer from consistency problems.
To address this shortcoming, we propose an approach called Sub-Question-aware Network Tuning (SQuINT).
We show that SQuINT improves model consistency by 5% and marginally improves performance on the Reasoning questions in VQA, while also producing better attention maps.
arXiv Detail & Related papers (2020-01-20T01:02:36Z)