"John is 50 years old, can his son be 65?" Evaluating NLP Models'
Understanding of Feasibility
- URL: http://arxiv.org/abs/2210.07471v1
- Date: Fri, 14 Oct 2022 02:46:06 GMT
- Title: "John is 50 years old, can his son be 65?" Evaluating NLP Models'
Understanding of Feasibility
- Authors: Himanshu Gupta, Neeraj Varshney, Swaroop Mishra, Kuntal Kumar Pal,
Saurabh Arjun Sawant, Kevin Scaria, Siddharth Goyal, Chitta Baral
- Abstract summary: This work focuses on a simple commonsense ability, reasoning about when an action (or its effect) is feasible.
We show that even state-of-the-art models such as GPT-3 struggle to answer the feasibility questions correctly.
- Score: 19.47954905054217
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: In current NLP research, large-scale language models and their abilities are
being widely discussed. Some recent works have also found notable failures of
these models. Often these failure examples involve complex reasoning abilities.
This work focuses on a simple commonsense ability, reasoning about when an
action (or its effect) is feasible. We introduce FeasibilityQA, a
question-answering dataset involving binary classification (BCQ) and
multi-choice multi-correct questions (MCQ) that test understanding of
feasibility. We show that even state-of-the-art models such as GPT-3 struggle
to answer the feasibility questions correctly. Specifically, on (MCQ, BCQ)
questions, GPT-3 achieves accuracy of just (19%, 62%) and (25%, 64%) in
zero-shot and few-shot settings, respectively. We also evaluate models by
providing relevant knowledge statements required to answer the question and
find that the additional knowledge leads to a 7% gain in performance, but the
overall performance still remains low. These results make one wonder how much
commonsense knowledge about action feasibility is encoded in GPT-3 and how well
the model can reason about it.
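The evaluation setup described above can be pictured with a small sketch: a BCQ item pairs a feasibility question with a yes/no gold label and an optional knowledge statement, and the model is scored zero-shot with or without that statement prepended. The item fields, prompt wording, and the `generate` callable below are illustrative assumptions, not the released FeasibilityQA format or the authors' code.

```python
# Minimal sketch (assumed fields and prompts, not the authors' code) of scoring
# a FeasibilityQA-style binary classification (BCQ) item, with and without the
# relevant knowledge statement prepended to the prompt.
from typing import Callable

bcq_item = {
    "question": "John is 50 years old. Can his son be 65 years old?",
    "knowledge": "A biological parent is always older than their child.",
    "answer": "no",  # gold label: the described situation is not feasible
}

def build_prompt(item: dict, with_knowledge: bool) -> str:
    """Compose a zero-shot prompt, optionally prepending the knowledge statement."""
    prefix = f"Knowledge: {item['knowledge']}\n" if with_knowledge else ""
    return (
        f"{prefix}Question: {item['question']}\n"
        "Answer 'yes' if the situation is feasible, otherwise 'no'.\nAnswer:"
    )

def bcq_accuracy(items: list[dict], generate: Callable[[str], str],
                 with_knowledge: bool = False) -> float:
    """Accuracy of a text-generation model (passed in as `generate`) on BCQ items."""
    correct = 0
    for item in items:
        prediction = generate(build_prompt(item, with_knowledge)).strip().lower()
        correct += int(prediction.startswith(item["answer"]))
    return correct / len(items)
```

Comparing `bcq_accuracy(items, generate)` against `bcq_accuracy(items, generate, with_knowledge=True)` mirrors the knowledge-augmented evaluation reported above, where the added statement yields roughly a 7% gain.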
Related papers
- R-Tuning: Instructing Large Language Models to Say `I Don't Know' [66.11375475253007]
Large language models (LLMs) have revolutionized numerous domains with their impressive performance but still face challenges.
Previous instruction tuning methods force the model to complete a sentence no matter whether the model knows the knowledge or not.
We present a new approach called Refusal-Aware Instruction Tuning (R-Tuning).
Experimental results demonstrate R-Tuning effectively improves a model's ability to answer known questions and refrain from answering unknown questions.
arXiv Detail & Related papers (2023-11-16T08:45:44Z)
- A Step Closer to Comprehensive Answers: Constrained Multi-Stage Question Decomposition with Large Language Models [43.10340493000934]
We introduce the "Decompose-and-Query" framework (D&Q).
This framework guides the model to think and to utilize external knowledge, in a manner similar to ReAct.
On our ChitChatQA dataset, D&Q does not lose to ChatGPT in 67% of cases.
arXiv Detail & Related papers (2023-11-13T17:28:03Z)
- Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models [59.05769810380928]
Rephrase, Augment and Reason (RepARe) is a gradient-free framework that extracts salient details about the image using the underlying vision-language model.
We show that RepARe can result in a 3.85% (absolute) increase in zero-shot accuracy on VQAv2, and gains of 6.41 and 7.94 percentage points on A-OKVQA and VizWiz, respectively.
arXiv Detail & Related papers (2023-10-09T16:57:57Z)
- Can NLP Models 'Identify', 'Distinguish', and 'Justify' Questions that Don't have a Definitive Answer? [43.03399918557937]
In real-world applications, users often ask questions that don't have a definitive answer.
We introduce QnotA, a dataset consisting of five different categories of questions that don't have definitive answers.
With this data, we formulate three evaluation tasks that test a system's ability to 'identify', 'distinguish', and 'justify' QnotA questions.
We show that even SOTA models including GPT-3 and Flan T5 do not fare well on these tasks and lag considerably behind the human performance baseline.
arXiv Detail & Related papers (2023-09-08T23:12:03Z)
- Negated Complementary Commonsense using Large Language Models [3.42658286826597]
This work focuses on finding answers to negated complementary questions in commonsense scenarios.
We propose a model-agnostic methodology to improve the performance in negated complementary scenarios.
arXiv Detail & Related papers (2023-07-13T15:03:48Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that chain-of-thought explanations improve the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Reliable Visual Question Answering: Abstain Rather Than Answer Incorrectly [100.60560477391732]
We promote a problem formulation for reliable visual question answering (VQA) in which a model may abstain rather than answer incorrectly.
We analyze both coverage, the portion of questions answered, and risk, the error on that answered portion (a minimal coverage/risk computation is sketched after this list).
We find that although the best performing models achieve over 71% accuracy on the VQA v2 dataset, introducing the option to abstain limits them to answering less than 8% of the questions in order to achieve a low risk of error (i.e., 1%).
This motivates us to utilize a multimodal selection function to directly estimate the correctness of the predicted answers, which we show can triple the coverage from, for example, 5.0% to 16.7% at 1% risk.
arXiv Detail & Related papers (2022-04-28T16:51:27Z)
- A New Score for Adaptive Tests in Bayesian and Credal Networks [64.80185026979883]
A test is adaptive when its sequence and number of questions are dynamically tuned on the basis of the estimated skills of the test taker.
We present an alternative family of scores, based on the mode of the posterior probabilities, and hence easier to explain.
arXiv Detail & Related papers (2021-05-25T20:35:42Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
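The coverage/risk trade-off described in the "Reliable Visual Question Answering" entry above can be made concrete with a short sketch. The confidence scores, correctness flags, and threshold sweep below are illustrative assumptions, not that paper's selection function.

```python
# Minimal sketch (assumptions, not the paper's implementation) of the
# coverage/risk trade-off behind selective ("abstain rather than answer") VQA.
import numpy as np

def coverage_and_risk(confidences: np.ndarray, is_correct: np.ndarray,
                      threshold: float) -> tuple[float, float]:
    """Answer only when confidence >= threshold; return (coverage, risk)."""
    answered = confidences >= threshold
    coverage = float(answered.mean())                # portion of questions answered
    if not answered.any():
        return 0.0, 0.0
    risk = float(1.0 - is_correct[answered].mean())  # error rate on the answered portion
    return coverage, risk

def max_coverage_at_risk(confidences: np.ndarray, is_correct: np.ndarray,
                         max_risk: float = 0.01) -> float:
    """Largest coverage achievable while keeping risk at or below `max_risk`."""
    best = 0.0
    for t in np.unique(confidences):
        cov, risk = coverage_and_risk(confidences, is_correct, t)
        if risk <= max_risk:
            best = max(best, cov)
    return best
```

A better selection function (such as the multimodal one mentioned in that entry) concentrates correct answers at higher confidence, which is what allows coverage at a fixed 1% risk to grow severalfold.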
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.