Negated Complementary Commonsense using Large Language Models
- URL: http://arxiv.org/abs/2307.06794v1
- Date: Thu, 13 Jul 2023 15:03:48 GMT
- Title: Negated Complementary Commonsense using Large Language Models
- Authors: Navid Rezaei, Marek Z. Reformat
- Abstract summary: This work focuses on finding answers to negated complementary questions in commonsense scenarios.
We propose a model-agnostic methodology to improve performance in negated complementary scenarios.
- Score: 3.42658286826597
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Larger language models, such as GPT-3, have been shown to excel at many
tasks. However, we demonstrate that out-of-the-ordinary questions can throw the
model off guard. This work focuses on finding answers to negated complementary
questions in commonsense scenarios. We illustrate how such questions adversely
affect model responses. We propose a model-agnostic methodology to improve
performance in negated complementary scenarios. Our method outperforms
few-shot generation from GPT-3 (by more than 11 points) and, more importantly,
highlights the significance of studying the responses of large language models
to negated complementary questions. The code, data, and experiments are
available under: https://github.com/navidre/negated_complementary_commonsense.
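The abstract does not spell out the proposed methodology, but the failure mode it studies is easy to probe. Below is a minimal sketch, assuming the pre-1.0 `openai` Python client and a GPT-3-era completion model; the few-shot exemplars and test questions are illustrative stand-ins, not the paper's data or method.

```python
# A minimal probe of a question and its negated complementary form,
# assuming the pre-1.0 `openai` client. Exemplars are illustrative.
import openai

FEW_SHOT = (
    "Q: Name something people usually do at a funeral.\n"
    "A: mourn the deceased\n\n"
    "Q: Name something people usually do not do at a funeral.\n"
    "A: laugh loudly\n\n"
)

def ask(question: str) -> str:
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=FEW_SHOT + f"Q: {question}\nA:",
        max_tokens=32,
        temperature=0.0,
    )
    return response["choices"][0]["text"].strip()

# Negation flips the expected answer set; comparing the two responses
# surfaces the failure mode the paper reports.
print(ask("Name something people usually do at a wedding."))
print(ask("Name something people usually do not do at a wedding."))
```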
Related papers
- Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of RALMs is that retrieved information helps model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
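The summary leaves the mitigation implicit; one safeguard used in this line of work is to screen retrieved passages with a natural language inference model before they reach the generator. A hedged sketch, assuming the HuggingFace `transformers` library and the public roberta-large-mnli checkpoint; treating entailment of the question as a relevance proxy is a simplification, not the paper's exact procedure.

```python
# NLI-based filtering of retrieved context: drop passages the NLI model
# scores as unrelated to the question. Threshold and checkpoint are
# illustrative choices.
from transformers import pipeline

nli = pipeline("text-classification", model="roberta-large-mnli")

def keep_relevant(question: str, passages: list[str],
                  threshold: float = 0.5) -> list[str]:
    kept = []
    for passage in passages:
        # Premise = retrieved passage, hypothesis = question.
        scores = nli({"text": passage, "text_pair": question}, top_k=None)
        entailment = next(s["score"] for s in scores
                          if s["label"] == "ENTAILMENT")
        if entailment >= threshold:
            kept.append(passage)
    return kept
```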
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
- Reliability Check: An Analysis of GPT-3's Response to Sensitive Topics and Prompt Wording [0.0]
We analyze what confuses GPT-3: how the model responds to certain sensitive topics and what effect the prompt wording has on the model's responses.
We find that GPT-3 correctly disagrees with obvious Conspiracies and Stereotypes but makes mistakes with common Misconceptions and Controversies.
The model responses are inconsistent across prompts and settings, highlighting GPT-3's unreliability.
arXiv Detail & Related papers (2023-06-09T19:07:31Z)
- "John is 50 years old, can his son be 65?" Evaluating NLP Models' Understanding of Feasibility [19.47954905054217]
This work focuses on a simple commonsense ability, reasoning about when an action (or its effect) is feasible.
We show that even state-of-the-art models such as GPT-3 struggle to answer the feasibility questions correctly.
arXiv Detail & Related papers (2022-10-14T02:46:06Z)
- Measuring and Narrowing the Compositionality Gap in Language Models [116.5228850227024]
We measure how often models can correctly answer all sub-problems but not generate the overall solution.
We present a new method, self-ask, that further improves on chain of thought.
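The self-ask scaffold is concrete enough to sketch: the prompt demonstrates explicit follow-up questions and intermediate answers before the final answer, and the model imitates that structure on a new question. In the sketch below, `complete` is a placeholder for any prompt-to-text LLM call; the one-shot exemplar follows the format reported for self-ask.

```python
# Self-ask prompting: the model decomposes the question into follow-ups
# it answers itself before composing the final answer. `complete` is a
# placeholder for an arbitrary LLM call, not a real API.
SELF_ASK_PROMPT = """\
Question: Who lived longer, Theodor Haecker or Harry Vaughan Watkins?
Are follow up questions needed here: Yes.
Follow up: How old was Theodor Haecker when he died?
Intermediate answer: Theodor Haecker was 65 years old when he died.
Follow up: How old was Harry Vaughan Watkins when he died?
Intermediate answer: Harry Vaughan Watkins was 69 years old when he died.
So the final answer is: Harry Vaughan Watkins

Question: {question}
Are follow up questions needed here:"""

def self_ask(question: str, complete) -> str:
    continuation = complete(SELF_ASK_PROMPT.format(question=question))
    # The final line of the scaffold carries the composed answer.
    return continuation.split("So the final answer is:")[-1].strip()
```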
arXiv Detail & Related papers (2022-10-07T06:50:23Z)
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering [124.16250115608604]
We present Science Question Answering (SQA), a new benchmark that consists of 21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations.
We show that SQA improves question answering performance by 1.20% for few-shot GPT-3 and 3.99% for fine-tuned UnifiedQA.
Our analysis further shows that language models, similar to humans, benefit from explanations: they learn from less data and achieve the same performance with just 40% of the data.
arXiv Detail & Related papers (2022-09-20T07:04:24Z)
- Elaboration-Generating Commonsense Question Answering at Scale [77.96137534751445]
In question answering requiring common sense, language models (e.g., GPT-3) have been used to generate text expressing background knowledge.
We fine-tune smaller language models to generate useful intermediate context, referred to here as elaborations.
Our framework alternates between updating two language models -- an elaboration generator and an answer predictor -- allowing each to influence the other.
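Only the shape of that alternation is given in this summary, but it can be written down as a training skeleton. Everything below is a hypothetical placeholder (the objects, their `generate`/`update`/`score` methods, and the reward signal); the paper's actual objectives and sampling details are not specified here.

```python
# Skeleton of alternating updates between an elaboration generator and an
# answer predictor. All names and methods are hypothetical placeholders.
def alternate_training(generator, predictor, dataset, rounds: int = 3):
    for _ in range(rounds):
        # 1) Generator proposes an elaboration (background text) per question.
        elaborations = [generator.generate(ex.question) for ex in dataset]
        # 2) Predictor learns to answer given question + elaboration.
        predictor.update(dataset, elaborations)
        # 3) Generator is pushed toward elaborations that helped the
        #    predictor reach the correct answer, closing the loop.
        rewards = [predictor.score(ex, el)
                   for ex, el in zip(dataset, elaborations)]
        generator.update(dataset, elaborations, rewards)
    return generator, predictor
```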
arXiv Detail & Related papers (2022-09-02T18:32:09Z)
- Exploring Strategies for Generalizable Commonsense Reasoning with Pre-trained Models [62.28551903638434]
We measure the impact of three different adaptation methods on the generalization and accuracy of models.
Experiments with two models show that fine-tuning performs best, by learning both the content and the structure of the task, but suffers from overfitting and limited generalization to novel answers.
We observe that alternative adaptation methods like prefix-tuning have comparable accuracy, but generalize better to unseen answers and are more robust to adversarial splits.
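Prefix-tuning itself is easy to sketch with current tooling. A minimal sketch using the HuggingFace `peft` library (which postdates the paper) with GPT-2 as a stand-in base model; the prefix length is an illustrative choice, not the paper's configuration.

```python
# Prefix-tuning: train a short sequence of virtual tokens while the base
# model stays frozen. Checkpoint and hyperparameters are illustrative.
from transformers import AutoModelForCausalLM
from peft import PrefixTuningConfig, TaskType, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2")
config = PrefixTuningConfig(
    task_type=TaskType.CAUSAL_LM,
    num_virtual_tokens=20,  # length of the trainable prefix
)
model = get_peft_model(base, config)

# Only the prefix parameters receive gradients, which is one intuition
# for why such methods can generalize better to unseen answers.
model.print_trainable_parameters()
```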
arXiv Detail & Related papers (2021-09-07T03:13:06Z)
- How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering [80.82194311274694]
We examine the question "how can we know when language models know, with confidence, the answer to a particular query?"
We examine three strong generative models -- T5, BART, and GPT-2 -- and study whether their probabilities on QA tasks are well calibrated.
We then examine methods to calibrate such models to make their confidence scores correlate better with the likelihood of correctness.
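One standard post-hoc method in this setting is temperature scaling: fit a single scalar on held-out data so that the scaled probabilities better track correctness. A minimal sketch with NumPy and SciPy; the variable names and the choice of temperature scaling as the calibrator are illustrative, not the paper's full method set.

```python
# Temperature scaling: predictions are unchanged; only the confidence
# attached to them is rescaled by a scalar fit on validation data.
import numpy as np
from scipy.optimize import minimize_scalar

def nll(temperature: float, logits: np.ndarray, labels: np.ndarray) -> float:
    # Average negative log-likelihood of the true labels after scaling.
    scaled = logits / temperature
    log_probs = scaled - np.logaddexp.reduce(scaled, axis=1, keepdims=True)
    return -log_probs[np.arange(len(labels)), labels].mean()

def fit_temperature(val_logits: np.ndarray, val_labels: np.ndarray) -> float:
    result = minimize_scalar(
        lambda t: nll(t, val_logits, val_labels),
        bounds=(0.05, 10.0), method="bounded",
    )
    return result.x
```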
arXiv Detail & Related papers (2020-12-02T03:53:13Z)
- Better Distractions: Transformer-based Distractor Generation and Multiple Choice Question Filtering [4.168157981135697]
We train a GPT-2 language model to generate three distractors for a given question and text context.
Next, we train a BERT language model to answer multiple choice questions (MCQs) and use this model as a filter.
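The generate-then-filter pipeline can be sketched with off-the-shelf checkpoints. One substitution to note: the paper trains a BERT multiple-choice model as its filter, whereas this sketch uses a stock extractive-QA pipeline to reject distractors that coincide with the predicted correct answer; the checkpoints and prompt format are assumptions.

```python
# Generate candidate distractors with GPT-2, then filter out any that a
# QA model would accept as the correct answer. Checkpoints and the
# "Wrong answer:" prompt format are illustrative assumptions.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
answerer = pipeline("question-answering",
                    model="distilbert-base-cased-distilled-squad")

def propose_distractors(context: str, question: str, n: int = 3) -> list[str]:
    prompt = f"{context}\nQuestion: {question}\nWrong answer:"
    outputs = generator(prompt, max_new_tokens=8, do_sample=True,
                        num_return_sequences=n, pad_token_id=50256)
    return [o["generated_text"][len(prompt):].strip().split("\n")[0]
            for o in outputs]

def filter_distractors(context: str, question: str,
                       distractors: list[str]) -> list[str]:
    predicted = answerer(question=question, context=context)["answer"].lower()
    return [d for d in distractors if d.lower() != predicted]
```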
arXiv Detail & Related papers (2020-10-19T15:23:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.