Reliability Check: An Analysis of GPT-3's Response to Sensitive Topics and Prompt Wording
- URL: http://arxiv.org/abs/2306.06199v1
- Date: Fri, 9 Jun 2023 19:07:31 GMT
- Title: Reliability Check: An Analysis of GPT-3's Response to Sensitive Topics and Prompt Wording
- Authors: Aisha Khatun and Daniel G. Brown
- Abstract summary: We analyze what confuses GPT-3: how the model responds to certain sensitive topics and what effects the prompt wording has on the model response.
We find that GPT-3 correctly disagrees with obvious Conspiracies and Stereotypes but makes mistakes with common Misconceptions and Controversies.
The model responses are inconsistent across prompts and settings, highlighting GPT-3's unreliability.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) have become mainstream technology with their
versatile use cases and impressive performance. Despite the countless
out-of-the-box applications, LLMs are still not reliable. A lot of work is
being done to improve the factual accuracy, consistency, and ethical standards
of these models through fine-tuning, prompting, and Reinforcement Learning from
Human Feedback (RLHF), but no systematic analysis of the responses of these
models to different categories of statements, or of their potential
vulnerabilities to simple prompting changes, is available. In this work, we
analyze what confuses GPT-3: how the model responds to certain sensitive topics
and what effects the prompt wording has on the model response. We find that
GPT-3 correctly disagrees with obvious Conspiracies and Stereotypes but makes
mistakes with common Misconceptions and Controversies. The model responses are
inconsistent across prompts and settings, highlighting GPT-3's unreliability.
The dataset and code for our analysis are available at
https://github.com/tanny411/GPT3-Reliability-Check.
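
A minimal sketch of the kind of probe the abstract describes, assuming a generic completion API (query_model below is a hypothetical placeholder, not the authors' released code): the same statement is wrapped in several prompt wordings, the model is queried with each wording, and its yes/no verdicts are compared for consistency.

    # Sketch only: probe prompt-wording sensitivity for category-labelled statements.
    # `query_model` is a hypothetical stand-in for a GPT-3 completion call.
    from collections import Counter

    def query_model(prompt: str) -> str:
        """Hypothetical wrapper around a completion API; replace with a real call."""
        return "no"  # canned answer so the sketch runs end to end

    # Statements drawn from the kinds of categories the paper studies.
    statements = {
        "Conspiracy": "The moon landing was staged.",
        "Misconception": "Humans use only 10% of their brains.",
    }

    # Different wordings of the same underlying question.
    templates = [
        "Is the following statement true? {s} Answer yes or no:",
        "Do you agree with the following? {s} Answer yes or no:",
        "{s} Is this statement correct? Answer yes or no:",
    ]

    for category, statement in statements.items():
        answers = [query_model(t.format(s=statement)).strip().lower() for t in templates]
        consistent = len(Counter(answers)) == 1
        print(f"{category}: answers={answers}, consistent={consistent}")

Per-category accuracy would then follow by comparing the majority verdict against the expected answer for each category, which is how inconsistency across wordings becomes visible.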
Related papers
- WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models [35.088946378980914]
We run our benchmark on three state-of-the-art chat LLMs (GPT-3.5, GPT-4, and Llama2-chat).
We show that these models make errors even with as few as three objects.
Errors persist even with chain-of-thought prompting and in-context learning.
arXiv Detail & Related papers (2023-11-27T15:38:17Z)
- Negated Complementary Commonsense using Large Language Models [3.42658286826597]
This work focuses on finding answers to negated complementary questions in commonsense scenarios.
We propose a model-agnostic methodology to improve performance in negated complementary scenarios.
arXiv Detail & Related papers (2023-07-13T15:03:48Z)
- SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models [55.60306377044225]
"SelfCheckGPT" is a simple sampling-based approach to fact-check the responses of black-box models; a simplified sketch of this sampling idea appears after this list.
We investigate this approach by using GPT-3 to generate passages about individuals from the WikiBio dataset.
arXiv Detail & Related papers (2023-03-15T19:31:21Z)
- Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show an enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these summaries also avoid common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)
- The Unreliability of Explanations in Few-Shot In-Context Learning [50.77996380021221]
We focus on two NLP tasks that involve reasoning over text, namely question answering and natural language inference.
We show that explanations judged as good by humans (those that are logically consistent with the input) usually indicate more accurate predictions.
We present a framework for calibrating model predictions based on the reliability of the explanations.
arXiv Detail & Related papers (2022-05-06T17:57:58Z)
- AES Systems Are Both Overstable And Oversensitive: Explaining Why And Proposing Defenses [66.49753193098356]
We investigate the reason behind the surprising adversarial brittleness of scoring models.
Our results indicate that autoscoring models, despite being trained as "end-to-end" models, behave like bag-of-words models.
We propose detection-based protection models that can detect oversensitivity- and overstability-causing samples with high accuracy.
arXiv Detail & Related papers (2021-09-24T03:49:38Z)
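
The SelfCheckGPT entry above refers to a sampling-based consistency check. The toy sketch below is an assumption-laden simplification, not the released SelfCheckGPT scorer: it flags sentences of a main response that are poorly supported by additional stochastic samples, using plain token overlap as a stand-in for the learned consistency measures used in that paper.

    # Toy sketch of sampling-based self-consistency checking (simplified assumption,
    # not the SelfCheckGPT library): low support across samples suggests hallucination.
    import re

    def token_overlap(sentence: str, passage: str) -> float:
        """Fraction of the sentence's word tokens that also appear in the passage."""
        sent = set(re.findall(r"\w+", sentence.lower()))
        passage_tokens = set(re.findall(r"\w+", passage.lower()))
        return len(sent & passage_tokens) / max(len(sent), 1)

    def self_check(main_response: str, samples: list[str], threshold: float = 0.5):
        """Return (sentence, support) pairs whose mean support across samples is below threshold."""
        flagged = []
        for sentence in re.split(r"(?<=[.!?])\s+", main_response.strip()):
            support = sum(token_overlap(sentence, s) for s in samples) / len(samples)
            if support < threshold:
                flagged.append((sentence, round(support, 2)))
        return flagged

    # Toy usage: the samples would normally be extra completions drawn from the
    # same model, for the same prompt, at non-zero temperature.
    main = "Alan Turing was born in 1912. He won the Nobel Prize in Physics."
    samples = [
        "Alan Turing was born in London in 1912 and is known for his work on computability.",
        "Born in 1912, Alan Turing pioneered theoretical computer science.",
    ]
    print(self_check(main, samples))  # the unsupported Nobel Prize sentence is flagged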