Prompting GPT-3 To Be Reliable
- URL: http://arxiv.org/abs/2210.09150v1
- Date: Mon, 17 Oct 2022 14:52:39 GMT
- Title: Prompting GPT-3 To Be Reliable
- Authors: Chenglei Si, Zhe Gan, Zhengyuan Yang, Shuohang Wang, Jianfeng Wang,
Jordan Boyd-Graber, Lijuan Wang
- Abstract summary: This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
- Score: 117.23966502293796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) show impressive abilities via few-shot
prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use
in real-world language applications. However, existing research focuses on
models' accuracy on standard benchmarks and largely ignores their reliability,
which is crucial for avoiding catastrophic real-world harms. While reliability
is a broad and vaguely defined term, this work decomposes reliability into four
facets: generalizability, fairness, calibration, and factuality. We establish
simple and effective prompts to demonstrate GPT-3's reliability in these four
aspects: 1) generalize out-of-domain, 2) balance demographic distribution to
reduce social biases, 3) calibrate language model probabilities, and 4) update
the LLM's knowledge. We find that by employing appropriate prompts, GPT-3
outperforms smaller-scale supervised models by large margins on all these
facets. We release all processed datasets, evaluation scripts, and model
predictions to facilitate future analysis. Our findings not only shed new
insights on the reliability of prompting LLMs, but more importantly, our
prompting strategies can help practitioners more reliably use large language
models like GPT-3.
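As a concrete illustration of the calibration facet (facet 3 above), the sketch below computes the expected calibration error (ECE) of a set of predictions from per-answer confidences and correctness labels. It is a minimal, self-contained example with made-up numbers standing in for GPT-3's answer probabilities, not the authors' released evaluation code.

```python
# Minimal sketch of the "calibration" facet: given a model's confidence for
# each answer and whether that answer was correct, compute the expected
# calibration error (ECE) over equal-width confidence bins.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin weight) * |bin accuracy - bin confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bin_edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            bin_acc = correct[mask].mean()    # accuracy within this bin
            bin_conf = confidences[mask].mean()  # average confidence within this bin
            ece += mask.mean() * abs(bin_acc - bin_conf)
    return ece

# Hypothetical per-example confidences (e.g. the probability the model assigns
# to its chosen answer) and correctness labels; values are illustrative only.
confs = [0.95, 0.80, 0.99, 0.60, 0.72, 0.88]
hits  = [1,    1,    1,    0,    1,    0]
print(f"ECE: {expected_calibration_error(confs, hits):.3f}")
```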
Related papers
- Can LLMs replace Neil deGrasse Tyson? Evaluating the Reliability of LLMs as Science Communicators [22.567933207841968]
Large Language Models (LLMs) and AI assistants are experiencing exponential growth in usage among both expert and amateur users.
In this work, we focus on evaluating the reliability of current LLMs as science communicators.
We introduce a novel dataset, SCiPS-QA, comprising 742 Yes/No queries embedded in complex scientific concepts.
arXiv Detail & Related papers (2024-09-21T06:48:32Z)
- Multimodal Large Language Models to Support Real-World Fact-Checking [80.41047725487645]
Multimodal large language models (MLLMs) carry the potential to support humans in processing vast amounts of information.
While MLLMs are already being used as a fact-checking tool, their abilities and limitations in this regard are understudied.
We propose a framework for systematically assessing the capacity of current multimodal models to facilitate real-world fact-checking.
arXiv Detail & Related papers (2024-03-06T11:32:41Z)
- How Easy is It to Fool Your Multimodal LLMs? An Empirical Analysis on Deceptive Prompts [54.07541591018305]
We present MAD-Bench, a benchmark containing 1,000 test samples divided into five categories, such as non-existent objects, object counts, and spatial relationships.
We provide a comprehensive analysis of popular MLLMs, ranging from GPT-4V, Reka, and Gemini-Pro to open-source models such as LLaVA-NeXT and MiniCPM-Llama3.
While GPT-4o achieves 82.82% accuracy on MAD-Bench, every other model in our experiments scores between 9% and 50%.
arXiv Detail & Related papers (2024-02-20T18:31:27Z)
- Global-Liar: Factuality of LLMs over Time and Geographic Regions [3.715487408753612]
This study evaluates the factual accuracy, stability, and biases in widely adopted GPT models, including GPT-3.5 and GPT-4.
We introduce 'Global-Liar,' a dataset uniquely balanced in terms of geographic and temporal representation.
arXiv Detail & Related papers (2024-01-31T13:57:24Z)
- Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs [60.61002524947733]
Previous confidence elicitation methods rely on white-box access to internal model information or model fine-tuning.
This leads to a growing need to explore the untapped area of black-box approaches for uncertainty estimation.
We define a systematic framework with three components: prompting strategies for eliciting verbalized confidence, sampling methods for generating multiple responses, and aggregation techniques for computing consistency.
arXiv Detail & Related papers (2023-06-22T17:31:44Z)
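The three-component framework summarized in the entry above (prompting, sampling, aggregation) can be sketched as consistency-based confidence estimation: sample the model several times and treat the agreement rate of the majority answer as a black-box confidence score. The query_model callable below is a hypothetical stand-in for an API call, not code from the cited paper.

```python
# Minimal sketch of black-box, consistency-based confidence estimation,
# assuming a hypothetical query_model(prompt) callable that returns one
# short answer string per call.
from collections import Counter
from typing import Callable, Tuple

def consistency_confidence(
    query_model: Callable[[str], str],
    prompt: str,
    n_samples: int = 10,
) -> Tuple[str, float]:
    """Sample the model several times and use the agreement rate of the
    most frequent answer as a black-box confidence score."""
    answers = [query_model(prompt).strip().lower() for _ in range(n_samples)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / n_samples

def fake_model(prompt: str) -> str:
    """Deterministic stand-in for a real LLM call, so the sketch runs as-is."""
    return "Paris"

answer, confidence = consistency_confidence(fake_model, "Capital of France?")
print(answer, confidence)  # -> paris 1.0
```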
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
- Reliability Check: An Analysis of GPT-3's Response to Sensitive Topics and Prompt Wording [0.0]
We analyze what confuses GPT-3: how the model responds to certain sensitive topics and what effects the prompt wording has on the model response.
We find that GPT-3 correctly disagrees with obvious Conspiracies and Stereotypes but makes mistakes with common Misconceptions and Controversies.
The model responses are inconsistent across prompts and settings, highlighting GPT-3's unreliability.
arXiv Detail & Related papers (2023-06-09T19:07:31Z)
- Towards Reliable Misinformation Mitigation: Generalization, Uncertainty, and GPT-4 [5.313670352036673]
We show that GPT-4 can outperform prior methods in multiple settings and languages.
We propose techniques to handle uncertainty that can detect impossible examples and strongly improve outcomes.
This research lays the groundwork for future tools that can drive real-world progress to combat misinformation.
arXiv Detail & Related papers (2023-05-24T09:10:20Z)
- Consistency Analysis of ChatGPT [65.268245109828]
This paper investigates the trustworthiness of ChatGPT and GPT-4 regarding logically consistent behaviour.
Our findings suggest that while both models appear to show enhanced language understanding and reasoning ability, they still frequently fall short of generating logically consistent predictions.
arXiv Detail & Related papers (2023-03-11T01:19:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.