Humans in Humans Out: On GPT Converging Toward Common Sense in both
Success and Failure
- URL: http://arxiv.org/abs/2303.17276v1
- Date: Thu, 30 Mar 2023 10:32:18 GMT
- Title: Humans in Humans Out: On GPT Converging Toward Common Sense in both
Success and Failure
- Authors: Philipp Koralus, Vincent Wang-Maścianica
- Abstract summary: GPT-3, GPT-3.5, and GPT-4 were trained on large quantities of human-generated text.
We show that GPT-3 produced ETR-predicted outputs for 59% of the 61 ETR61 benchmark problems.
Remarkably, the production of human-like fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in GPT-4.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Increases in computational scale and fine-tuning have led to dramatic
improvements in the quality of outputs from large language models (LLMs) like GPT.
Given that both GPT-3 and GPT-4 were trained on large quantities of
human-generated text, we might ask to what extent their outputs reflect
patterns of human thinking, both for correct and incorrect cases. The Erotetic
Theory of Reason (ETR) provides a symbolic generative model of both human
success and failure in thinking, across propositional, quantified, and
probabilistic reasoning, as well as decision-making. We presented GPT-3,
GPT-3.5, and GPT-4 with 61 central inference and judgment problems from a
recent book-length presentation of ETR, consisting of experimentally verified
data-points on human judgment and extrapolated data-points predicted by ETR,
with correct inference patterns as well as fallacies and framing effects (the
ETR61 benchmark). ETR61 includes classics like Wason's card task, illusory
inferences, the decoy effect, and opportunity-cost neglect, among others. GPT-3
showed evidence of ETR-predicted outputs for 59% of these examples, rising to
77% in GPT-3.5 and 75% in GPT-4. Remarkably, the production of human-like
fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in
GPT-4. This suggests that larger and more advanced LLMs may develop a tendency
toward more human-like mistakes, as relevant thought patterns are inherent in
human-produced training data. According to ETR, the same fundamental patterns
are involved both in successful and unsuccessful ordinary reasoning, so that
the "bad" cases could paradoxically be learned from the "good" cases. We
further present preliminary evidence that ETR-inspired prompt engineering could
reduce instances of these mistakes.
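As a rough illustration of the evaluation the abstract describes (presenting GPT models with reasoning problems and counting how often they produce the ETR-predicted response), the following is a minimal sketch, not the authors' harness: the item wording, the answer coding, the parsing rule, and the model names are assumptions, and the real ETR61 items span propositional, quantified, and probabilistic reasoning as well as decision-making.

```python
# Hypothetical sketch (not the authors' code): query GPT models with an
# ETR61-style reasoning item and tally how often the answer matches the
# ETR-predicted (typically human, sometimes fallacious) response rather than
# the normative one.
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

# One classic item named in the abstract: Wason's card selection task.
# The prompt wording and answer coding below are illustrative assumptions.
ITEMS = [
    {
        "id": "wason_selection",
        "prompt": (
            "Four cards show A, K, 4, 7. Each card has a letter on one side and "
            "a number on the other. Which cards must you turn over to test the "
            "rule 'If a card has a vowel on one side, it has an even number on "
            "the other'? Answer with the card faces only."
        ),
        "etr_predicted": {"A", "4"},  # the common human selection
        "normative": {"A", "7"},      # the logically correct selection
    },
]

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content or ""

def parse_cards(answer: str) -> set:
    # Crude extraction: keep only tokens that are one of the four card faces.
    faces = {"A", "K", "4", "7"}
    return {tok.strip(".,") for tok in answer.upper().split() if tok.strip(".,") in faces}

def score(model: str) -> None:
    etr_hits = 0
    for item in ITEMS:
        choice = parse_cards(ask(model, item["prompt"]))
        etr_hits += int(choice == item["etr_predicted"])
        print(f"{model} {item['id']}: chose {sorted(choice)}")
    print(f"{model}: {etr_hits}/{len(ITEMS)} ETR-predicted responses")

if __name__ == "__main__":
    for m in ("gpt-3.5-turbo", "gpt-4"):
        score(m)
```

In the same spirit, the ETR-inspired prompt engineering mentioned at the end of the abstract could be probed by prepending a short instruction (for example, asking the model to consider which cards could falsify the rule) and comparing the tallies; again, this is a sketch rather than the authors' procedure.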
Related papers
- Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, the Stag Hunt and the Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one systematic bias.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
- AI-enhanced Auto-correction of Programming Exercises: How Effective is GPT-3.5? [0.0]
This paper investigates the potential of AI in providing personalized code correction and generating feedback.
GPT-3.5 exhibited weaknesses in its evaluations, including mislocalizing errors and even hallucinating errors that were not present.
arXiv Detail & Related papers (2023-10-24T10:35:36Z)
- Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias [57.42417061979399]
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically.
In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs.
Our findings highlight the presence of cognitive biases in various models from the GPT-3, Mistral, and T5 families.
arXiv Detail & Related papers (2023-08-01T01:39:25Z)
- SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning [5.177947445379688]
This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis.
Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification.
The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations.
arXiv Detail & Related papers (2023-07-16T05:33:35Z)
- A negation detection assessment of GPTs: analysis with the xNot360 dataset [9.165119034384027]
Negation is a fundamental aspect of natural language, playing a critical role in communication and comprehension.
We focus on the identification of negation in natural language using a zero-shot prediction approach applied to our custom xNot360 dataset.
Our findings expose a considerable performance disparity among the GPT models, with GPT-4 surpassing its counterparts and GPT-3.5 displaying a marked performance reduction.
arXiv Detail & Related papers (2023-06-29T02:27:48Z)
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
- GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts [0.0]
Large Language Models (LLMs) have exhibited remarkable performance on various Natural Language Processing (NLP) tasks.
In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD models, by performing a thorough technical evaluation on different reasoning tasks.
arXiv Detail & Related papers (2023-05-21T14:45:17Z)
- Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction [28.58384091374763]
GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks.
We perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks.
We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting.
arXiv Detail & Related papers (2023-03-25T03:08:49Z)
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)