Humans in Humans Out: On GPT Converging Toward Common Sense in both
Success and Failure
- URL: http://arxiv.org/abs/2303.17276v1
- Date: Thu, 30 Mar 2023 10:32:18 GMT
- Title: Humans in Humans Out: On GPT Converging Toward Common Sense in both
Success and Failure
- Authors: Philipp Koralus, Vincent Wang-Maścianica
- Abstract summary: GPT-3, GPT-3.5, and GPT-4 were trained on large quantities of human-generated text.
We show that GPT-3 produced ETR-predicted outputs for 59% of the 61 ETR61 benchmark problems.
Remarkably, the production of human-like fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in GPT-4.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Increases in computational scale and fine-tuning have led to dramatic
improvements in the quality of outputs from large language models (LLMs) like GPT.
Given that both GPT-3 and GPT-4 were trained on large quantities of
human-generated text, we might ask to what extent their outputs reflect
patterns of human thinking, both for correct and incorrect cases. The Erotetic
Theory of Reason (ETR) provides a symbolic generative model of both human
success and failure in thinking, across propositional, quantified, and
probabilistic reasoning, as well as decision-making. We presented GPT-3,
GPT-3.5, and GPT-4 with 61 central inference and judgment problems from a
recent book-length presentation of ETR, consisting of experimentally verified
data-points on human judgment and extrapolated data-points predicted by ETR,
with correct inference patterns as well as fallacies and framing effects (the
ETR61 benchmark). ETR61 includes classics like Wason's card task, illusory
inferences, the decoy effect, and opportunity-cost neglect, among others. GPT-3
showed evidence of ETR-predicted outputs for 59% of these examples, rising to
77% in GPT-3.5 and 75% in GPT-4. Remarkably, the production of human-like
fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in
GPT-4. This suggests that larger and more advanced LLMs may develop a tendency
toward more human-like mistakes, as relevant thought patterns are inherent in
human-produced training data. According to ETR, the same fundamental patterns
are involved both in successful and unsuccessful ordinary reasoning, so that
the "bad" cases could paradoxically be learned from the "good" cases. We
further present preliminary evidence that ETR-inspired prompt engineering could
reduce instances of these mistakes.
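As a rough illustration of the evaluation the abstract describes (presenting GPT models with reasoning problems and counting how often they produce the ETR-predicted response), the following is a minimal sketch, not the authors' harness: the item wording, the answer coding, the parsing rule, and the model names are assumptions, and the real ETR61 items span propositional, quantified, and probabilistic reasoning as well as decision-making.

```python
# Hypothetical sketch (not the authors' code): query GPT models with an
# ETR61-style reasoning item and tally how often the answer matches the
# ETR-predicted (typically human, sometimes fallacious) response rather than
# the normative one.
from openai import OpenAI  # pip install openai; requires OPENAI_API_KEY

client = OpenAI()

# One classic item named in the abstract: Wason's card selection task.
# The prompt wording and answer coding below are illustrative assumptions.
ITEMS = [
    {
        "id": "wason_selection",
        "prompt": (
            "Four cards show A, K, 4, 7. Each card has a letter on one side and "
            "a number on the other. Which cards must you turn over to test the "
            "rule 'If a card has a vowel on one side, it has an even number on "
            "the other'? Answer with the card faces only."
        ),
        "etr_predicted": {"A", "4"},  # the common human selection
        "normative": {"A", "7"},      # the logically correct selection
    },
]

def ask(model: str, prompt: str) -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return resp.choices[0].message.content or ""

def parse_cards(answer: str) -> set:
    # Crude extraction: keep only tokens that are one of the four card faces.
    faces = {"A", "K", "4", "7"}
    return {tok.strip(".,") for tok in answer.upper().split() if tok.strip(".,") in faces}

def score(model: str) -> None:
    etr_hits = 0
    for item in ITEMS:
        choice = parse_cards(ask(model, item["prompt"]))
        etr_hits += int(choice == item["etr_predicted"])
        print(f"{model} {item['id']}: chose {sorted(choice)}")
    print(f"{model}: {etr_hits}/{len(ITEMS)} ETR-predicted responses")

if __name__ == "__main__":
    for m in ("gpt-3.5-turbo", "gpt-4"):
        score(m)
```

In the same spirit, the ETR-inspired prompt engineering mentioned at the end of the abstract could be probed by prepending a short instruction (for example, asking the model to consider which cards could falsify the rule) and comparing the tallies; again, this is a sketch rather than the authors' procedure.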
Related papers
- Are Large Language Models Strategic Decision Makers? A Study of Performance and Bias in Two-Player Non-Zero-Sum Games [56.70628673595041]
Large Language Models (LLMs) have been increasingly used in real-world settings, yet their strategic decision-making abilities remain largely unexplored.
This work investigates the performance and merits of LLMs in canonical game-theoretic two-player non-zero-sum games, the Stag Hunt and the Prisoner's Dilemma.
Our structured evaluation of GPT-3.5, GPT-4-Turbo, GPT-4o, and Llama-3-8B shows that these models, when making decisions in these games, are affected by at least one systematic bias.
arXiv Detail & Related papers (2024-07-05T12:30:02Z)
- AI-enhanced Auto-correction of Programming Exercises: How Effective is GPT-3.5? [0.0]
This paper investigates the potential of AI in providing personalized code correction and generating feedback.
GPT-3.5 exhibited weaknesses in its evaluations, including mislocalizing errors and even hallucinating errors that were not present.
arXiv Detail & Related papers (2023-10-24T10:35:36Z)
- Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias [57.42417061979399]
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically.
In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs.
Our findings highlight the presence of cognitive biases in various models from the GPT-3, Mistral, and T5 families.
arXiv Detail & Related papers (2023-08-01T01:39:25Z)
- SentimentGPT: Exploiting GPT for Advanced Sentiment Analysis and its Departure from Current Machine Learning [5.177947445379688]
This study presents a thorough examination of various Generative Pretrained Transformer (GPT) methodologies in sentiment analysis.
Three primary strategies are employed: 1) prompt engineering using the advanced GPT-3.5 Turbo, 2) fine-tuning GPT models, and 3) an inventive approach to embedding classification.
The research yields detailed comparative insights among these strategies and individual GPT models, revealing their unique strengths and potential limitations.
arXiv Detail & Related papers (2023-07-16T05:33:35Z)
- A negation detection assessment of GPTs: analysis with the xNot360 dataset [9.165119034384027]
Negation is a fundamental aspect of natural language, playing a critical role in communication and comprehension.
We focus on the identification of negation in natural language using a zero-shot prediction approach applied to our custom xNot360 dataset.
Our findings expose a considerable performance disparity among the GPT models, with GPT-4 surpassing its counterparts and GPT-3.5 displaying a marked performance reduction.
arXiv Detail & Related papers (2023-06-29T02:27:48Z)
- DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models [92.6951708781736]
This work proposes a comprehensive trustworthiness evaluation for large language models with a focus on GPT-4 and GPT-3.5.
We find that GPT models can be easily misled to generate toxic and biased outputs and leak private information.
Our work illustrates a comprehensive trustworthiness evaluation of GPT models and sheds light on the trustworthiness gaps.
arXiv Detail & Related papers (2023-06-20T17:24:23Z)
- GPT-3.5, GPT-4, or BARD? Evaluating LLMs Reasoning Ability in Zero-Shot Setting and Performance Boosting Through Prompts [0.0]
Large Language Models (LLMs) have exhibited remarkable performance on various Natural Language Processing (NLP) tasks.
In this paper, we examine the performance of GPT-3.5, GPT-4, and BARD models, by performing a thorough technical evaluation on different reasoning tasks.
arXiv Detail & Related papers (2023-05-21T14:45:17Z)
- Analyzing the Performance of GPT-3.5 and GPT-4 in Grammatical Error Correction [28.58384091374763]
GPT-3 and GPT-4 models are powerful, achieving high performance on a variety of Natural Language Processing tasks.
We perform experiments testing the capabilities of a GPT-3.5 model (text-davinci-003) and a GPT-4 model (gpt-4-0314) on major GEC benchmarks.
We report the performance of our best prompt on the BEA-2019 and JFLEG datasets, finding that the GPT models can perform well in a sentence-level revision setting.
arXiv Detail & Related papers (2023-03-25T03:08:49Z)
- GPT-4 Technical Report [116.90398195245983]
GPT-4 is a large-scale, multimodal model which can accept image and text inputs and produce text outputs.
It exhibits human-level performance on various professional and academic benchmarks, including passing a simulated bar exam with a score around the top 10% of test takers.
arXiv Detail & Related papers (2023-03-15T17:15:04Z)
- Prompting GPT-3 To Be Reliable [117.23966502293796]
This work decomposes reliability into four facets: generalizability, fairness, calibration, and factuality.
We find that GPT-3 outperforms smaller-scale supervised models by large margins on all these facets.
arXiv Detail & Related papers (2022-10-17T14:52:39Z)
- News Summarization and Evaluation in the Era of GPT-3 [73.48220043216087]
We study how GPT-3 compares against fine-tuned models trained on large summarization datasets.
We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality.
arXiv Detail & Related papers (2022-09-26T01:04:52Z)