Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4
- URL: http://arxiv.org/abs/2404.16692v1
- Date: Thu, 25 Apr 2024 15:53:00 GMT
- Title: Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4
- Authors: Lydia Uhler, Verena Jordan, Jürgen Buder, Markus Huff, Frank Papenmeier
- Abstract summary: This study examined the problem-solving behavior of humans and OpenAI's GPT-4 large language model.
The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We explored the addition bias, a cognitive tendency to prefer adding elements over removing them to alter an initial state or structure, by conducting four preregistered experiments examining the problem-solving behavior of both humans and OpenAI's GPT-4 large language model. The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model. The problem-solving task was either to create symmetry within a grid (Experiments 1 and 3) or to edit a summary (Experiments 2 and 4). As hypothesized, we found that overall, the addition bias was present. Solution efficiency (Experiments 1 and 2) and valence of the instruction (Experiments 3 and 4) played important roles. Human participants were less likely to use additive strategies when subtraction was relatively more efficient than when addition and subtraction were equally efficient. GPT-4 exhibited the opposite behavior, with a strong addition bias when subtraction was more efficient. In terms of instruction valence, GPT-4 was more likely to add words when asked to "improve" compared to "edit", whereas humans did not show this effect. When we looked at the addition bias under different conditions, we found more biased responses for GPT-4 compared to humans. Our findings highlight the importance of considering comparable and sometimes superior subtractive alternatives, as well as reevaluating one's own and particularly the language models' problem-solving behavior.
Related papers
- An Empirical Analysis on Large Language Models in Debate Evaluation [10.677407097411768]
We investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation.
We uncover a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented.
We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry connotations such as numerical or sequential.
arXiv Detail & Related papers (2024-05-28T18:34:53Z)
- Identifying and Improving Disability Bias in GPT-Based Resume Screening [9.881826151448198]
We ask ChatGPT to rank a resume against the same resume enhanced with an additional leadership award, scholarship, panel presentation, and membership that are disability related.
We find that GPT-4 exhibits prejudice towards these enhanced CVs.
We show that this prejudice can be quantifiably reduced by training a custom GPT on principles of DEI and disability justice.
arXiv Detail & Related papers (2024-01-28T17:04:59Z)
- Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges [54.42256219010956]
This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference.
Bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data.
Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted by how the text prompt is phrased or how the input image is presented.
arXiv Detail & Related papers (2023-11-06T17:26:59Z)
- Towards Understanding Sycophancy in Language Models [49.99654432561934]
We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback.
We show that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks.
Our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
arXiv Detail & Related papers (2023-10-20T14:46:48Z)
- Can large language models provide useful feedback on research papers? A large-scale empirical analysis [38.905758846360435]
High-quality peer reviews are increasingly difficult to obtain.
With the breakthrough of large language models (LLM) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback.
We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
arXiv Detail & Related papers (2023-10-03T04:14:17Z)
- Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias [57.42417061979399]
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically.
In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs.
Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families.
arXiv Detail & Related papers (2023-08-01T01:39:25Z)
- Inductive reasoning in humans and large language models [0.0]
We apply GPT-3.5 and GPT-4 to a classic problem in human inductive reasoning known as property induction.
Although GPT-3.5 struggles to capture many aspects of human behaviour, GPT-4 is much more successful.
arXiv Detail & Related papers (2023-06-11T00:23:25Z)
- An Empirical Analysis of Parameter-Efficient Methods for Debiasing Pre-Trained Language Models [55.14405248920852]
We conduct experiments with prefix tuning, prompt tuning, and adapter tuning on different language models and bias types to evaluate their debiasing performance.
We find that the parameter-efficient methods are effective in mitigating gender bias, where adapter tuning is consistently the most effective.
We also find that prompt tuning is more suitable for GPT-2 than for BERT, and that the methods are less effective when it comes to racial and religious bias.
arXiv Detail & Related papers (2023-06-06T23:56:18Z)
- Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z)
- Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure [0.0]
GPT-3, GPT-3.5, and GPT-4 were trained on large quantities of human-generated text.
We show that GPT-3 showed evidence of ETR-predicted outputs for 59% of these examples.
Remarkably, the production of human-like fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in GPT-4.
arXiv Detail & Related papers (2023-03-30T10:32:18Z)
- Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data.
We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more.
We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.