Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4
- URL: http://arxiv.org/abs/2404.16692v1
- Date: Thu, 25 Apr 2024 15:53:00 GMT
- Title: Influence of Solution Efficiency and Valence of Instruction on Additive and Subtractive Solution Strategies in Humans and GPT-4
- Authors: Lydia Uhler, Verena Jordan, Jürgen Buder, Markus Huff, Frank Papenmeier
- Abstract summary: This study examined the problem-solving behavior of humans and OpenAI's GPT-4 large language model.
The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We explored the addition bias, a cognitive tendency to prefer adding elements over removing them to alter an initial state or structure, by conducting four preregistered experiments examining the problem-solving behavior of both humans and OpenAI's GPT-4 large language model. The experiments involved 588 participants from the U.S. and 680 iterations of the GPT-4 model. The problem-solving task was either to create symmetry within a grid (Experiments 1 and 3) or to edit a summary (Experiments 2 and 4). As hypothesized, we found that overall, the addition bias was present. Solution efficiency (Experiments 1 and 2) and valence of the instruction (Experiments 3 and 4) played important roles. Human participants were less likely to use additive strategies when subtraction was relatively more efficient than when addition and subtraction were equally efficient. GPT-4 exhibited the opposite behavior, with a strong addition bias when subtraction was more efficient. In terms of instruction valence, GPT-4 was more likely to add words when asked to "improve" compared to "edit", whereas humans did not show this effect. When we looked at the addition bias under different conditions, we found more biased responses for GPT-4 compared to humans. Our findings highlight the importance of considering comparable and sometimes superior subtractive alternatives, as well as reevaluating one's own and particularly the language models' problem-solving behavior.
Related papers
- An Empirical Analysis on Large Language Models in Debate Evaluation [10.677407097411768]
We investigate the capabilities and inherent biases of advanced large language models (LLMs) such as GPT-3.5 and GPT-4 in the context of debate evaluation.
We uncover a consistent bias in both GPT-3.5 and GPT-4 towards the second candidate response presented.
We also uncover lexical biases in both GPT-3.5 and GPT-4, especially when label sets carry connotations such as numerical or sequential.
arXiv Detail & Related papers (2024-05-28T18:34:53Z)
- Identifying and Improving Disability Bias in GPT-Based Resume Screening [9.881826151448198]
We ask ChatGPT to rank a resume against the same resume enhanced with an additional leadership award, scholarship, panel presentation, and membership that are disability related.
We find that GPT-4 exhibits prejudice towards these enhanced CVs.
We show that this prejudice can be quantifiably reduced by training a custom GPT on principles of DEI and disability justice.
arXiv Detail & Related papers (2024-01-28T17:04:59Z)
- Holistic Analysis of Hallucination in GPT-4V(ision): Bias and Interference Challenges [54.42256219010956]
This benchmark is designed to evaluate and shed light on the two common types of hallucinations in visual language models: bias and interference.
Bias refers to the model's tendency to hallucinate certain types of responses, possibly due to imbalance in its training data.
Interference pertains to scenarios where the judgment of GPT-4V(ision) can be disrupted by how the text prompt is phrased or how the input image is presented.
arXiv Detail & Related papers (2023-11-06T17:26:59Z)
- Towards Understanding Sycophancy in Language Models [49.99654432561934]
We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback.
We show that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks.
Our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
arXiv Detail & Related papers (2023-10-20T14:46:48Z)
- Can large language models provide useful feedback on research papers? A large-scale empirical analysis [38.905758846360435]
High-quality peer reviews are increasingly difficult to obtain.
With the breakthrough of large language models (LLM) such as GPT-4, there is growing interest in using LLMs to generate scientific feedback.
We created an automated pipeline using GPT-4 to provide comments on the full PDFs of scientific papers.
arXiv Detail & Related papers (2023-10-03T04:14:17Z)
- Instructed to Bias: Instruction-Tuned Language Models Exhibit Emergent Cognitive Bias [57.42417061979399]
Recent studies show that instruction tuning (IT) and reinforcement learning from human feedback (RLHF) improve the abilities of large language models (LMs) dramatically.
In this work, we investigate the effect of IT and RLHF on decision making and reasoning in LMs.
Our findings highlight the presence of these biases in various models from the GPT-3, Mistral, and T5 families.
arXiv Detail & Related papers (2023-08-01T01:39:25Z)
- Inductive reasoning in humans and large language models [0.0]
We apply GPT-3.5 and GPT-4 to a classic problem in human inductive reasoning known as property induction.
Although GPT-3.5 struggles to capture many aspects of human behaviour, GPT-4 is much more successful.
arXiv Detail & Related papers (2023-06-11T00:23:25Z)
- An Empirical Analysis of Parameter-Efficient Methods for Debiasing Pre-Trained Language Models [55.14405248920852]
We conduct experiments with prefix tuning, prompt tuning, and adapter tuning on different language models and bias types to evaluate their debiasing performance.
We find that the parameter-efficient methods are effective in mitigating gender bias, where adapter tuning is consistently the most effective.
We also find that prompt tuning is more suitable for GPT-2 than for BERT, and that the methods are less effective when it comes to racial and religious bias.
arXiv Detail & Related papers (2023-06-06T23:56:18Z)
- Is GPT-4 a Good Data Analyst? [67.35956981748699]
We consider GPT-4 as a data analyst to perform end-to-end data analysis with databases from a wide range of domains.
We design several task-specific evaluation metrics to systematically compare the performance between several professional human data analysts and GPT-4.
Experimental results show that GPT-4 can achieve comparable performance to humans.
arXiv Detail & Related papers (2023-05-24T11:26:59Z)
- Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure [0.0]
GPT-3, GPT-3.5, and GPT-4 were trained on large quantities of human-generated text.
We show that GPT-3 showed evidence of ETR-predicted outputs for 59% of these examples.
Remarkably, the production of human-like fallacious judgments increased from 18% in GPT-3 to 33% in GPT-3.5 and 34% in GPT-4.
arXiv Detail & Related papers (2023-03-30T10:32:18Z)
- Sparks of Artificial General Intelligence: Early experiments with GPT-4 [66.1188263570629]
GPT-4, developed by OpenAI, was trained using an unprecedented scale of compute and data.
We demonstrate that GPT-4 can solve novel and difficult tasks that span mathematics, coding, vision, medicine, law, psychology and more.
We believe GPT-4 could reasonably be viewed as an early (yet still incomplete) version of an artificial general intelligence (AGI) system.
arXiv Detail & Related papers (2023-03-22T16:51:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.