Related papers: How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games

URL: http://arxiv.org/abs/2412.12362v1
Date: Mon, 16 Dec 2024 21:25:45 GMT
Title: How Different AI Chatbots Behave? Benchmarking Large Language Models in Behavioral Economics Games
Authors: Yutong Xie, Yiyao Liu, Zhuang Ma, Lin Shi, Xiyuan Wang, Walter Yuan, Matthew O. Jackson, Qiaozhu Mei
Abstract summary: This paper presents a comprehensive analysis of five leading large language models (LLMs) as they navigate a series of behavioral economics games. We aim to uncover and document both common and distinct behavioral patterns across a range of scenarios. The findings provide valuable insights into the strategic preferences of each LLM, highlighting potential implications for their deployment in critical decision-making roles.
Score: 20.129667072835773
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: The deployment of large language models (LLMs) in diverse applications requires a thorough understanding of their decision-making strategies and behavioral patterns. As a supplement to a recent study on the behavioral Turing test, this paper presents a comprehensive analysis of five leading LLM-based chatbot families as they navigate a series of behavioral economics games. By benchmarking these AI chatbots, we aim to uncover and document both common and distinct behavioral patterns across a range of scenarios. The findings provide valuable insights into the strategic preferences of each LLM, highlighting potential implications for their deployment in critical decision-making roles.

Related papers

VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models [121.03333569013148]
We introduce VisuLogic: a benchmark of 1,000 human-verified problems across six categories. These types of questions can be evaluated to assess the visual reasoning capabilities of MLLMs from multiple perspectives. Most models score below 30% accuracy, only slightly above the 25% random baseline and far below the 51.4% achieved by humans.
arXiv Detail & Related papers (2025-04-21T17:59:53Z)
LogiDynamics: Unraveling the Dynamics of Logical Inference in Large Language Model Reasoning [49.58786377307728]
This paper adopts an exploratory approach by introducing a controlled evaluation environment for analogical reasoning. We analyze the comparative dynamics of inductive, abductive, and deductive inference pipelines. We investigate advanced paradigms such as hypothesis selection, verification, and refinement, revealing their potential to scale up logical inference.
arXiv Detail & Related papers (2025-02-16T15:54:53Z)
GLEE: A Unified Framework and Benchmark for Language-based Economic Environments [19.366120861935105]
Large Language Models (LLMs) show significant potential in economic and strategic interactions. These questions become crucial concerning the economic and societal implications of integrating LLM-based agents into real-world data-driven systems. We introduce a benchmark for standardizing research on two-player, sequential, language-based games.
arXiv Detail & Related papers (2024-10-07T17:55:35Z)
LangSuitE: Planning, Controlling and Interacting with Large Language Models in Embodied Text Environments [70.91258869156353]
We introduce LangSuitE, a versatile and simulation-free testbed featuring 6 representative embodied tasks in textual embodied worlds. Compared with previous LLM-based testbeds, LangSuitE offers adaptability to diverse environments without multiple simulation engines. We devise a novel chain-of-thought (CoT) schema, EmMem, which summarizes embodied states w.r.t. history information.
arXiv Detail & Related papers (2024-06-24T03:36:29Z)
Rethinking ChatGPT's Success: Usability and Cognitive Behaviors Enabled by Auto-regressive LLMs' Prompting [5.344199202349884]
We analyze the structure of modalities within two types of Large Language Models and six task-specific channels during deployment. We examine the stimulation of diverse cognitive behaviors in LLMs through the adoption of free-form text and verbal contexts.
arXiv Detail & Related papers (2024-05-17T00:19:41Z)
Is English the New Programming Language? How About Pseudo-code Engineering? [0.0]
This study investigates how different input forms impact ChatGPT, a leading language model by OpenAI. It examines the model's proficiency across four categories: understanding of intentions, interpretability, completeness, and creativity.
arXiv Detail & Related papers (2024-04-08T16:28:52Z)
Evaluating Interventional Reasoning Capabilities of Large Language Models [58.52919374786108]
Large language models (LLMs) can estimate causal effects under interventions on different parts of a system. We conduct empirical analyses to evaluate whether LLMs can accurately update their knowledge of a data-generating process in response to an intervention. We create benchmarks that span diverse causal graphs (e.g., confounding, mediation) and variable types, and enable a study of intervention-based reasoning.
arXiv Detail & Related papers (2024-04-08T14:15:56Z)
Characteristic AI Agents via Large Language Models [40.10858767752735]
This research focuses on investigating the performance of Large Language Models in constructing characteristic AI agents. A dataset called "Character100" is built for this benchmark, comprising the most-visited people on Wikipedia for language models to role-play. The experimental results underscore the potential directions for further improvement in the capabilities of LLMs in constructing characteristic AI agents.
arXiv Detail & Related papers (2024-03-19T02:25:29Z)
Comparing Inferential Strategies of Humans and Large Language Models in Deductive Reasoning [25.732397636695882]
We show that large language models (LLMs) display reasoning patterns akin to those observed in humans. Our research demonstrates that the architecture and scale of the model significantly affect its preferred method of reasoning.
arXiv Detail & Related papers (2024-02-20T12:58:14Z)
Generative Judge for Evaluating Alignment [84.09815387884753]
We propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios. Experimentally, Auto-J outperforms a series of strong competitors, including both open-source and closed-source models.
arXiv Detail & Related papers (2023-10-09T07:27:15Z)
Re-Reading Improves Reasoning in Large Language Models [87.46256176508376]
We introduce a simple yet general and effective prompting method, Re2, to enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs). Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), Re2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process. We evaluate Re2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality.
arXiv Detail & Related papers (2023-09-12T14:36:23Z)
Pre-Trained Language Models for Interactive Decision-Making [72.77825666035203]
We describe a framework for imitation learning in which goals and observations are represented as a sequence of embeddings. We demonstrate that this framework enables effective generalization across different environments. For test tasks involving novel goals or novel scenes, initializing policies with language models improves task completion rates by 43.6%.
arXiv Detail & Related papers (2022-02-03T18:55:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.