Leveraging Word Guessing Games to Assess the Intelligence of Large
Language Models
- URL: http://arxiv.org/abs/2310.20499v2
- Date: Mon, 6 Nov 2023 02:27:34 GMT
- Title: Leveraging Word Guessing Games to Assess the Intelligence of Large
Language Models
- Authors: Tian Liang and Zhiwei He and Jen-tse Huang and Wenxuan Wang and
Wenxiang Jiao and Rui Wang and Yujiu Yang and Zhaopeng Tu and Shuming Shi and
Xing Wang
- Abstract summary: The paper is inspired by the popular language game ``Who is Spy''.
We develop DEEP to evaluate LLMs' expression and disguising abilities.
We then introduce SpyGame, an interactive multi-agent framework.
- Score: 105.39236338147715
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The automatic evaluation of LLM-based agent intelligence is critical in
developing advanced LLM-based agents. Although considerable effort has been
devoted to developing human-annotated evaluation datasets, such as AlpacaEval,
existing techniques are costly, time-consuming, and lack adaptability. In this
paper, inspired by the popular language game ``Who is Spy'', we propose to use
the word guessing game to assess the intelligence performance of LLMs. Given a
word, the LLM is asked to describe the word and determine its identity (spy or
not) based on its and other players' descriptions. Ideally, an advanced agent
should possess the ability to accurately describe a given word using an
aggressive description while concurrently maximizing confusion in the
conservative description, enhancing its participation in the game. To this end,
we first develop DEEP to evaluate LLMs' expression and disguising abilities.
DEEP requires the LLM to describe a word in aggressive and conservative modes. We
then introduce SpyGame, an interactive multi-agent framework designed to assess
LLMs' intelligence through participation in a competitive language-based board
game. Incorporating multi-agent interaction, SpyGame requires the target LLM to
possess linguistic skills and strategic thinking, providing a more
comprehensive evaluation of LLMs' human-like cognitive abilities and
adaptability in complex communication situations. The proposed evaluation
framework is very easy to implement. We collected words from multiple sources,
domains, and languages and used the proposed evaluation framework to conduct
experiments. Extensive experiments demonstrate that the proposed DEEP and
SpyGame effectively evaluate the capabilities of various LLMs, capturing their
ability to adapt to novel situations and engage in strategic communication.
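The abstract does not reproduce the DEEP prompts, but the protocol it describes (ask the target LLM to describe a word in an aggressive mode and a conservative mode, then check whether a separate guesser can recover the word from each description) can be sketched as below. This is a minimal illustration under assumed prompts, not the authors' implementation; query_llm, the prompt templates, and the substring-match scoring are all placeholders to be replaced with a real LLM backend and a proper judge.

    # Minimal sketch of a DEEP-style dual-mode evaluation (not the authors' code).
    # `query_llm` is a placeholder stub: wire it to whatever chat-completion backend you use.

    def query_llm(prompt: str) -> str:
        """Placeholder for a real LLM call (API client or local model)."""
        raise NotImplementedError("replace this stub with your own LLM backend")

    AGGRESSIVE_PROMPT = (
        "Describe the word '{word}' as clearly and informatively as possible, "
        "without using the word itself."
    )
    CONSERVATIVE_PROMPT = (
        "Describe the word '{word}' as vaguely as possible, so listeners cannot "
        "easily tell which word you mean, without using the word itself."
    )
    GUESS_PROMPT = "Guess the single word described here, answering with one word: {desc}"

    def deep_style_eval(words: list[str]) -> dict[str, float]:
        """For each word, collect both descriptions and test whether a guesser recovers it.

        A capable model should score high in aggressive mode (clear expression)
        and low in conservative mode (effective disguise).
        """
        hits = {"aggressive": 0, "conservative": 0}
        for word in words:
            for mode, template in (("aggressive", AGGRESSIVE_PROMPT),
                                   ("conservative", CONSERVATIVE_PROMPT)):
                description = query_llm(template.format(word=word))
                guess = query_llm(GUESS_PROMPT.format(desc=description))
                if word.lower() in guess.lower():
                    hits[mode] += 1
        return {mode: count / len(words) for mode, count in hits.items()}

    if __name__ == "__main__":
        print(deep_style_eval(["violin", "glacier", "umbrella"]))

SpyGame extends this setup to an interactive multi-agent game loop in which several agents describe their assigned words in turn and then vote on who holds the spy word; the sketch above covers only the single-agent DEEP side.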
Related papers
- Who is Undercover? Guiding LLMs to Explore Multi-Perspective Team Tactic in the Game [3.8284679578037246]
We use the language logic game ``Who is Undercover?'' as an experimental platform to propose the Multi-Perspective Team Tactic (MPTT) framework.
MPTT aims to cultivate LLMs' human-like language expression logic, multi-dimensional thinking, and self-perception in complex scenarios.
Preliminary results show that MPTT, combined with WIU, leverages LLMs' cognitive capabilities to create a decision-making framework that can simulate real society.
arXiv Detail & Related papers (2024-10-20T06:41:31Z)
- Evaluating and Enhancing LLMs Agent based on Theory of Mind in Guandan: A Multi-Player Cooperative Game under Imperfect Information [36.11862095329315]
Large language models (LLMs) have shown success in handling simple games with imperfect information.
This study investigates the applicability of knowledge acquired by open-source and API-based LLMs to sophisticated text-based games.
arXiv Detail & Related papers (2024-08-05T15:36:46Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- GTBench: Uncovering the Strategic Reasoning Limitations of LLMs via Game-Theoretic Evaluations [87.99872683336395]
Large Language Models (LLMs) are integrated into critical real-world applications.
This paper evaluates LLMs' reasoning abilities in competitive environments.
We first propose GTBench, a language-driven environment comprising 10 widely recognized tasks.
arXiv Detail & Related papers (2024-02-19T18:23:36Z)
- If LLM Is the Wizard, Then Code Is the Wand: A Survey on How Code Empowers Large Language Models to Serve as Intelligent Agents [81.60906807941188]
Large language models (LLMs) are trained on a combination of natural language and formal language (code).
Code translates high-level goals into executable steps, featuring standard syntax, logical consistency, abstraction, and modularity.
arXiv Detail & Related papers (2024-01-01T16:51:20Z) - LMRL Gym: Benchmarks for Multi-Turn Reinforcement Learning with Language
Models [56.25156596019168]
This paper introduces the LMRL-Gym benchmark for evaluating multi-turn RL for large language models (LLMs).
Our benchmark consists of 8 different language tasks, which require multiple rounds of language interaction and cover a range of tasks in open-ended dialogue and text games.
arXiv Detail & Related papers (2023-11-30T03:59:31Z) - Language Model-In-The-Loop: Data Optimal Approach to Learn-To-Recommend
Actions in Text Games [16.281640651021434]
Large Language Models (LLMs) have demonstrated superior performance in language understanding benchmarks.
The approach leverages the linguistic priors of an LLM (GPT-2) to recommend action candidates, improving performance in text games.
CALM adapts GPT-2 on annotated human gameplay and keeps the LLM fixed while learning the text-based games.
arXiv Detail & Related papers (2023-11-13T19:12:49Z) - Language Agents with Reinforcement Learning for Strategic Play in the
Werewolf Game [40.438765131992525]
We develop strategic language agents that generate flexible language actions and possess strong decision-making abilities.
To mitigate the intrinsic bias in language actions, our agents use an LLM to perform deductive reasoning and generate a diverse set of action candidates.
Experiments show that our agents overcome the intrinsic bias and outperform existing LLM-based agents in the Werewolf game.
arXiv Detail & Related papers (2023-10-29T09:02:57Z)
- TouchStone: Evaluating Vision-Language Models by Language Models [91.69776377214814]
We propose an evaluation method that uses strong large language models as judges to comprehensively evaluate the various abilities of LVLMs.
We construct a comprehensive visual dialogue dataset TouchStone, consisting of open-world images and questions, covering five major categories of abilities and 27 subtasks.
We demonstrate that powerful LVLMs, such as GPT-4, can effectively score dialogue quality by leveraging their textual capabilities alone.
arXiv Detail & Related papers (2023-08-31T17:52:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.