Related papers: The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind

URL: http://arxiv.org/abs/2506.20664v1
Date: Wed, 25 Jun 2025 17:55:27 GMT
Title: The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind
Authors: Andrei Lupu, Timon Willi, Jakob Foerster
Abstract summary: Decrypto is a game-based benchmark for multi-agent reasoning and ToM. It is the first platform for designing interactive ToM experiments. We find that LLM game-playing abilities lag behind humans and simple word-embedding baselines.
Score: 8.341160422849969
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As Large Language Models (LLMs) gain agentic abilities, they will have to navigate complex multi-agent scenarios, interacting with human users and other agents in cooperative and competitive settings. This will require new reasoning skills, chief amongst them being theory of mind (ToM), or the ability to reason about the "mental" states of other agents. However, ToM and other multi-agent abilities in LLMs are poorly understood, since existing benchmarks suffer from narrow scope, data leakage, saturation, and lack of interactivity. We thus propose Decrypto, a game-based benchmark for multi-agent reasoning and ToM drawing inspiration from cognitive science, computational pragmatics and multi-agent reinforcement learning. It is designed to be as easy as possible in all other dimensions, eliminating confounding factors commonly found in other benchmarks. To our knowledge, it is also the first platform for designing interactive ToM experiments. We validate the benchmark design through comprehensive empirical evaluations of frontier LLMs, robustness studies, and human-AI cross-play experiments. We find that LLM game-playing abilities lag behind humans and simple word-embedding baselines. We then create variants of two classic cognitive science experiments within Decrypto to evaluate three key ToM abilities. Surprisingly, we find that state-of-the-art reasoning models are significantly worse at those tasks than their older counterparts. This demonstrates that Decrypto addresses a crucial gap in current reasoning and ToM evaluations, and paves the path towards better artificial agents.

Related papers

MultiMind: Enhancing Werewolf Agents with Multimodal Reasoning and Theory of Mind [17.2922544295112]
MultiMind is the first framework integrating multimodal information into social deduction agents. It processes facial expressions and vocal tones alongside verbal content, while employing a Theory of Mind (ToM) model. By combining this ToM model with Monte Carlo Tree Search (MCTS), our agent identifies communication strategies that minimize suspicion directed at itself.
arXiv Detail & Related papers (2025-04-25T03:12:43Z)
ReMA: Learning to Meta-think for LLMs with Multi-Agent Reinforcement Learning [53.817538122688944]
We introduce Reinforced Meta-thinking Agents (ReMA) to elicit meta-thinking behaviors in the reasoning of Large Language Models (LLMs). ReMA decouples the reasoning process into two hierarchical agents: a high-level meta-thinking agent responsible for generating strategic oversight and plans, and a low-level reasoning agent for detailed executions. Empirical results from single-turn experiments demonstrate that ReMA outperforms single-agent RL baselines on complex reasoning tasks.
arXiv Detail & Related papers (2025-03-12T16:05:31Z)
Approximating Human Strategic Reasoning with LLM-Enhanced Recursive Reasoners Leveraging Multi-agent Hypergames [3.5083201638203154]
We implement a role-based multi-agent strategic interaction framework tailored to sophisticated reasoners. We use one-shot, 2-player beauty contests to evaluate the reasoning capabilities of the latest LLMs. Our experiments show that artificial reasoners can outperform the baseline model in terms of both approximating human behaviour and reaching the optimal solution.
arXiv Detail & Related papers (2025-02-11T10:37:20Z)
Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection [31.38516078163367]
ToM-agent is designed to empower LLM-based generative agents to simulate ToM in open-domain conversational interactions. ToM-agent disentangles the confidence from mental states, facilitating the emulation of an agent's perception of its counterpart's mental states. Our findings indicate that the ToM-agent can grasp the underlying reasons for its counterpart's behaviors beyond mere semantic-emotional supporting or decision-making based on common sense.
arXiv Detail & Related papers (2025-01-26T00:32:38Z)
MageBench: Bridging Large Multimodal Models to Agents [90.59091431806793]
LMMs have shown impressive visual understanding capabilities, with the potential to be applied in agents. Existing benchmarks mostly assess their reasoning abilities in the language part. MageBench is a reasoning-capability-oriented multimodal agent benchmark.
arXiv Detail & Related papers (2024-12-05T17:08:19Z)
Hypothetical Minds: Scaffolding Theory of Mind for Multi-Agent Tasks with Large Language Models [4.9108308035618515]
Multi-agent reinforcement learning (MARL) methods struggle with the non-stationarity of multi-agent systems. Here, we leverage large language models (LLMs) to create an autonomous agent that can handle these challenges. Our agent, Hypothetical Minds, consists of a cognitively-inspired architecture, featuring modular components for perception, memory, and hierarchical planning over two levels of abstraction.
arXiv Detail & Related papers (2024-07-09T17:57:15Z)
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [98.18244218156492]
Large Language Models (LLMs) have significantly advanced natural language processing. As their applications expand into multi-agent environments, there arises a need for a comprehensive evaluation framework. This work introduces a novel competition-based benchmark framework to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
The Rise and Potential of Large Language Model Based Agents: A Survey [91.71061158000953]
Large language models (LLMs) are regarded as potential sparks for Artificial General Intelligence (AGI). We start by tracing the concept of agents from its philosophical origins to its development in AI, and explain why LLMs are suitable foundations for agents. We explore the extensive applications of LLM-based agents in three aspects: single-agent scenarios, multi-agent scenarios, and human-agent cooperation.
arXiv Detail & Related papers (2023-09-14T17:12:03Z)
Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration [116.09561564489799]
Solo Performance Prompting transforms a single LLM into a cognitive synergist by engaging in multi-turn self-collaboration with multiple personas. A cognitive synergist is an intelligent agent that collaboratively combines multiple minds' strengths and knowledge to enhance problem-solving in complex tasks. Our in-depth analysis shows that assigning multiple fine-grained personas in LLMs improves problem-solving abilities compared to using a single or fixed number of personas.
arXiv Detail & Related papers (2023-07-11T14:45:19Z)
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in a "tit for tat" manner and a judge manages the debate process to obtain a final solution. Our framework encourages divergent thinking in LLMs, which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.