Related papers: Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games

URL: http://arxiv.org/abs/2310.01468v3
Date: Tue, 20 Feb 2024 21:24:43 GMT
Title: Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games
Authors: Yizhe Zhang, Jiarui Lu, Navdeep Jaitly
Abstract summary: Large language models (LLMs) are effective at answering questions that are clearly asked. When faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively.
Score: 14.063311955315077
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language models (LLMs) are effective at answering questions that are clearly asked. However, when faced with ambiguous queries they can act unpredictably and produce incorrect outputs. This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively. This capability requires complex understanding, state tracking, reasoning and planning over multiple conversational turns. However, directly measuring this can be challenging. In this paper, we offer a surrogate problem which assesses an LLM's capability to deduce an entity unknown to itself, but revealed to a judge, by asking the judge a series of queries. This entity-deducing game can serve as an evaluation framework to probe the conversational reasoning and planning capabilities of language models. We systematically evaluate various LLMs and discover significant differences in their performance on this task. We find that strong LLMs like GPT-4 outperform human players by a large margin. We further employ Behavior Cloning (BC) to examine whether a weaker model is capable of imitating a stronger model and generalizing to data or domains, using only the demonstrations from a stronger model. We finally propose to use Reinforcement Learning to enhance the reasoning and planning capacity of Vicuna models through episodes of game playing, which leads to significant performance improvement. We hope that this problem offers insights into how autonomous agents could be trained to behave more intelligently in ambiguous circumstances.
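
The entity-deducing game described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy knowledge base `ENTITIES`, the rule-based judge, and the greedy question picker `pick_question` are all hypothetical stand-ins for the LLM guesser and LLM judge used in the paper, and the final guess is not counted as a turn here. The sketch only shows the loop structure: the guesser asks yes/no queries, the judge (who knows the entity) answers, and the guesser narrows its candidates within a 20-turn budget.

```python
from typing import Callable, Dict, FrozenSet, List, Optional, Set, Tuple

# Hypothetical toy knowledge base: entity -> attributes that hold for it.
ENTITIES: Dict[str, FrozenSet[str]] = {
    "dog": frozenset({"animal", "mammal", "pet"}),
    "eagle": frozenset({"animal", "flies"}),
    "goldfish": frozenset({"animal", "pet", "swims"}),
    "airplane": frozenset({"flies", "man-made"}),
    "bicycle": frozenset({"man-made"}),
}


def make_judge(secret: str) -> Callable[[str], bool]:
    """Return a judge that knows the secret entity and answers yes/no."""
    def judge(attribute: str) -> bool:
        return attribute in ENTITIES[secret]
    return judge


def pick_question(candidates: List[str], asked: Set[str]) -> Optional[str]:
    """Pick the unasked attribute that splits the candidates most evenly."""
    attributes = sorted({a for c in candidates for a in ENTITIES[c]} - asked)
    best, best_gap = None, None
    for attribute in attributes:
        yes = sum(attribute in ENTITIES[c] for c in candidates)
        if yes == 0 or yes == len(candidates):
            continue  # this question would not rule anything out
        gap = abs(2 * yes - len(candidates))
        if best_gap is None or gap < best_gap:
            best, best_gap = attribute, gap
    return best


def play(secret: str, max_turns: int = 20) -> Tuple[Optional[str], int]:
    """Run one game; return (final guess or None, number of questions asked)."""
    judge = make_judge(secret)
    candidates = sorted(ENTITIES)
    asked: Set[str] = set()
    turns = 0
    while len(candidates) > 1 and turns < max_turns:
        question = pick_question(candidates, asked)
        if question is None:
            break  # no remaining question can separate the candidates
        answer = judge(question)
        asked.add(question)
        turns += 1
        # Keep only the candidates consistent with the judge's answer.
        candidates = [c for c in candidates if (question in ENTITIES[c]) == answer]
    guess = candidates[0] if len(candidates) == 1 else None
    return guess, turns


if __name__ == "__main__":
    for entity in ENTITIES:
        guess, turns = play(entity)
        print(f"secret={entity!r} guess={guess!r} questions={turns}")
```

In the setting the abstract describes, both roles would be played by language models exchanging free-form text, so the state tracking and planning that this sketch does explicitly with a candidate list would have to happen implicitly in the guesser model.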

Related papers

Under the Influence: Quantifying Persuasion and Vigilance in Large Language Models [13.754658024896612]
We study the abilities of Large Language Models to persuade and be rationally vigilant towards other LLM agents. We find that puzzle-solving performance, persuasive capability, and vigilance are dissociable capacities in LLMs. Our work presents the first investigation of the relationship between persuasion, vigilance, and task performance in LLMs.
arXiv Detail & Related papers (2026-02-24T04:09:21Z)
Do Reasoning Models Ask Better Questions? A Formal Information-Theoretic Analysis on Multi-Turn LLM Games [0.0]
Large Language Models (LLMs) excel at many tasks but struggle with a critical ability for resolving ambiguity in user requests. We propose a multi-turn dialogue framework that quantitatively measures how effectively LLMs gather information through yes/no questions. Our experiments demonstrate that, among the evaluated models, the ones with explicit reasoning capabilities achieve higher IG per turn and reach solutions in fewer steps.
arXiv Detail & Related papers (2026-01-25T06:38:15Z)
Multi-Agent Evolve: LLM Self-Improve through Co-evolution [53.00458074754831]
Reinforcement Learning (RL) has demonstrated significant potential in enhancing the reasoning capabilities of large language models (LLMs). Recent Self-Play RL methods, inspired by the success of the paradigm in games and Go, aim to enhance LLM reasoning capabilities without human-annotated data. We propose Multi-Agent Evolve (MAE), a framework that enables LLMs to self-evolve in solving diverse tasks, including mathematics, reasoning, and general knowledge Q&A.
arXiv Detail & Related papers (2025-10-27T17:58:02Z)
Agent-Based Detection and Resolution of Incompleteness and Ambiguity in Interactions with Large Language Models [0.9856777842758593]
This paper examines the use of agent-based architecture to bolster LLM-based Question-Answering systems with additional reasoning capabilities. We equip different LLMs with agents that act as specialists in detecting and resolving deficiencies of incompleteness and ambiguity. This suggests that the agent-based approach could be a useful mechanism to harness the power of LLMs to develop more robust QA systems.
arXiv Detail & Related papers (2025-07-04T17:28:33Z)
Scaling Autonomous Agents via Automatic Reward Modeling And Planning [52.39395405893965]
Large language models (LLMs) have demonstrated remarkable capabilities across a range of tasks. However, they still struggle with problems requiring multi-step decision-making and environmental feedback. We propose a framework that can automatically learn a reward model from the environment without human annotations.
arXiv Detail & Related papers (2025-02-17T18:49:25Z)
Reasoning with Large Language Models, a Survey [2.831296564800826]
This paper reviews the rapidly expanding field of prompt-based reasoning with LLMs. Our taxonomy identifies different ways to generate, evaluate, and control multi-step reasoning. We find that self-improvement, self-reflection, and some meta abilities of the reasoning processes are possible through the judicious use of prompts.
arXiv Detail & Related papers (2024-07-16T08:49:35Z)
Crafting Interpretable Embeddings by Asking LLMs Questions [89.49960984640363]
Large language models (LLMs) have rapidly improved text embeddings for a growing array of natural-language processing tasks. We introduce question-answering embeddings (QA-Emb), embeddings where each feature represents an answer to a yes/no question asked to an LLM. We use QA-Emb to flexibly generate interpretable models for predicting fMRI voxel responses to language stimuli.
arXiv Detail & Related papers (2024-05-26T22:30:29Z)
Optimizing Language Model's Reasoning Abilities with Weak Supervision [48.60598455782159]
We present PuzzleBen, a weakly supervised benchmark that comprises 25,147 complex questions, answers, and human-generated rationales. A unique aspect of our dataset is the inclusion of 10,000 unannotated questions, enabling us to explore utilizing fewer supervised data to boost LLMs' inference capabilities.
arXiv Detail & Related papers (2024-05-07T07:39:15Z)
You don't need a personality test to know these models are unreliable: Assessing the Reliability of Large Language Models on Psychometric Instruments [37.03210795084276]
We examine whether the current format of prompting Large Language Models elicits responses in a consistent and robust manner. Our experiments on 17 different LLMs reveal that even simple perturbations significantly downgrade a model's question-answering ability. Our results suggest that the currently widespread practice of prompting is insufficient to accurately and reliably capture model perceptions.
arXiv Detail & Related papers (2023-11-16T09:50:53Z)
MAgIC: Investigation of Large Language Model Powered Multi-Agent in Cognition, Adaptability, Rationality and Collaboration [102.41118020705876]
Large Language Models (LLMs) have marked a significant advancement in the field of natural language processing. As their applications extend into multi-agent environments, a need has arisen for a comprehensive evaluation framework. This work introduces a novel benchmarking framework specifically tailored to assess LLMs within multi-agent settings.
arXiv Detail & Related papers (2023-11-14T21:46:27Z)
A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning. Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z)
How FaR Are Large Language Models From Agents with Theory-of-Mind? [69.41586417697732]
We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D). T4D requires models to connect inferences about others' mental states to actions in social scenarios. We introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges.
arXiv Detail & Related papers (2023-10-04T06:47:58Z)
Are Large Language Models Really Robust to Word-Level Perturbations? [68.60618778027694]
We propose a novel rational evaluation approach that leverages pre-trained reward models as diagnostic tools. Longer conversations manifest the comprehensive grasp of language models in terms of their proficiency in understanding questions. Our results demonstrate that LLMs frequently exhibit vulnerability to word-level perturbations that are commonplace in daily language usage.
arXiv Detail & Related papers (2023-09-20T09:23:46Z)
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution. Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)

This list is automatically generated from the titles and abstracts of the papers on this site.