Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
- URL: http://arxiv.org/abs/2508.10142v2
- Date: Tue, 19 Aug 2025 21:37:57 GMT
- Title: Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs
- Authors: Kartikeya Badola, Jonathan Simon, Arian Hosseini, Sara Marie Mc Carthy, Tsendsuren Munkhdalai, Abhimanyu Goyal, Tomáš Kočiský, Shyam Upadhyay, Bahare Fatemi, Mehran Kazemi
- Abstract summary: Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios.
- Score: 12.176547302474528
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments or interactive tasks, which are common in real-world scenarios. This highlights the critical need for developing LLMs that can effectively engage in logically consistent multi-turn dialogue, seek information, and reason with incomplete data. To this end, we introduce a novel benchmark comprising a suite of multi-turn tasks, each designed to test specific reasoning, interactive dialogue, and information-seeking abilities. These tasks have deterministic scoring mechanisms, eliminating the need for human intervention. Evaluating frontier models on our benchmark reveals significant headroom. Our analysis shows that most errors stem from poor instruction following, reasoning failures, and poor planning. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios and offers a robust platform for future research aimed at improving these critical capabilities.
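The key mechanical claim is that every multi-turn task is scored deterministically, so no human judge is needed. The paper's tasks and code are not reproduced here; the sketch below is a hypothetical, minimal harness for an interactive number-guessing task that illustrates what a deterministic multi-turn scoring loop can look like. All names, the task itself, and the scoring rule are assumptions for illustration, not the benchmark's actual implementation.

```python
# Minimal, hypothetical sketch of a deterministically scored multi-turn task.
# The task, class names, and scoring rule are illustrative assumptions; they do
# not reproduce the benchmark's actual tasks or API.
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class GuessNumberTask:
    """Environment holding a secret integer; replies to guesses deterministically."""
    secret: int
    max_turns: int = 10
    transcript: List[str] = field(default_factory=list)

    def respond(self, guess: int) -> str:
        if guess < self.secret:
            reply = "higher"
        elif guess > self.secret:
            reply = "lower"
        else:
            reply = "correct"
        self.transcript.append(f"guess={guess} -> {reply}")
        return reply

def run_episode(task: GuessNumberTask, agent: Callable[[List[str]], int]) -> float:
    """Run the dialogue loop and return a deterministic score in [0, 1]."""
    for turn in range(task.max_turns):
        guess = agent(task.transcript)          # model's next move, given the dialogue so far
        if task.respond(guess) == "correct":
            return 1.0 - turn / task.max_turns  # fewer turns -> higher score
    return 0.0

# Stand-in agent: binary search over a known range (plays the role of an LLM policy).
def binary_search_agent(transcript: List[str], lo: int = 1, hi: int = 100) -> int:
    for line in transcript:
        guess = int(line.split("=")[1].split()[0])
        if line.endswith("higher"):
            lo = guess + 1
        elif line.endswith("lower"):
            hi = guess - 1
    return (lo + hi) // 2

print(run_episode(GuessNumberTask(secret=37), binary_search_agent))
```

In an actual evaluation, the `agent` callable would wrap a model call that maps the dialogue transcript to the next move; the scoring stays purely programmatic.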
Related papers
- LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning [6.96644195073436]
We develop a framework for task-oriented dialogues grounded in realistic reasoning scenarios. Our method generates dialogues grounded in authentic task scenarios, enriched with real-world information. The resulting dataset serves as a valuable benchmark for assessing and advancing the realistic logical reasoning capabilities of large language models.
arXiv Detail & Related papers (2026-02-27T02:23:37Z) - Interactive Evaluation of Large Language Models for Multi-Requirement Software Engineering Tasks [15.072898489107887]
We build on DevAI, a benchmark of 55 programming tasks, by adding ground-truth solutions and evaluating the relevance and utility of interviewer hints. Our results highlight the importance of dynamic evaluation in advancing the development of collaborative code-generating agents.
arXiv Detail & Related papers (2025-08-26T10:22:37Z) - Teaching Language Models To Gather Information Proactively [53.85419549904644]
Large language models (LLMs) are increasingly expected to function as collaborative partners. In this work, we introduce a new task paradigm: proactive information gathering. We design a scalable framework that generates partially specified, real-world tasks, masking key information. Within this setup, our core innovation is a reinforcement finetuning strategy that rewards questions that elicit genuinely new, implicit user information.
arXiv Detail & Related papers (2025-07-28T23:50:09Z) - Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization [9.994839971567542]
We present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality.
arXiv Detail & Related papers (2025-07-02T21:02:41Z) - From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information? [34.959850282872594]
We present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills. AR-Bench comprises three task families: detective cases, situation puzzles, and guessing numbers. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning.
arXiv Detail & Related papers (2025-06-09T23:56:41Z) - MARS-Bench: A Multi-turn Athletic Real-world Scenario Benchmark for Dialogue Evaluation [50.92800625083123]
Large Language Models (LLMs) have been widely adopted in real-world dialogue applications. MARS-Bench is constructed from play-by-play text commentary so as to feature realistic dialogues. Experiments on MARS-Bench also reveal that closed-source LLMs significantly outperform open-source alternatives.
arXiv Detail & Related papers (2025-05-27T10:28:04Z) - MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation [56.87891213797931]
We present MTR-Bench for Large Language Models' Multi-Turn Reasoning evaluation. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities. MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation.
arXiv Detail & Related papers (2025-05-21T17:59:12Z) - Why Reasoning Matters? A Survey of Advancements in Multimodal Reasoning (v1) [66.51642638034822]
Reasoning is central to human intelligence, enabling structured problem-solving across diverse tasks. Recent advances in large language models (LLMs) have greatly enhanced their reasoning abilities in arithmetic, commonsense, and symbolic domains. This paper offers a concise yet insightful overview of reasoning techniques in both textual and multimodal LLMs.
arXiv Detail & Related papers (2025-04-04T04:04:56Z) - CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation [53.452699232071495]
We introduce CrossWordBench, a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs) through the medium of crossword puzzles. Our evaluation reveals that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints. Our findings highlight the limitations of the reasoning capabilities of current LLMs and LVLMs, and provide an effective approach for creating multimodal constrained tasks for future evaluations.
arXiv Detail & Related papers (2025-03-30T20:03:36Z) - Puzzle Solving using Reasoning of Large Language Models: A Survey [1.9939549451457024]
This survey examines the capabilities of Large Language Models (LLMs) in puzzle solving.
Our findings highlight the disparity between LLM capabilities and human-like reasoning.
The survey underscores the necessity for novel strategies and richer datasets to advance LLMs' puzzle-solving proficiency.
arXiv Detail & Related papers (2024-02-17T14:19:38Z) - Exploring the Factual Consistency in Dialogue Comprehension of Large Language Models [51.75805497456226]
This work focuses on the factual consistency issue with the help of the dialogue summarization task.
Our evaluation shows that, on average, 26.8% of the summaries generated by LLMs contain factual inconsistency.
To stimulate and enhance the dialogue comprehension ability of LLMs, we propose a fine-tuning paradigm with auto-constructed multi-task data.
arXiv Detail & Related papers (2023-11-13T09:32:12Z) - Self-Explanation Prompting Improves Dialogue Understanding in Large Language Models [52.24756457516834]
We propose a novel "Self-Explanation" prompting strategy to enhance the comprehension abilities of Large Language Models (LLMs).
This task-agnostic approach requires the model to analyze each dialogue utterance before task execution, thereby improving performance across various dialogue-centric tasks.
Experimental results from six benchmark datasets confirm that our method consistently outperforms other zero-shot prompts and matches or exceeds the efficacy of few-shot prompts.
arXiv Detail & Related papers (2023-09-22T15:41:34Z) - Re-Reading Improves Reasoning in Large Language Models [87.46256176508376]
We introduce a simple, yet general and effective prompting method, Re2, to enhance the reasoning capabilities of off-the-shelf Large Language Models (LLMs).
Unlike most thought-eliciting prompting methods, such as Chain-of-Thought (CoT), Re2 shifts the focus to the input by processing questions twice, thereby enhancing the understanding process.
We evaluate Re2 on extensive reasoning benchmarks across 14 datasets, spanning 112 experiments, to validate its effectiveness and generality.
arXiv Detail & Related papers (2023-09-12T14:36:23Z)
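The Re2 entry above describes its mechanism concretely: the question is processed twice before the answer is elicited. A minimal prompt-construction sketch follows; the exact prompt wording and the helper functions are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a Re2-style "re-reading" prompt: the question is presented
# twice before the answer is elicited. The wording is an assumption, not the
# authors' released prompt.
def build_re2_prompt(question: str) -> str:
    return (
        f"Q: {question}\n"
        f"Read the question again: {question}\n"
        "A: Let's think step by step."
    )

# Plain chain-of-thought prompt, for comparison.
def build_cot_prompt(question: str) -> str:
    return f"Q: {question}\nA: Let's think step by step."

if __name__ == "__main__":
    q = "A farmer has 17 sheep and all but 9 run away. How many are left?"
    print(build_re2_prompt(q))
```

The comparison function is included only to show that, per the abstract, Re2 leaves the answer-eliciting part (e.g., a chain-of-thought trigger) unchanged and modifies only how the input question is presented.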
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.