LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning
- URL: http://arxiv.org/abs/2602.23610v1
- Date: Fri, 27 Feb 2026 02:23:37 GMT
- Title: LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning
- Authors: Yu Zhu, Kai Yang
- Abstract summary: We develop a framework for task-oriented dialogues grounded in realistic reasoning scenarios. Our method generates dialogues grounded in authentic task scenarios, enriched with real-world information. The resulting dataset serves as a valuable benchmark for assessing and advancing the realistic logical reasoning capabilities of large language models.
- Score: 6.96644195073436
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The reasoning capability of large language models (LLMs), defined as their ability to analyze, infer, and make decisions based on input information, is essential for building intelligent task-oriented dialogue systems. However, existing benchmarks do not sufficiently reflect the complexity of real-world scenarios, which limits their effectiveness in evaluating and enhancing LLM reasoning in practical contexts. Many current reasoning datasets are overly simplistic and abstract, often disconnected from realistic task flows, domain constraints, and operational rules, making it difficult to effectively evaluate LLMs' logical reasoning ability. In addition, data contamination from pretraining corpora undermines the reliability of evaluation results, and traditional crowdsourcing methods for dataset construction are labor-intensive and difficult to scale. To address these challenges, we propose an LLM-driven framework for synthesizing multi-turn, task-oriented dialogues grounded in realistic reasoning scenarios, leveraging trilevel optimization to enhance dialogue quality. Our method generates dialogues grounded in authentic task scenarios, enriched with real-world information, and exhibiting strong contextual coherence. Corresponding reasoning tasks are carefully designed around these dialogues and iteratively refined to continuously improve the tasks' quality and challenge. The resulting dataset serves as a valuable benchmark for assessing and advancing the realistic logical reasoning capabilities of LLMs. Experimental results show that our synthetic data-based reasoning tasks introduce non-trivial reasoning challenges and provide meaningful support for improving the reasoning capabilities of LLMs.
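The abstract describes a pipeline of dialogue synthesis followed by iterative task refinement. The following is a minimal sketch of that general shape, not the paper's actual method: the `llm` function is a hypothetical stub standing in for a real LLM API client, and the prompts, the critique-then-revise loop, and the turn structure are all illustrative assumptions.

```python
def llm(prompt: str) -> str:
    """Stub standing in for a real LLM API call (hypothetical; swap in an actual client)."""
    return f"<completion of: {prompt[:40]}...>"

def synthesize_dialogue(scenario: str, n_turns: int = 3) -> list[tuple[str, str]]:
    """Generate a multi-turn task-oriented dialogue grounded in a scenario description."""
    dialogue = []
    context = scenario
    for turn in range(n_turns):
        # Alternate user and agent turns, carrying forward accumulated context.
        user = llm(f"As a user in scenario '{context}', write turn {turn}.")
        agent = llm(f"As a task-oriented agent, respond to: {user}")
        dialogue.append((user, agent))
        context += f" | {user} -> {agent}"
    return dialogue

def refine_task(dialogue: list[tuple[str, str]], rounds: int = 2) -> str:
    """Draft a reasoning task over the dialogue, then iteratively critique and revise it."""
    task = llm(f"Draft a reasoning question grounded in this dialogue: {dialogue}")
    for _ in range(rounds):
        critique = llm(f"Critique this task for difficulty and grounding: {task}")
        task = llm(f"Revise the task to address this critique: {critique}")
    return task

dialogue = synthesize_dialogue("book a multi-city flight under a budget", n_turns=3)
task = refine_task(dialogue, rounds=2)
```

With a real model behind `llm`, the critique/revise loop is where the paper's quality-improvement pressure would be applied; here it only illustrates the control flow.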
Related papers
- Pushing LLMs to Their Logical Reasoning Bound: The Role of Data Reasoning Intensity [59.27594125465172]
We introduce Data Reasoning Intensity (DRI), a novel metric that quantifies the latent logical reasoning complexity of samples. We then introduce a re-cognizing optimization strategy that systematically enhances the logical reasoning intensity of training data.
arXiv Detail & Related papers (2025-09-29T14:20:04Z)
- Multi-Turn Puzzles: Evaluating Interactive Reasoning and Strategic Dialogue in LLMs [21.192619293355502]
Large language models (LLMs) excel at solving problems with clear and complete statements, but often struggle with nuanced environments. This benchmark provides valuable insights into the strengths and weaknesses of current LLMs in handling complex, interactive scenarios.
arXiv Detail & Related papers (2025-08-13T19:14:45Z)
- Reasoning or Not? A Comprehensive Evaluation of Reasoning LLMs for Dialogue Summarization [9.994839971567542]
We present the first comprehensive and systematic evaluation of state-of-the-art reasoning LLMs and non-reasoning LLMs. Contrary to trends in other reasoning-intensive tasks, our findings show that explicit stepwise reasoning does not consistently improve dialogue summarization quality.
arXiv Detail & Related papers (2025-07-02T21:02:41Z)
- Rethinking the Unsolvable: When In-Context Search Meets Test-Time Scaling [1.219841051166348]
In this paper, we explore the combined potential of in-context search and test-time scaling on super hard reasoning tasks. We find that by applying advanced in-context search prompting to LLMs augmented with internal scaling, one can achieve transformative performance breakthroughs.
arXiv Detail & Related papers (2025-05-28T12:28:18Z)
- On the Eligibility of LLMs for Counterfactual Reasoning: A Decompositional Study [15.617243755155686]
Counterfactual reasoning has emerged as a crucial technique for generalizing the reasoning capabilities of large language models. We propose a decompositional strategy that breaks counterfactual generation down into causality construction followed by reasoning over counterfactual interventions.
arXiv Detail & Related papers (2025-05-17T04:59:32Z)
- Evaluating Large Language Models for Real-World Engineering Tasks [75.97299249823972]
This paper introduces a curated database comprising over 100 questions derived from authentic, production-oriented engineering scenarios. Using this dataset, we evaluate four state-of-the-art Large Language Models (LLMs). Our results show that LLMs demonstrate strengths in basic temporal and structural reasoning but struggle significantly with abstract reasoning, formal modeling, and context-sensitive engineering logic.
arXiv Detail & Related papers (2025-05-12T14:05:23Z)
- Enhancing Logical Reasoning in Large Language Models through Graph-based Synthetic Data [53.433309883370974]
This work explores the potential and limitations of using graph-based synthetic reasoning data as training signals to enhance Large Language Models' reasoning capabilities. Our experiments, conducted on two established natural language reasoning tasks, demonstrate that supervised fine-tuning with synthetic graph-based reasoning data effectively enhances LLMs' reasoning performance without compromising their effectiveness on other standard evaluation benchmarks.
arXiv Detail & Related papers (2024-09-19T03:39:09Z)
- Cognitive LLMs: Towards Integrating Cognitive Architectures and Large Language Models for Manufacturing Decision-making [51.737762570776006]
LLM-ACTR is a novel neuro-symbolic architecture that provides human-aligned and versatile decision-making.
Our framework extracts and embeds knowledge of ACT-R's internal decision-making process as latent neural representations.
Our experiments on novel Design for Manufacturing tasks show both improved task performance and improved grounded decision-making capability.
arXiv Detail & Related papers (2024-08-17T11:49:53Z)
- Can Long-Context Language Models Subsume Retrieval, RAG, SQL, and More? [54.667202878390526]
Long-context language models (LCLMs) have the potential to revolutionize our approach to tasks traditionally reliant on external tools like retrieval systems or databases.
We introduce LOFT, a benchmark of real-world tasks requiring context of up to millions of tokens, designed to evaluate LCLMs' performance on in-context retrieval and reasoning.
Our findings reveal LCLMs' surprising ability to rival state-of-the-art retrieval and RAG systems, despite never having been explicitly trained for these tasks.
arXiv Detail & Related papers (2024-06-19T00:28:58Z)
- K-Level Reasoning: Establishing Higher Order Beliefs in Large Language Models for Strategic Reasoning [76.3114831562989]
Strategic reasoning requires Large Language Model (LLM) agents to adapt their strategies dynamically in multi-agent environments.
We propose a novel framework: "K-Level Reasoning with Large Language Models (K-R)".
arXiv Detail & Related papers (2024-02-02T16:07:05Z)
- Improving Large Language Models in Event Relation Logical Prediction [33.88499005859982]
Event relation extraction is a challenging task that demands thorough semantic understanding and rigorous logical reasoning.
In this paper, we conduct an in-depth investigation to systematically explore the capability of LLMs in understanding and applying event relation logic.
Our study reveals that LLMs are not logically consistent reasoners, which results in their suboptimal performance on tasks that need rigorous reasoning.
arXiv Detail & Related papers (2023-10-13T14:53:06Z)
- Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models [61.28463542324576]
Vision-language models (VLMs) have recently demonstrated strong efficacy as visual assistants that can generate human-like outputs.
We evaluate existing state-of-the-art VLMs and find that even the best-performing model is unable to demonstrate strong visual reasoning capabilities and consistency.
We propose a two-stage training framework aimed at improving both the reasoning performance and consistency of VLMs.
arXiv Detail & Related papers (2023-09-08T17:49:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.