One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
- URL: http://arxiv.org/abs/2511.03508v1
- Date: Wed, 05 Nov 2025 14:39:59 GMT
- Title: One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
- Authors: Qi Jia, Kaiwei Zhang, Xiujie Song, Ye Shen, Xiangyang Zhu, Guangtao Zhai
- Abstract summary: Large language models can follow users' instructions throughout a dialogue spanning multiple topics. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. We propose a framework for assessing multi-turn instruction-following ability.
- Score: 51.50565654314582
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding how well large language models can follow users' instructions throughout a dialogue spanning multiple topics is of great importance for data-intensive conversational applications. Existing benchmarks are often limited to a fixed number of turns, making them susceptible to saturation and failing to account for the user's interactive experience. In this work, we propose an extensible framework for assessing multi-turn instruction-following ability. At its core, our framework decouples linguistic surface forms from user intent simulation through a three-layer mechanism that tracks constraints, instructions, and topics. This framework mimics User-LLM interaction by enabling the dynamic construction of benchmarks with state changes and tracebacks, terminating a conversation only when the model exhausts a simulated user's patience. We define a suite of metrics capturing the quality of the interaction process. Using this framework, we construct EvolIF, an evolving instruction-following benchmark incorporating nine distinct constraint types. Our results indicate that GPT-5 exhibits superior instruction-following performance. It sustains an average of 18.54 conversational turns and demonstrates 70.31% robustness, outperforming Gemini-2.5-Pro by a significant margin of 11.41%, while other models lag far behind. All of the data and code will be made publicly available online.
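To make the interaction loop concrete, below is a minimal Python sketch of how such a patience-bounded, state-tracking evaluation could be wired up. It is an illustration only: the names (`DialogueState`, `check_constraints`, `run_conversation`), the keyword-based verifier, the rule that a satisfied turn restores patience, and the robustness formula are all assumptions for the sketch, not the paper's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class DialogueState:
    """Tracks the evolving conversation: topic, active constraints, history."""
    topic: str
    constraints: list[str] = field(default_factory=list)
    history: list[tuple[str, str]] = field(default_factory=list)

def check_constraints(reply: str, constraints: list[str]) -> bool:
    # Placeholder verifier: the real framework covers nine distinct
    # constraint types; here we merely test for required keywords.
    return all(c.lower() in reply.lower() for c in constraints)

def run_conversation(model: Callable[[str, list], str],
                     turns_plan: list[dict],
                     patience_budget: int = 3) -> dict:
    """Drive one evolving conversation; end it only when the simulated
    user's patience is exhausted (hypothesized stopping rule)."""
    state = DialogueState(topic="seed")
    patience = patience_budget
    turns = violations = 0

    for turn in turns_plan:                     # simulated user intents
        state.constraints.extend(turn.get("new_constraints", []))
        reply = model(turn["instruction"], state.history)
        state.history.append((turn["instruction"], reply))
        turns += 1

        if check_constraints(reply, state.constraints):
            patience = patience_budget          # assume success resets patience
        else:
            violations += 1
            patience -= 1                       # each failed turn costs patience
            if patience == 0:                   # user gives up: stop here
                break

    # Assumed robustness metric: fraction of turns without a violation.
    robustness = 1.0 - violations / max(turns, 1)
    return {"turns_sustained": turns, "robustness": robustness}

# Usage with a stub model (any callable taking instruction + history works):
report = run_conversation(
    lambda instruction, history: "stub reply",
    [{"instruction": "Summarize the topic.", "new_constraints": ["bullet"]}],
)
```

The design point mirrored from the abstract is that the conversation has no fixed turn budget: it terminates only when accumulated failures exhaust the simulated user's patience, which is what allows the benchmark to keep evolving rather than saturating.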
Related papers
- FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback [92.67587639164908]
We present FronTalk, a benchmark for front-end code generation with multi-modal feedback. We focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues. Evaluation of 20 models reveals two key challenges that remain systematically under-explored in the literature.
arXiv Detail & Related papers (2025-12-05T23:28:09Z)
- RMTBench: Benchmarking LLMs Through Multi-Turn User-Centric Role-Playing [133.0641538589466]
RMTBench is a comprehensive user-centric bilingual role-playing benchmark featuring 80 diverse characters and over 8,000 dialogue rounds. Our benchmark constructs dialogues based on explicit user motivations rather than character descriptions, ensuring alignment with practical user applications. By shifting focus from character background to user intention fulfillment, RMTBench bridges the gap between academic evaluation and practical deployment requirements.
arXiv Detail & Related papers (2025-07-27T16:49:47Z)
- A Framework for Generating Conversational Recommendation Datasets from Behavioral Interactions [2.0693204407592836]
We present ConvRecStudio, a framework that simulates realistic, multi-turn dialogs grounded in timestamped user-item interactions and reviews. We apply ConvRecStudio to three domains -- MobileRec, Yelp, and Amazon Electronics -- producing over 12K multi-turn dialogs per dataset.
arXiv Detail & Related papers (2025-06-14T22:58:48Z)
- ConsistentChat: Building Skeleton-Guided Consistent Multi-Turn Dialogues for Large Language Models from Scratch [79.12929103519922]
Skeleton-Guided Multi-Turn Dialogue Generation constrains multi-turn instruction synthesis by explicitly modeling human intent. We construct ConsistentChat, a multi-turn instruction dataset with approximately 15,000 multi-turn conversations and 224,392 utterances. Experiments on the Light, Topdial, and MT-Eval benchmarks show that models fine-tuned on ConsistentChat achieve a 20-30% improvement in chat consistency and up to a 15% increase in task success rate.
arXiv Detail & Related papers (2025-06-04T04:21:48Z)
- A Multi-Dimensional Constraint Framework for Evaluating and Improving Instruction Following in Large Language Models [48.361839372110246]
We develop an automated instruction generation pipeline that performs constraint expansion, conflict detection, and instruction rewriting. We evaluate 19 large language models and uncover substantial variation in performance across constraint forms. In-depth analysis indicates that these gains stem primarily from modifications to the parameters of the model's attention modules.
arXiv Detail & Related papers (2025-05-12T14:16:55Z)
- Interaction2Code: Benchmarking MLLM-based Interactive Webpage Code Generation from Interactive Prototyping [57.024913536420264]
Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance on the design-to-code task. We present the first systematic investigation of MLLMs in generating interactive webpages.
arXiv Detail & Related papers (2024-11-05T17:40:03Z)
- MT-Eval: A Multi-Turn Capabilities Evaluation Benchmark for Large Language Models [70.92847554971065]
We introduce MT-Eval, a comprehensive benchmark designed to evaluate multi-turn conversational abilities.
By analyzing human-LLM conversations, we categorize interaction patterns into four types: recollection, expansion, refinement, and follow-up.
Our evaluation of 11 well-known LLMs shows that while closed-source models generally surpass open-source ones, certain open-source models exceed GPT-3.5-Turbo in specific tasks.
arXiv Detail & Related papers (2024-01-30T04:50:28Z)