TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
- URL: http://arxiv.org/abs/2511.15976v1
- Date: Thu, 20 Nov 2025 02:10:30 GMT
- Title: TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
- Authors: Sarik Ghazarian, Abhinav Gullapalli, Swair Shah, Anurag Beniwal, Nanyun Peng, Narayanan Sadagopan, Zhou Yu
- Abstract summary: In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions. Existing TOD benchmarks often oversimplify the complex nature of these instructions. We propose TOD-ProcBench, a benchmark featuring complex process instructions with intricate, fine-grained constraints.
- Score: 42.22263009001713
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In real-world task-oriented dialogue (TOD) settings, agents are required to strictly adhere to complex instructions while conducting multi-turn conversations with customers. These instructions are typically presented in natural language format and include general guidelines and step-by-step procedures with complex constraints. Existing TOD benchmarks often oversimplify the complex nature of these instructions by reducing them to simple schemas composed of intents, slots, and API call configurations. To address this gap and systematically benchmark LLMs' instruction-following capabilities, we propose TOD-ProcBench, a challenging benchmark featuring complex process instructions with intricate, fine-grained constraints that evaluates various LLMs' abilities to understand and follow instructions in multi-turn TODs. Our benchmark dataset comprises instruction documents derived from the high-quality ABCD dataset with corresponding conversations under human quality control. We formulate fine-grained constraints and action procedures as multi-level condition-action instruction statements. We design three tasks to comprehensively benchmark LLMs' complex instruction-following capabilities in multi-turn TODs. Task 1 evaluates how LLMs retrieve the most relevant statement from a complex instruction and predict the corresponding next action. In Task 2, we synthesize instruction-violating responses by injecting inconsistencies and manipulating the original instructions, and then we analyze how effectively LLMs can identify instruction-violating responses. Task 3 investigates LLMs' abilities in conditional generation of instruction-following responses based on the original complex instructions. Additionally, we conduct studies on the impact of multilingual settings and different instruction text formats on compliance performance. We release our benchmark under the Llama 3.3 Community License Agreement.
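To make the task setup concrete, below is a minimal Python sketch, based only on the abstract above, of how a multi-level condition-action instruction statement might be represented and how Task 1 (retrieving the most relevant statement and predicting the next action) could be scored. The statement format, field names, and the `model` callable are illustrative assumptions, not the released benchmark's API.

```python
# A minimal sketch (not the authors' released code) of Task 1 scoring:
# given a dialogue prefix, the model must pick the most relevant
# condition-action statement from the instruction document and predict
# the next agent action. All names and formats here are assumptions.
from dataclasses import dataclass

@dataclass
class ConditionActionStatement:
    """One fine-grained instruction statement, e.g. 'IF the customer requests
    a refund AND the order is older than 30 days THEN escalate to a manager'."""
    statement_id: str
    condition: str   # natural-language condition, possibly multi-level
    action: str      # the action the agent must take when the condition holds

def score_task1(example: dict, model) -> dict:
    """Exact-match scoring for statement retrieval and next-action prediction."""
    prompt = (
        "Instruction document:\n"
        + "\n".join(f"[{s.statement_id}] IF {s.condition} THEN {s.action}"
                    for s in example["statements"])
        + "\n\nConversation so far:\n" + example["dialogue_prefix"]
        + "\n\nReturn the id of the most relevant statement and the next action."
    )
    # `model` is a hypothetical callable returning (statement_id, action).
    predicted_id, predicted_action = model(prompt)
    return {
        "retrieval_correct": predicted_id == example["gold_statement_id"],
        "action_correct": predicted_action == example["gold_action"],
    }
```

Under this reading, Tasks 2 and 3 reuse the same statement inventory: Task 2 checks whether a model can flag responses that violate a statement, and Task 3 asks the model to generate a response that satisfies the retrieved statement.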
Related papers
- RELIC: Evaluating Compositional Instruction Following via Language Recognition [37.49115450182637]
Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition.
arXiv Detail & Related papers (2025-06-05T16:17:24Z)
- Constraint Back-translation Improves Complex Instruction Following of Large Language Models [55.60192044049083]
Large language models (LLMs) struggle to follow instructions with complex constraints in format, length, etc. Previous works conduct post-training on complex instruction-response pairs generated by feeding complex instructions to advanced LLMs. We propose a novel data generation technique, constraint back-translation (see the sketch after this list).
arXiv Detail & Related papers (2024-10-31T17:42:26Z)
- Benchmarking Complex Instruction-Following with Multiple Constraints Composition [72.82640456309821]
How to evaluate the ability of complex instruction-following of large language models (LLMs) has become a critical research problem.
Existing benchmarks mainly focus on modeling different types of constraints in human instructions while neglecting the composition of different constraints.
We propose ComplexBench, a benchmark for comprehensively evaluating the ability of LLMs to follow complex instructions composed of multiple constraints.
arXiv Detail & Related papers (2024-07-04T14:50:45Z)
- Benchmarking Large Language Models on Controllable Generation under Diversified Instructions [34.89012022437519]
Large language models (LLMs) have exhibited impressive instruction-following capabilities.
It is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions.
We propose a new benchmark CoDI-Eval to evaluate LLMs' responses to instructions with various constraints.
arXiv Detail & Related papers (2024-01-01T07:35:31Z)
- kNN-ICL: Compositional Task-Oriented Parsing Generalization with Nearest Neighbor In-Context Learning [50.40636157214161]
Task-Oriented Parsing (TOP) enables conversational assistants to interpret user commands expressed in natural language.
LLMs have achieved impressive performance in generating computer programs from natural-language prompts.
This paper focuses on harnessing the capabilities of LLMs for semantic parsing tasks.
arXiv Detail & Related papers (2023-12-17T17:26:50Z)
- Can Large Language Models Understand Real-World Complex Instructions? [54.86632921036983]
Large language models (LLMs) can understand human instructions, but struggle with complex instructions.
Existing benchmarks are insufficient to assess LLMs' ability to understand complex instructions.
We propose CELLO, a benchmark for evaluating LLMs' ability to follow complex instructions systematically.
arXiv Detail & Related papers (2023-09-17T04:18:39Z)
- CINS: Comprehensive Instruction for Few-shot Learning in Task-oriented Dialog Systems [56.302581679816775]
This paper proposes Comprehensive Instruction (CINS), which exploits pre-trained language models (PLMs) with task-specific instructions.
We design a schema (definition, constraint, prompt) of instructions and their customized realizations for three important downstream tasks in ToD.
Experiments are conducted on these ToD tasks in realistic few-shot learning scenarios with small validation data.
arXiv Detail & Related papers (2021-09-10T03:23:06Z)
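For the constraint back-translation entry above, here is a minimal sketch of the idea as we read it from that abstract: rather than asking an LLM to satisfy complex constraints when generating a response, take an existing high-quality instruction-response pair and ask the LLM which constraints the response already satisfies, then fold those constraints back into the instruction. The prompt wording and the `llm` callable are assumptions, not that paper's released pipeline.

```python
# A hedged sketch of constraint back-translation: augment an existing
# instruction with constraints its existing response already meets,
# yielding a complex instruction-following pair without generating any
# new response text. Prompt wording and the `llm` callable are assumed.
def back_translate_constraints(instruction: str, response: str, llm) -> str:
    """Return an augmented instruction whose added constraints the
    existing response already satisfies."""
    prompt = (
        "Given the instruction and its response below, list constraints "
        "(e.g. on format, length, tone) that the response already satisfies.\n\n"
        f"Instruction: {instruction}\nResponse: {response}\n\nConstraints:"
    )
    constraints = llm(prompt)  # hypothetical LLM call returning a string
    return f"{instruction}\n\nAdditional constraints:\n{constraints}"
```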
This list is automatically generated from the titles and abstracts of the papers on this site.