Evaluating LLMs' Reasoning Over Ordered Procedural Steps
- URL: http://arxiv.org/abs/2511.04688v1
- Date: Sat, 25 Oct 2025 23:37:00 GMT
- Title: Evaluating LLMs' Reasoning Over Ordered Procedural Steps
- Authors: Adrita Anika, Md Messal Monem Miah
- Abstract summary: Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). We study the task of reconstructing globally ordered sequences from shuffled procedural steps using a curated dataset of food recipes. We present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment.
- Score: 3.9261455058620083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reasoning over procedural sequences, where the order of steps directly impacts outcomes, is a critical capability for large language models (LLMs). In this work, we study the task of reconstructing globally ordered sequences from shuffled procedural steps, using a curated dataset of food recipes, a domain where correct sequencing is essential for task success. We evaluate several LLMs under zero-shot and few-shot settings and present a comprehensive evaluation framework that adapts established metrics from ranking and sequence alignment. These include Kendall's Tau, Normalized Longest Common Subsequence (NLCS), and Normalized Edit Distance (NED), which capture complementary aspects of ordering quality. Our analysis shows that model performance declines with increasing sequence length, reflecting the added complexity of longer procedures. We also find that greater step displacement in the input, corresponding to more severe shuffling, leads to further degradation. These findings highlight the limitations of current LLMs in procedural reasoning, especially with longer and more disordered inputs.
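The following is a minimal, self-contained sketch of the three ordering metrics named above, applied to a toy recipe whose predicted step order is compared against the gold order. The normalizations shown (tau over all step pairs, LCS length divided by sequence length, Levenshtein distance divided by the longer length) are common conventions and are not taken from the paper, so the exact values may differ from the authors' implementation.

```python
# Sketch of the ordering metrics named in the abstract: Kendall's Tau,
# Normalized LCS, and Normalized Edit Distance, compared on a toy recipe.
# The normalizations are common conventions, not necessarily the paper's.

def kendall_tau(gold, pred):
    """(concordant - discordant) pairs divided by total pairs; 1.0 = identical order."""
    pos = {step: i for i, step in enumerate(pred)}
    n = len(gold)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            if pos[gold[i]] < pos[gold[j]]:
                concordant += 1
            else:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

def normalized_lcs(gold, pred):
    """Longest common subsequence length divided by the gold length."""
    n, m = len(gold), len(pred)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n):
        for j in range(m):
            if gold[i] == pred[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    return dp[n][m] / n

def normalized_edit_distance(gold, pred):
    """Levenshtein distance over whole steps, divided by the longer length (lower is better)."""
    n, m = len(gold), len(pred)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if gold[i - 1] == pred[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1, dp[i - 1][j - 1] + cost)
    return dp[n][m] / max(n, m)

gold = ["preheat oven", "mix batter", "pour into pan", "bake", "cool"]
pred = ["mix batter", "preheat oven", "pour into pan", "cool", "bake"]
print(kendall_tau(gold, pred))               # 0.6
print(normalized_lcs(gold, pred))            # 0.6
print(normalized_edit_distance(gold, pred))  # 0.8
```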
Related papers
- Solving a Million-Step LLM Task with Zero Errors [13.911986576836568]
This paper describes MAKER, the first system that successfully solves a task with over one million LLM steps with zero errors. The results suggest that instead of relying on continual improvement of current LLMs, massively decomposed agentic processes (MDAPs) may provide a way to efficiently solve problems at the level of organizations and societies.
arXiv Detail & Related papers (2025-11-12T06:27:55Z)
- Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores. Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
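As a structural illustration of what such an uncertainty head could look like, here is a minimal sketch: a small probe that maps per-step hidden states to a scalar score. The layer sizes, which hidden states are used, and the training objective are assumptions for illustration, not the design from the paper.

```python
import torch
import torch.nn as nn

# Minimal sketch of a step-level uncertainty head: a small probe mapping the
# hidden state at each reasoning step (e.g. at the step's final token) to a
# score in [0, 1]. Architecture and inputs are illustrative assumptions.
class UncertaintyHead(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.probe = nn.Sequential(
            nn.Linear(hidden_size, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, step_hidden_states: torch.Tensor) -> torch.Tensor:
        # step_hidden_states: (num_steps, hidden_size)
        return torch.sigmoid(self.probe(step_hidden_states)).squeeze(-1)

head = UncertaintyHead(hidden_size=4096)
fake_states = torch.randn(5, 4096)  # stand-in for hidden states extracted from an LLM
print(head(fake_states))            # one score per reasoning step
```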
arXiv Detail & Related papers (2025-11-09T03:38:29Z)
- Iterative Self-Incentivization Empowers Large Language Models as Agentic Searchers [74.17516978246152]
Large language models (LLMs) have been widely integrated into information retrieval to advance traditional techniques. We propose EXSEARCH, an agentic search framework, where the LLM learns to retrieve useful information as the reasoning unfolds. Experiments on four knowledge-intensive benchmarks show that EXSEARCH substantially outperforms baselines.
arXiv Detail & Related papers (2025-05-26T15:27:55Z)
- Self-Calibrated Listwise Reranking with Large Language Models [137.6557607279876]
Large language models (LLMs) have been employed in reranking tasks through a sequence-to-sequence approach.
This reranking paradigm requires a sliding window strategy to iteratively handle larger candidate sets.
We propose a novel self-calibrated listwise reranking method, which aims to leverage LLMs to produce global relevance scores for ranking.
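For context, the sliding-window paradigm referred to above can be sketched as follows. The window size, stride, and the `llm_rank_window` placeholder (standing in for an actual listwise LLM call) are illustrative; this is not the paper's proposed method, which replaces the window loop with global relevance scores.

```python
# Sketch of the sliding-window listwise reranking loop mentioned above.
# `llm_rank_window` is a placeholder for a listwise LLM call that returns the
# window's passages ordered best-first; window size and stride are illustrative.

def llm_rank_window(query: str, window: list[str]) -> list[str]:
    # Placeholder: a real implementation would prompt an LLM with the query
    # and the candidate passages, then parse the returned ordering.
    return sorted(window)

def sliding_window_rerank(query, candidates, window_size=4, stride=2):
    ranked = list(candidates)
    # Walk from the bottom of the list to the top so strong candidates can
    # bubble up across overlapping windows.
    start = max(len(ranked) - window_size, 0)
    while True:
        ranked[start:start + window_size] = llm_rank_window(query, ranked[start:start + window_size])
        if start == 0:
            break
        start = max(start - stride, 0)
    return ranked

print(sliding_window_rerank("query", ["d3", "d1", "d4", "d2", "d5", "d0"]))
# ['d0', 'd1', 'd2', 'd3', 'd4', 'd5'] with the toy placeholder ranker
```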
arXiv Detail & Related papers (2024-11-07T10:31:31Z)
- Continual LLaVA: Continual Instruction Tuning in Large Vision-Language Models [93.5327725085853]
Continual LLaVA is a rehearsal-free method tailored for continual instruction tuning in LVLMs.
Experiments indicate that the proposed Continual LLaVA outperforms previous methods, significantly reducing forgetting during continual instruction tuning.
arXiv Detail & Related papers (2024-11-04T19:55:32Z)
- Zeroth-Order Fine-Tuning of LLMs in Random Subspaces [63.10833446782114]
As language models grow in size, memory demands for backpropagation increase. Zeroth-order (ZO) optimization methods offer a memory-efficient alternative. In this paper, we propose subspace zeroth-order optimization to address the challenges posed by high-dimensional perturbations.
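To make the idea concrete, here is a minimal sketch of a zeroth-order step whose perturbation is confined to a random low-dimensional subspace, estimated with a two-point finite difference. The subspace size, smoothing scale, learning rate, and toy objective are illustrative assumptions rather than the paper's configuration.

```python
import numpy as np

# Sketch of a zeroth-order update restricted to a random subspace: the
# perturbation lives in span(P) rather than the full parameter space, and the
# directional derivative is estimated from two forward evaluations only,
# so no backpropagation (and no activation storage) is needed.
def zo_subspace_step(loss_fn, theta, rng, subspace_dim=8, mu=1e-3, lr=1e-3):
    d = theta.size
    P = rng.standard_normal((d, subspace_dim)) / np.sqrt(subspace_dim)  # random basis
    u = P @ rng.standard_normal(subspace_dim)                           # perturbation direction
    g = (loss_fn(theta + mu * u) - loss_fn(theta - mu * u)) / (2 * mu)  # two-point estimate
    return theta - lr * g * u

rng = np.random.default_rng(0)
loss = lambda w: float(np.sum(w ** 2))   # toy quadratic objective
theta = np.ones(100)
for _ in range(500):
    theta = zo_subspace_step(loss, theta, rng)
print(loss(theta))                       # well below the starting loss of 100.0
```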
arXiv Detail & Related papers (2024-10-11T17:01:43Z)
- Instruction Position Matters in Sequence Generation with Large Language Models [67.87516654892343]
Large language models (LLMs) are capable of performing conditional sequence generation tasks, such as translation or summarization.
We propose enhancing the instruction-following capability of LLMs by shifting the position of task instructions after the input sentences.
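As a small illustration of the layout change, the sketch below contrasts an instruction-first prompt with one that places the instruction after the input sentence; the wording of the prompts is illustrative only, not taken from the paper.

```python
# Two prompt layouts for the same translation request.

source = "Der schnelle braune Fuchs springt über den faulen Hund."

# Instruction placed before the input (common default):
prompt_pre = f"Translate the following German sentence into English.\n{source}"

# Instruction placed after the input, as the paper advocates:
prompt_post = f"{source}\nTranslate the German sentence above into English."

print(prompt_pre, prompt_post, sep="\n---\n")
```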
arXiv Detail & Related papers (2023-08-23T12:36:57Z)
- STEPS: A Benchmark for Order Reasoning in Sequential Tasks [16.52934509949172]
We describe the data construction and task formulations, and benchmark most of the significant Large Language Models (LLMs).
The experimental results demonstrate that 1) commonsense reasoning about the order of actions in sequential tasks is challenging to resolve via zero-shot prompting or few-shot in-context learning.
arXiv Detail & Related papers (2023-06-07T13:58:55Z)
- A Simple Explanation for the Phase Transition in Large Language Models with List Decoding [3.898689841227059]
We show that large language models (LLMs) exhibit emergent abilities that are not present in small models.
We use a list decoder that keeps a list of candidate sequences at each step and defers the generation of the output sequence until the end.
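A minimal sketch of such a list decoder is shown below, using a toy prefix-conditioned token distribution in place of an LLM; the list size, toy probabilities, and scoring rule are illustrative assumptions rather than the paper's decoder.

```python
import math

# Sketch of a list decoder: keep the `list_size` highest-scoring candidate
# sequences at every step and only commit to an output at the end. The toy
# next-token model below is a stand-in for an LLM.

def toy_next_token_logprobs(prefix):
    # Placeholder language model over a two-token vocabulary. After a "b",
    # continuing with "b" becomes very likely.
    if not prefix:
        probs = {"a": 0.6, "b": 0.4}
    elif prefix[-1] == "b":
        probs = {"a": 0.05, "b": 0.95}
    else:
        probs = {"a": 0.5, "b": 0.5}
    return {tok: math.log(p) for tok, p in probs.items()}

def list_decode(list_size=3, max_len=3):
    candidates = [([], 0.0)]                  # (token sequence, log-probability)
    for _ in range(max_len):
        expanded = [
            (seq + [tok], score + lp)
            for seq, score in candidates
            for tok, lp in toy_next_token_logprobs(seq).items()
        ]
        # Keep a list of the best candidates; defer the final choice.
        candidates = sorted(expanded, key=lambda c: c[1], reverse=True)[:list_size]
    return max(candidates, key=lambda c: c[1])[0]

print(list_decode())  # ['b', 'b', 'b'] -- greedy decoding would commit to the locally likelier 'a' first
```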
arXiv Detail & Related papers (2023-03-23T09:00:07Z)
- MLE-guided parameter search for task loss minimization in neural sequence modeling [83.83249536279239]
Neural autoregressive sequence models are used to generate sequences in a variety of natural language processing (NLP) tasks.
We propose maximum likelihood guided parameter search (MGS), which samples from a distribution over update directions that is a mixture of random search around the current parameters and around the maximum likelihood gradient.
Our experiments show that MGS is capable of optimizing sequence-level losses, with substantial reductions in repetition and non-termination in sequence completion, and similar improvements to those of minimum risk training in machine translation.
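A heavily simplified sketch of the mixture-of-directions idea is given below: candidate updates are drawn either around zero (random search) or around the maximum likelihood gradient, scored with a stand-in task loss, and the best candidate is kept. The plain argmin selection, step sizes, and toy objective are illustrative simplifications; MGS's actual candidate weighting differs.

```python
import numpy as np

# Simplified sketch of sampling update directions from a mixture of random
# search around the current parameters and around the maximum likelihood
# gradient, then keeping the candidate with the lowest task loss.

def mgs_like_step(theta, mle_grad, task_loss, rng,
                  num_candidates=8, sigma=0.05, lr=0.1, mix=0.5):
    candidates = [theta]                            # current point as a fallback
    for _ in range(num_candidates):
        noise = sigma * rng.standard_normal(theta.shape)
        if rng.random() < mix:
            direction = noise                       # random-search component
        else:
            direction = -lr * mle_grad + noise      # MLE-gradient component
        candidates.append(theta + direction)
    return min(candidates, key=task_loss)

# Toy usage: a quadratic stand-in for the sequence-level task loss, with the
# "MLE gradient" taken to be the gradient of the same objective.
task_loss = lambda w: float(np.sum((w - 1.0) ** 2))
rng = np.random.default_rng(0)
theta = np.zeros(10)
for _ in range(50):
    theta = mgs_like_step(theta, 2 * (theta - 1.0), task_loss, rng)
print(task_loss(theta))  # far below the starting loss of 10.0
```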
arXiv Detail & Related papers (2020-06-04T22:21:22Z)