How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs
- URL: http://arxiv.org/abs/2602.08808v1
- Date: Mon, 09 Feb 2026 15:47:14 GMT
- Title: How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs
- Authors: Yapei Chang, Kyle Lo, Mohit Iyyer, Luca Soldaini
- Abstract summary: How2Everything is a framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics. How2Score is an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal.
- Score: 49.61011897610774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.
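The judging protocol described in the abstract (a binary check for any goal-preventing critical failure, reusable as an RL reward) lends itself to a compact sketch. The snippet below is illustrative only: the prompt wording, the `Judgment` type, and the `how2score_judge`/`how2score_reward` names are assumptions, not the paper's released implementation; `llm_complete` stands in for any text-in/text-out model call.

```python
# Hypothetical sketch of a How2Score-style protocol: an LLM judge reads a
# goal plus a generated procedure and returns a binary verdict on whether
# any critical failure would prevent achieving the goal. Prompt wording,
# names, and types are illustrative assumptions, not the paper's code.
from dataclasses import dataclass

JUDGE_PROMPT = """You are evaluating a step-by-step procedure.
Goal: {goal}
Procedure:
{procedure}

Does the procedure contain any critical failure (a wrong, missing, or
misordered step) that would prevent achieving the goal?
Answer with exactly one word: PASS or FAIL."""

@dataclass
class Judgment:
    passed: bool
    raw_response: str

def how2score_judge(llm_complete, goal: str, steps: list[str]) -> Judgment:
    """Score one generation; `llm_complete` is any text-in/text-out callable."""
    procedure = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    response = llm_complete(JUDGE_PROMPT.format(goal=goal, procedure=procedure))
    verdict = response.strip().upper()
    return Judgment(passed=verdict.startswith("PASS"), raw_response=verdict)

def how2score_reward(llm_complete, goal: str, steps: list[str]) -> float:
    """Binary RL reward derived from the judge verdict, as the abstract describes."""
    return 1.0 if how2score_judge(llm_complete, goal, steps).passed else 0.0
```

A binary reward of this shape is what makes the closed loop cheap to run: the same judge that scores How2Bench outputs can, unchanged, supply the scalar signal for RL fine-tuning.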
Related papers
- SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning [39.1720897614261]
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning. We propose SPARK, a three-stage framework in which a generator model first produces diverse solutions and a verifier model evaluates them. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision.
arXiv Detail & Related papers (2025-12-02T21:30:47Z)
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
- SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP). On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only ~16% of the training samples required by human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models [5.6525926183880255]
We introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task. In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds (a minimal sketch of this interaction loop appears after this list). TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains.
arXiv Detail & Related papers (2025-06-02T05:47:50Z)
- Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents [40.73340280747757]
The ReAct capability in large language models (LLMs) has become the foundation of modern agentic systems. We introduce Pre-Act, a novel approach that enhances the agent's performance by creating a multi-step execution plan. Our approach is applicable to both conversational and non-conversational agents.
arXiv Detail & Related papers (2025-05-15T05:17:47Z)
- SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning [99.645427839457]
Self-Play Critic (SPC) is a novel approach in which a critic model evolves its ability to assess reasoning steps through adversarial self-play games. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" and a "critic".
arXiv Detail & Related papers (2025-04-27T08:45:06Z)
- L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution [31.19899557805607]
Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps. We introduce L0-Bench, a language model benchmark for testing procedural correctness. L0-Bench grades models on their ability to generate step-by-step, error-free execution traces.
arXiv Detail & Related papers (2025-03-28T18:54:56Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that, for example, leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
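As a rough illustration of the interaction pattern the TurnBench entry above describes (sequential guesses, structured feedback, clues integrated across rounds), here is a minimal sketch. The hidden arithmetic rule, the (input, output) feedback format, the round budget, and the `linear_guesser` solver are all assumptions for illustration, not the benchmark's actual specification.

```python
# Minimal sketch of a TurnBench-style multi-turn episode: a solver makes
# sequential guesses at a hidden arithmetic rule, receiving one structured
# (input, output) clue per round. All specifics here are illustrative
# assumptions, not TurnBench's actual protocol.
import random

def hidden_rule(x: int) -> int:
    return 3 * x + 2  # the secret rule the solver must uncover

def run_episode(guess_rule, max_rounds: int = 5) -> bool:
    """`guess_rule` maps the clue history to a candidate rule (a callable)."""
    history: list[tuple[int, int]] = []
    for _ in range(max_rounds):
        probe = random.randint(-10, 10)
        history.append((probe, hidden_rule(probe)))   # structured feedback
        candidate = guess_rule(history)               # integrate clues so far
        if all(candidate(v) == hidden_rule(v) for v in range(-20, 21)):
            return True  # rule uncovered within the round budget
    return False

def linear_guesser(history):
    """Toy solver: fit a line through the first two distinct clues."""
    distinct = {x: y for x, y in history}  # drop repeated probes
    if len(distinct) < 2:
        return lambda x: 0  # not enough clues yet; guess a constant
    (x1, y1), (x2, y2) = list(distinct.items())[:2]
    a = (y2 - y1) // (x2 - x1)
    b = y1 - a * x1
    return lambda x: a * x + b

print(run_episode(linear_guesser))  # True once two distinct clues are seen
```

The point of the loop structure is that correctness is checked per round against the hidden rule, so the episode rewards integrating feedback rather than one-shot guessing.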