How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs
- URL: http://arxiv.org/abs/2602.08808v1
- Date: Mon, 09 Feb 2026 15:47:14 GMT
- Title: How2Everything: Mining the Web for How-To Procedures to Evaluate and Improve LLMs
- Authors: Yapei Chang, Kyle Lo, Mohit Iyyer, Luca Soldaini
- Abstract summary: How2Everything is a framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics. How2Score is an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal.
- Score: 49.61011897610774
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating step-by-step "how-to" procedures is a key LLM capability: how-to advice is commonly requested in chatbots, and step-by-step planning is critical for reasoning over complex tasks. Yet, measuring and improving procedural validity at scale on real-world tasks remains challenging and understudied. To address this, we introduce How2Everything, a scalable framework to evaluate and improve goal-conditioned procedure generation. Our framework includes How2Mine, which mines 351K procedures from 980K web pages across 14 topics and readily scales to larger corpora. From this pool we build How2Bench, a 7K-example evaluation set balanced across topics. To reliably score model outputs, we develop How2Score, an evaluation protocol that uses an LLM judge to detect whether a generation contains any critical failure that would prevent achieving the goal. For low-cost, reproducible evaluation, we distill a frontier model into an open 8B model, achieving 80.5% agreement with human annotators. How2Bench reveals clear scaling trends across model sizes and training stages, providing signal early in pretraining. Finally, RL using How2Score as a reward improves performance on How2Bench by >10 points across three models without systematic regressions on standard benchmarks, with gains robust to superficial source-document memorization or format compliance. Taken together, How2Everything shows how pretraining web data can support a closed loop of capability evaluation and improvement at scale.
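The judging protocol described in the abstract (a binary check for any goal-preventing critical failure, reusable as an RL reward) lends itself to a compact sketch. The snippet below is illustrative only: the prompt wording, the `Judgment` type, and the `how2score_judge`/`how2score_reward` names are assumptions, not the paper's released implementation; `llm_complete` stands in for any text-in/text-out model call.

```python
# Hypothetical sketch of a How2Score-style protocol: an LLM judge reads a
# goal plus a generated procedure and returns a binary verdict on whether
# any critical failure would prevent achieving the goal. Prompt wording,
# names, and types are illustrative assumptions, not the paper's code.
from dataclasses import dataclass

JUDGE_PROMPT = """You are evaluating a step-by-step procedure.
Goal: {goal}
Procedure:
{procedure}

Does the procedure contain any critical failure (a wrong, missing, or
misordered step) that would prevent achieving the goal?
Answer with exactly one word: PASS or FAIL."""

@dataclass
class Judgment:
    passed: bool
    raw_response: str

def how2score_judge(llm_complete, goal: str, steps: list[str]) -> Judgment:
    """Score one generation; `llm_complete` is any text-in/text-out callable."""
    procedure = "\n".join(f"{i + 1}. {s}" for i, s in enumerate(steps))
    response = llm_complete(JUDGE_PROMPT.format(goal=goal, procedure=procedure))
    verdict = response.strip().upper()
    return Judgment(passed=verdict.startswith("PASS"), raw_response=verdict)

def how2score_reward(llm_complete, goal: str, steps: list[str]) -> float:
    """Binary RL reward derived from the judge verdict, as the abstract describes."""
    return 1.0 if how2score_judge(llm_complete, goal, steps).passed else 0.0
```

A binary reward of this shape is what makes the closed loop cheap to run: the same judge that scores How2Bench outputs can, unchanged, supply the scalar signal for RL fine-tuning.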
Related papers
- SPARK: Stepwise Process-Aware Rewards for Reference-Free Reinforcement Learning [39.1720897614261]
Process reward models (PRMs) that provide dense, step-level feedback have shown promise for reinforcement learning. We propose SPARK, a three-stage framework in which a generator model first produces diverse solutions and a verifier model evaluates them. We show that aggregating multiple independent verifications at the step level produces training data for process reward models that surpass ground-truth outcome supervision.
arXiv Detail & Related papers (2025-12-02T21:30:47Z)
- Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z)
- SPARE: Single-Pass Annotation with Reference-Guided Evaluation for Automatic Process Supervision and Reward Modelling [58.05959902776133]
We introduce Single-Pass Annotation with Reference-Guided Evaluation (SPARE), a novel structured framework that enables efficient per-step annotation. We demonstrate SPARE's effectiveness across four diverse datasets spanning mathematical reasoning (GSM8K, MATH), multi-hop question answering (MuSiQue-Ans), and spatial reasoning (SpaRP). On ProcessBench, SPARE demonstrates data-efficient out-of-distribution generalization, using only ~16% of the training samples required by human-labeled and other synthetically trained baselines.
arXiv Detail & Related papers (2025-06-18T14:37:59Z)
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models [5.6525926183880255]
We introduce TurnBench, a novel benchmark that evaluates multi-turn, multi-step reasoning through an interactive code-breaking task. In each episode, a model must uncover hidden logical or arithmetic rules by making sequential guesses, receiving structured feedback, and integrating clues across multiple rounds (a minimal sketch of this interaction loop appears after this list). TurnBench includes two modes: Classic, which tests standard reasoning, and Nightmare, which introduces increased complexity and requires robust inferential chains.
arXiv Detail & Related papers (2025-06-02T05:47:50Z)
- Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents [40.73340280747757]
The ReAct capability in large language models (LLMs) has become the foundation of modern agentic systems. We introduce Pre-Act, a novel approach that enhances the agent's performance by creating a multi-step execution plan. Our approach is applicable to both conversational and non-conversational agents.
arXiv Detail & Related papers (2025-05-15T05:17:47Z)
- SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning [99.645427839457]
Self-Play Critic (SPC) is a novel approach in which a critic model evolves its ability to assess reasoning steps through adversarial self-play games. SPC involves fine-tuning two copies of a base model to play two roles, namely a "sneaky generator" and a "critic".
arXiv Detail & Related papers (2025-04-27T08:45:06Z)
- L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution [31.19899557805607]
Complex reasoning tasks often rely on the ability to consistently and accurately apply simple rules across incremental steps. We introduce L0-Bench, a language model benchmark for testing procedural correctness. L0-Bench grades models on their ability to generate step-by-step, error-free execution traces.
arXiv Detail & Related papers (2025-03-28T18:54:56Z)
- QualEval: Qualitative Evaluation for Model Improvement [82.73561470966658]
We propose QualEval, which augments quantitative scalar metrics with automated qualitative evaluation as a vehicle for model improvement.
QualEval uses a powerful LLM reasoner and our novel flexible linear programming solver to generate human-readable insights.
We demonstrate that, for example, leveraging its insights improves the absolute performance of the Llama 2 model by up to 15 percentage points.
arXiv Detail & Related papers (2023-11-06T00:21:44Z)
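As a rough illustration of the interaction pattern the TurnBench entry above describes (sequential guesses, structured feedback, clues integrated across rounds), here is a minimal sketch. The hidden arithmetic rule, the (input, output) feedback format, the round budget, and the `linear_guesser` solver are all assumptions for illustration, not the benchmark's actual specification.

```python
# Minimal sketch of a TurnBench-style multi-turn episode: a solver makes
# sequential guesses at a hidden arithmetic rule, receiving one structured
# (input, output) clue per round. All specifics here are illustrative
# assumptions, not TurnBench's actual protocol.
import random

def hidden_rule(x: int) -> int:
    return 3 * x + 2  # the secret rule the solver must uncover

def run_episode(guess_rule, max_rounds: int = 5) -> bool:
    """`guess_rule` maps the clue history to a candidate rule (a callable)."""
    history: list[tuple[int, int]] = []
    for _ in range(max_rounds):
        probe = random.randint(-10, 10)
        history.append((probe, hidden_rule(probe)))   # structured feedback
        candidate = guess_rule(history)               # integrate clues so far
        if all(candidate(v) == hidden_rule(v) for v in range(-20, 21)):
            return True  # rule uncovered within the round budget
    return False

def linear_guesser(history):
    """Toy solver: fit a line through the first two distinct clues."""
    distinct = {x: y for x, y in history}  # drop repeated probes
    if len(distinct) < 2:
        return lambda x: 0  # not enough clues yet; guess a constant
    (x1, y1), (x2, y2) = list(distinct.items())[:2]
    a = (y2 - y1) // (x2 - x1)
    b = y1 - a * x1
    return lambda x: a * x + b

print(run_episode(linear_guesser))  # True once two distinct clues are seen
```

The point of the loop structure is that correctness is checked per round against the hidden rule, so the episode rewards integrating feedback rather than one-shot guessing.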