Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling
- URL: http://arxiv.org/abs/2602.06795v1
- Date: Fri, 06 Feb 2026 15:51:52 GMT
- Title: Generating Data-Driven Reasoning Rubrics for Domain-Adaptive Reward Modeling
- Authors: Kate Sanders, Nathaniel Weir, Sapana Chaudhary, Kaj Bostrom, Huzefa Rangwala
- Abstract summary: We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies. These rubrics can be used to build stronger LLM-as-judge reward functions. This extension opens the door for teaching models to solve complex technical problems without a full dataset of gold labels.
- Score: 21.45871501724415
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: An impediment to using Large Language Models (LLMs) for reasoning output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, domains requiring expert knowledge, and problems without verifiable rewards. We propose a data-driven approach to automatically construct highly granular reasoning error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches that leverage these error taxonomies, or "rubrics", demonstrate strong error identification compared to baseline methods in technical domains like coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for reasoning model training via reinforcement learning. Experimental results show that these rewards have the potential to improve models' task accuracy on difficult domains over models trained by general LLMs-as-judges by +45%, and approach performance of models trained by verifiable rewards while using as little as 20% as many gold labels. Through our approach, we extend the usage of reward rubrics from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via RLVR rewards. This extension opens the door for teaching models to solve complex technical problems without a full dataset of gold labels, which are often highly costly to procure.
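To make the rubric-as-reward idea concrete, here is a minimal sketch (not the paper's implementation) of an LLM-as-judge reward built from an error-taxonomy rubric: the judge checks a reasoning trace against each rubric criterion, and the reward is the fraction of checks passed. The example rubric and the `judge` callable are illustrative assumptions.

```python
# Minimal sketch of a rubric-based LLM-as-judge reward. The rubric
# entries and the `judge` interface are illustrative assumptions,
# not the paper's actual taxonomy or prompts.

# Hypothetical rubric mined from past model errors in a target domain.
CHEM_ENG_RUBRIC = [
    "Uses consistent units throughout the mass/energy balance",
    "Does not drop terms when rearranging equations",
    "Applies the correct correlation for the stated flow regime",
]

def rubric_reward(trace: str, rubric: list[str], judge) -> float:
    """Score a reasoning trace as the fraction of rubric checks passed.

    `judge` is any callable mapping a yes/no question about the trace
    to a boolean verdict (e.g., a wrapped LLM call).
    """
    passed = 0
    for criterion in rubric:
        question = (
            "Does the following reasoning trace satisfy this criterion?\n"
            f"Criterion: {criterion}\nTrace: {trace}\nAnswer yes or no."
        )
        if judge(question):
            passed += 1
    return passed / len(rubric)
```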
Related papers
- Integrating Domain Knowledge for Financial QA: A Multi-Retriever RAG Approach with LLMs [13.368251290146794]
We implement a multi-retriever Retrieval Augmented Generation (RAG) system to retrieve both external domain knowledge and internal question contexts. We find that domain-specific training with the SecBERT encoder significantly contributes to our best neural symbolic model.
arXiv Detail & Related papers (2025-12-29T20:24:15Z)
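A minimal sketch of the multi-retriever idea summarized above, under assumed interfaces (the paper's actual components, such as its SecBERT-based retriever, are not reproduced here): candidates from several retrievers are pooled and deduplicated before being handed to the generator.

```python
# Multi-retriever RAG sketch: pool candidates from several retrievers,
# deduplicate, and build one augmented prompt. The Retriever and
# generator interfaces are hypothetical stand-ins.

from typing import Callable

Retriever = Callable[[str], list[str]]  # query -> ranked passages

def multi_retrieve(query: str, retrievers: list[Retriever], k: int = 5) -> list[str]:
    """Merge the top-k passages from every retriever, preserving first-seen order."""
    seen, merged = set(), []
    for retrieve in retrievers:
        for passage in retrieve(query)[:k]:
            if passage not in seen:
                seen.add(passage)
                merged.append(passage)
    return merged

def answer(query: str, retrievers: list[Retriever], generate: Callable[[str], str]) -> str:
    """Build one context-augmented prompt and hand it to the generator."""
    context = "\n".join(multi_retrieve(query, retrievers))
    return generate(f"Context:\n{context}\n\nQuestion: {query}")
```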
- Multimodal Reinforcement Learning with Agentic Verifier for AI Agents [131.46008226323423]
Argos is a principled multimodal reward agent to train reasoning models for agentic tasks. By leveraging our agentic verifier across both SFT data and RL training, our model achieves state-of-the-art results.
arXiv Detail & Related papers (2025-12-03T04:42:47Z) - CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment [44.33395106709674]
- CAPO: Towards Enhancing LLM Reasoning through Generative Credit Assignment [44.33395106709674]
Reinforcement Learning with Verifiable Rewards (RLVR) has improved the reasoning abilities of Large Language Models (LLMs) by using rule-based binary feedback. Current RLVR methods typically assign the same reward to every token. This coarse-grained feedback hampers precise credit assignment, making it hard for models to identify which reasoning steps lead to success or failure.
arXiv Detail & Related papers (2025-08-04T11:06:08Z)
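A toy illustration of the credit-assignment gap the CAPO summary describes, not the paper's algorithm: broadcasting one sequence-level reward to every step versus letting a hypothetical per-step verdict assign finer-grained credit.

```python
# Toy contrast between sequence-level and step-level credit assignment.
# `step_correct` is a hypothetical per-step verdict (e.g., from a
# generative judge); it is NOT the CAPO algorithm itself.

def uniform_credit(steps: list[str], outcome_reward: float) -> list[float]:
    """Every step inherits the same sequence-level reward (coarse RLVR)."""
    return [outcome_reward] * len(steps)

def stepwise_credit(steps: list[str], step_correct) -> list[float]:
    """Each step is rewarded by its own verdict, enabling finer credit."""
    return [1.0 if step_correct(s) else -1.0 for s in steps]
```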
- One Token to Fool LLM-as-a-Judge [52.45386385722788]
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models. We uncover a critical vulnerability even in this reference-based paradigm: generative reward models are systematically susceptible to reward hacking.
arXiv Detail & Related papers (2025-07-11T17:55:22Z)
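A hedged sketch of how one might probe a generative reward model for the superficial-response vulnerability described above; the probe strings and `judge_score` interface are illustrative, not taken from the paper.

```python
# Probe a generative reward model with contentless responses. If any
# probe scores near a genuine solution, the judge is likely hackable.
# The probe strings and `judge_score` hook are illustrative assumptions.

PROBE_RESPONSES = ["Solution:", "Let's think step by step.", "Answer:"]

def audit_judge(question: str, genuine_answer: str, judge_score) -> list[tuple[str, float]]:
    """Return probes whose score comes within 10% of the genuine answer's."""
    baseline = judge_score(question, genuine_answer)
    suspicious = []
    for probe in PROBE_RESPONSES:
        score = judge_score(question, probe)
        if score >= 0.9 * baseline:
            suspicious.append((probe, score))
    return suspicious
```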
- Critical Tokens Matter: Token-Level Contrastive Estimation Enhances LLM's Reasoning Capability [53.51560766150442]
Critical tokens are elements within reasoning trajectories that significantly influence incorrect outcomes. We present a novel framework for identifying these tokens through rollout sampling. We show that identifying and replacing critical tokens significantly improves model accuracy.
arXiv Detail & Related papers (2024-11-29T18:58:22Z)
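A rough sketch of rollout-based token criticality, under my assumption (not necessarily the paper's exact recipe) that a token is critical when rollouts sampled after committing to it succeed much less often than rollouts sampled just before it.

```python
# Estimate per-position criticality by comparing rollout success rates
# before vs. after committing to each token. `sample_completions` and
# `is_correct` are assumed hooks, not the paper's implementation.

def token_criticality(prompt, tokens, sample_completions, is_correct, n=16):
    """Return the drop in rollout success rate at each token position."""
    scores = []
    for i in range(1, len(tokens) + 1):
        before = prompt + "".join(tokens[: i - 1])
        after = prompt + "".join(tokens[:i])
        rate_before = sum(is_correct(c) for c in sample_completions(before, n)) / n
        rate_after = sum(is_correct(c) for c in sample_completions(after, n)) / n
        scores.append(rate_before - rate_after)  # large drop => critical token
    return scores
```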
- Are LLMs Better than Reported? Detecting Label Errors and Mitigating Their Effect on Model Performance [28.524573212179124]
Large language models (LLMs) offer new opportunities to enhance the annotation process. We compare expert, crowd-sourced, and LLM-based annotations in terms of agreement, label quality, and efficiency. Our findings reveal a substantial number of label errors which, when corrected, produce a significant upward shift in reported model performance.
arXiv Detail & Related papers (2024-10-24T16:27:03Z)
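A small sketch in the spirit of the comparison above: flag items where an LLM annotator disagrees with the stored gold label as candidate label errors for human review. The flagging rule and `llm_label` hook are my assumptions, not the paper's exact procedure.

```python
# Flag examples whose stored gold label disagrees with an LLM
# annotator's label; flagged items go to human review. `llm_label` is
# an assumed wrapper around an annotation prompt.

def flag_label_errors(dataset, llm_label):
    """Yield (index, gold, llm) triples for disagreements worth reviewing."""
    for i, (text, gold) in enumerate(dataset):
        predicted = llm_label(text)
        if predicted != gold:
            yield i, gold, predicted

# Usage: candidates = list(flag_label_errors(train_set, llm_label))
```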
- Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE). RISE injects predefined subtle errors into pivotal tokens in reasoning or computation steps to construct hard pairs for error mitigation. Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z)
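A simplified sketch of building a preference pair by injecting one subtle error into a correct solution, loosely following the RISE summary; the sign-flip perturbation is an illustrative stand-in for the paper's token-level edits.

```python
# Build a (chosen, rejected) preference pair by injecting one subtle
# arithmetic error into a correct chain of thought. The sign-flip
# perturbation is an illustrative stand-in for RISE's edits.

import re

def inject_sign_error(solution: str) -> str:
    """Flip the first '+' into '-' to create a hard negative."""
    return re.sub(r"\+", "-", solution, count=1)

def make_preference_pair(solution: str) -> dict:
    """Pair the original solution with its subtly corrupted variant."""
    return {"chosen": solution, "rejected": inject_sign_error(solution)}

# Usage: pair = make_preference_pair("3 + 4 = 7, so the total is 7.")
```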
- Improving LLM Reasoning through Scaling Inference Computation with Collaborative Verification [52.095460362197336]
Large language models (LLMs) struggle with consistent and accurate reasoning.
LLMs are trained primarily on correct solutions, reducing their ability to detect and learn from errors.
We propose a novel collaborative method integrating Chain-of-Thought (CoT) and Program-of-Thought (PoT) solutions for verification.
arXiv Detail & Related papers (2024-10-05T05:21:48Z)
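A compact sketch of the CoT/PoT cross-check suggested above: accept an answer only when the prose-derived result matches the executed program's result. The `extract_answer` and sandboxed `run_program` hooks are assumptions, not the paper's pipeline.

```python
# Cross-check a natural-language (CoT) answer against a program (PoT)
# answer; accept only when both routes agree. `extract_answer` and the
# sandboxed `run_program` hook are assumed helpers.

def collaborative_verify(cot_text: str, pot_code: str,
                         extract_answer, run_program) -> tuple[bool, object]:
    """Return (verified, answer): verified iff the CoT and PoT answers agree."""
    cot_answer = extract_answer(cot_text)  # parse the final answer from prose
    pot_answer = run_program(pot_code)     # execute the program in a sandbox
    return cot_answer == pot_answer, cot_answer
```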
"LLMs-as-Instructors" framework autonomously enhances the training of smaller target models.
Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model.
Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z)
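A hedged sketch of one instructor round as described above: collect the target model's wrong answers, have an instructor LLM analyze each error and emit a tailored practice example, then fine-tune on the result. The hooks and prompt are illustrative assumptions, not the paper's API.

```python
# One round of an instructor loop: the instructor LLM analyzes each of
# the target model's errors and emits a tailored training example.
# `instructor` and `fine_tune` are assumed hooks, not the paper's API.

def instructor_round(eval_set, target_model, instructor, fine_tune):
    """Collect error analyses from the instructor and fine-tune on them."""
    new_examples = []
    for question, gold in eval_set:
        answer = target_model(question)
        if answer != gold:  # focus on errors ("Learning from Error")
            analysis = instructor(
                f"Question: {question}\nWrong answer: {answer}\n"
                f"Correct answer: {gold}\n"
                "Explain the mistake and write one similar practice item."
            )
            new_examples.append(analysis)
    return fine_tune(target_model, new_examples)
```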
- Investigating Automatic Scoring and Feedback using Large Language Models [46.1232919707345]
This paper explores the efficacy of PEFT-based quantized models, employing a classification or regression head, to fine-tune language models for automatic grading and feedback generation.
The results show that grade predictions from fine-tuned LLMs are highly accurate, achieving less than 3% error in grade percentage on average.
arXiv Detail & Related papers (2024-05-01T16:13:54Z) - Enhancing Large Language Model Performance To Answer Questions and
- Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions, but their effectiveness is often hindered by sub-optimal answer quality and occasional failures to respond accurately.
To address these challenges, a fine-tuning process involving feedback and examples is employed to refine the models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z)
- OVM, Outcome-supervised Value Models for Planning in Mathematical Reasoning [15.59540726867483]
We argue that in guided decoding, assessing the potential of an incomplete reasoning path can be more advantageous than simply ensuring per-step correctness.
Inspired by the finding that outcome supervision for guided decoding essentially acts as a value model, we propose the Outcome-supervised Value Model (OVM).
Our experiments on two multi-step mathematical reasoning datasets, GSM8K and Game of 24, demonstrate the superior performance of the OVM model.
arXiv Detail & Related papers (2023-11-16T09:56:28Z)
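A schematic of value-guided decoding in the OVM spirit: at each step, expand partial solutions and keep those the value model scores highest. The `expand` and `value` hooks are assumptions, not the paper's implementation.

```python
# Value-guided step-level beam search: expand each partial solution by
# one reasoning step, then keep the `beam` candidates the value model
# rates highest. `expand` proposes next steps; `value` scores prefixes.

def ovm_guided_decode(question, expand, value, steps=8, beam=4):
    paths = [""]  # partial reasoning paths
    for _ in range(steps):
        candidates = [p + s for p in paths for s in expand(question, p)]
        if not candidates:
            break
        candidates.sort(key=lambda p: value(question, p), reverse=True)
        paths = candidates[:beam]  # keep the most promising prefixes
    return paths[0]
```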