A Rubric-Supervised Critic from Sparse Real-World Outcomes
- URL: http://arxiv.org/abs/2603.03800v1
- Date: Wed, 04 Mar 2026 07:23:54 GMT
- Title: A Rubric-Supervised Critic from Sparse Real-World Outcomes
- Authors: Xingyao Wang, Valerie Chen, Heng Ji, Graham Neubig
- Abstract summary: Real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. We propose a process for learning a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for either RL-based training or inference-time scaling.
- Score: 87.11204512676193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Academic benchmarks for coding agents tend to reward autonomous task completion, measured by verifiable rewards such as unit-test success. In contrast, real-world coding agents operate with humans in the loop, where success signals are typically noisy, delayed, and sparse. How can we bridge this gap? In this paper, we propose a process for learning a "critic" model from sparse and noisy interaction data, which can then be used as a reward model for either RL-based training or inference-time scaling. Specifically, we introduce Critic Rubrics, a rubric-based supervision framework with 24 behavioral features that can be derived from human-agent interaction traces alone. Using a semi-supervised objective, we jointly predict these rubrics and sparse human feedback (when present). In experiments, we demonstrate that, despite being trained primarily on trace-observable rubrics and sparse real-world outcome proxies, these critics improve best-of-N reranking on SWE-bench (Best@8 +15.9 over Random@8 on the rerankable subset of trajectories), enable early stopping (+17.7 with 83% fewer attempts), and support training-time data curation via critic-selected trajectories.
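The abstract describes two mechanisms concretely enough to sketch: a semi-supervised objective that jointly predicts the 24 trace-derived rubrics and the sparse human-feedback outcome, and critic-based best-of-N reranking with early stopping. Below is a minimal PyTorch sketch; the class and head names (`CriticSketch`, `rubric_head`, `outcome_head`), the binary rubric labels, the unweighted sum of losses, and the early-stop threshold are all assumptions for illustration, not the paper's actual architecture or objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CriticSketch(nn.Module):
    """Illustrative critic: maps a trajectory embedding to 24 rubric
    predictions plus a scalar success score (all names hypothetical)."""
    def __init__(self, hidden_dim: int = 768, num_rubrics: int = 24):
        super().__init__()
        self.rubric_head = nn.Linear(hidden_dim, num_rubrics)
        self.outcome_head = nn.Linear(hidden_dim, 1)

    def forward(self, trajectory_emb: torch.Tensor):
        rubrics = self.rubric_head(trajectory_emb)               # (batch, 24)
        outcome = self.outcome_head(trajectory_emb).squeeze(-1)  # (batch,)
        return rubrics, outcome

def semi_supervised_loss(rubrics, outcome, rubric_labels, outcome_labels, outcome_mask):
    """Rubric labels exist for every trace; outcome labels are sparse,
    so their loss term is masked out wherever no human feedback exists."""
    l_rubric = F.binary_cross_entropy_with_logits(rubrics, rubric_labels)
    l_outcome = F.binary_cross_entropy_with_logits(outcome, outcome_labels, reduction="none")
    l_outcome = (l_outcome * outcome_mask).sum() / outcome_mask.sum().clamp(min=1.0)
    return l_rubric + l_outcome

def best_of_n(critic: CriticSketch, candidate_embs: torch.Tensor, stop_threshold: float = 0.9):
    """Best-of-N reranking with early stopping: scan candidates in order,
    return as soon as the predicted success probability clears a
    (hypothetical) threshold, otherwise keep the best-scoring one."""
    best_i, best_p = 0, -1.0
    with torch.no_grad():
        for i, emb in enumerate(candidate_embs):
            _, score = critic(emb.unsqueeze(0))
            p = torch.sigmoid(score).item()
            if p >= stop_threshold:
                return i                 # early stop saves the remaining attempts
            if p > best_p:
                best_i, best_p = i, p
    return best_i
```

The mask in `semi_supervised_loss` is the key move: it lets every trajectory contribute rubric gradients while only the small labeled subset contributes outcome gradients, which is how training can proceed when human feedback is noisy, delayed, and sparse.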
Related papers
- Reward Modeling from Natural Language Human Feedback [77.75758630455357]
Reinforcement Learning with Verifiable Reward (RLVR) on preference data has become the mainstream approach for training Generative Reward Models (GRMs). In this paper, we demonstrate that such binary classification tasks make GRMs susceptible to guessing correct outcomes without sound critiques. We propose Reward Modeling from Natural Language Human Feedback (RM-NLHF), which leverages natural language feedback to obtain process reward signals.
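The contrast drawn above, binary preference classification versus process rewards derived from natural language feedback, can be made concrete with a small sketch. The module below is illustrative only and assumes the feedback has already been embedded per reasoning step; it is not RM-NLHF's architecture.

```python
import torch
import torch.nn as nn

class CritiqueThenScore(nn.Module):
    """Instead of a single binary preferred/rejected head, read
    natural-language feedback step by step and emit one reward per
    reasoning step (a process reward). Shapes and names are assumptions."""
    def __init__(self, hidden: int = 512):
        super().__init__()
        self.encode_feedback = nn.GRU(hidden, hidden, batch_first=True)
        self.step_reward = nn.Linear(hidden, 1)

    def forward(self, feedback_emb: torch.Tensor):
        # feedback_emb: (batch, steps, hidden) embeddings of NL feedback
        # aligned to the response's reasoning steps
        h, _ = self.encode_feedback(feedback_emb)
        return self.step_reward(h).squeeze(-1)  # (batch, steps) process rewards
```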
arXiv Detail & Related papers (2026-01-12T09:23:43Z) - VARP: Reinforcement Learning from Vision-Language Model Feedback with Agent Regularized Preferences [13.337649128532307]
Preference-based RL mitigates some of these pitfalls by learning rewards from comparative feedback. A single final-state image generally fails to capture the agent's full motion. We present a two-part solution that both improves feedback accuracy and better aligns reward learning with the agent's policy.
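Learning rewards from comparative feedback is standardly done with a Bradley-Terry preference loss; the sketch below sums per-step rewards over whole trajectories, reflecting the point above that a single final-state image misses the agent's full motion. The callable signature and tensor shapes are assumptions, not VARP's interface.

```python
import torch
import torch.nn.functional as F

def trajectory_preference_loss(reward_model, traj_a, traj_b, prefer_a):
    """Bradley-Terry preference loss over whole trajectories.
    reward_model(traj) is assumed to return per-step rewards of shape
    (batch, steps); prefer_a is 1.0 where trajectory a was preferred."""
    r_a = reward_model(traj_a).sum(dim=1)  # (batch,) total reward, all steps
    r_b = reward_model(traj_b).sum(dim=1)
    # P(a preferred over b) = sigmoid(r_a - r_b)
    return F.binary_cross_entropy_with_logits(r_a - r_b, prefer_a)
```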
arXiv Detail & Related papers (2025-03-18T01:51:27Z) - Agent-R: Training Language Model Agents to Reflect via Iterative Self-Training [18.896813839389893]
We propose an iterative self-training framework, Agent-R, that enables language Agents to Reflect on the fly. Unlike traditional methods that reward or penalize actions based on correctness, Agent-R leverages MCTS to construct training data that recovers correct trajectories from erroneous ones. Our findings demonstrate that Agent-R continuously improves the model's ability to recover from errors and enables timely error correction.
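The data-construction step, recovering a correct trajectory from an erroneous one, can be sketched as a splice of the erroneous prefix onto a correct continuation with an inserted reflection. How Agent-R actually locates the transition point via MCTS is not reproduced here, and every name below is hypothetical.

```python
def build_revision_example(bad_traj, good_traj, first_error_step, reflection_text):
    """Splice a bad prefix onto a good continuation so the model sees
    an example of recovering mid-trajectory. bad_traj/good_traj are
    lists of step dicts; first_error_step would come from MCTS in the
    real method, not from this illustrative helper."""
    prefix = bad_traj[:first_error_step]        # steps up to the detected error
    recovery = good_traj[first_error_step:]     # correct continuation
    reflection = {"role": "reflection", "text": reflection_text}
    return prefix + [reflection] + recovery
```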
arXiv Detail & Related papers (2025-01-20T11:46:04Z) - Weak Reward Model Transforms Generative Models into Robust Causal Event Extraction Systems [17.10762463903638]
We train evaluation models to approximate human evaluation, achieving high agreement.
We propose a weak-to-strong supervision method that uses a fraction of the annotated data to train an evaluation model.
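A minimal sketch of that weak-to-strong recipe: train the evaluation model on the annotated fraction, then let only its confident judgments supervise the rest. The pipeline shape, confidence threshold, and helper names are assumptions rather than the paper's code.

```python
import torch
from torch.utils.data import Subset

def weak_to_strong_pipeline(evaluator, train_fn, labeled_idx, dataset, threshold=0.9):
    # 1. train the evaluation model on the small annotated fraction only
    train_fn(evaluator, Subset(dataset, labeled_idx))
    # 2. pseudo-label the remainder, keeping only confident judgments
    labeled = set(labeled_idx)
    pseudo = []
    with torch.no_grad():
        for i in range(len(dataset)):
            if i in labeled:
                continue
            x, _ = dataset[i]                       # assumes (input, label) pairs
            p = torch.sigmoid(evaluator(x.unsqueeze(0))).item()
            if p > threshold or p < 1 - threshold:
                pseudo.append((i, int(p > 0.5)))    # confident pseudo-label
    return pseudo  # extra supervision for training the stronger model
```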
arXiv Detail & Related papers (2024-06-26T10:48:14Z) - SMaRt: Improving GANs with Score Matching Regularity [114.43433222721025]
Generative adversarial networks (GANs) usually struggle to learn from highly diverse data, whose underlying manifold is complex. We find that score matching serves as a promising solution to this issue thanks to its capability of persistently pushing the generated data points towards the real data manifold. We show that our approach can consistently boost the performance of various state-of-the-art GANs on real-world datasets, with pre-trained diffusion models acting as the approximate score function.
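The mechanism described above, a pretrained diffusion score persistently pushing generated points toward the real data manifold, has a common loss form: detach a score-nudged target and penalize the generator's distance to it. This is one standard way to encode the idea and not necessarily SMaRt's exact regularizer; `pretrained_score_fn` is a placeholder.

```python
import torch

def score_matching_regularizer(x_fake, pretrained_score_fn, weight=0.1):
    """Added to the generator loss. The score s(x) from a pretrained
    diffusion model points toward the data manifold; with the target
    detached, the gradient of this term w.r.t. x_fake is -2*weight*s(x),
    so gradient descent moves generated samples along the score."""
    with torch.no_grad():
        target = x_fake + pretrained_score_fn(x_fake)  # nudged toward the manifold
    sq_err = (x_fake - target) ** 2
    return weight * sq_err.sum(dim=tuple(range(1, x_fake.dim()))).mean()
```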
arXiv Detail & Related papers (2023-11-30T03:05:14Z) - Unsupervised Dense Retrieval with Relevance-Aware Contrastive
Pre-Training [81.3781338418574]
We propose relevance-aware contrastive learning.
We consistently improve the SOTA unsupervised Contriever model on the BEIR and open-domain QA retrieval benchmarks.
Our method can not only beat BM25 after further pre-training on the target corpus but also serves as a good few-shot learner.
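The contrastive backbone behind dense retrievers such as Contriever is in-batch InfoNCE; the sketch below shows only that baseline objective, since the paper's relevance-aware weighting is not specified in this summary.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb, passage_emb, temperature=0.05):
    """In-batch InfoNCE for dense retrieval. query_emb and passage_emb
    are (batch, dim); row i of each forms a positive pair, and the other
    rows in the batch serve as negatives."""
    scores = query_emb @ passage_emb.T / temperature  # (batch, batch) similarities
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(scores, labels)            # diagonal entries are positives
```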
arXiv Detail & Related papers (2023-06-05T18:20:27Z) - Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
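Parametric likelihood ratios in DRO amount to a small adversary model whose outputs reweight the per-example loss toward hard subpopulations. A minimal sketch with illustrative names, parameterizing the ratio via softplus plus normalization:

```python
import torch
import torch.nn.functional as F

def dro_weighted_loss(per_example_loss, ratio_model, features):
    """Reweight per-example losses with a learned likelihood ratio.
    ratio_model(features) is assumed to return (batch, 1) logits;
    normalizing the ratios to mean 1 keeps them a valid reweighting."""
    ratios = F.softplus(ratio_model(features)).squeeze(-1)  # nonnegative ratios
    weights = ratios / ratios.mean().clamp(min=1e-8)
    return (weights * per_example_loss).mean()
```

In the full min-max game, the adversary's parameters are updated to increase this weighted loss while the main model is updated to decrease it.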
arXiv Detail & Related papers (2022-04-13T12:43:12Z) - RethinkCWS: Is Chinese Word Segmentation a Solved Task? [81.11161697133095]
The performance of Chinese Word Segmentation (CWS) systems has gradually reached a plateau with the rapid development of deep neural networks.
In this paper, we take stock of what we have achieved and rethink what's left in the CWS task.
arXiv Detail & Related papers (2020-11-13T11:07:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.