Related papers: RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation

URL: http://arxiv.org/abs/2601.08430v1
Date: Tue, 13 Jan 2026 10:56:39 GMT
Title: RubricHub: A Comprehensive and Highly Discriminative Rubric Dataset via Automated Coarse-to-Fine Generation
Authors: Sunzhu Li, Jiale Zhao, Miteto Wei, Huimin Ren, Yang Zhou, Jingwen Yang, Shunyu Liu, Kaike Zhang, Wei Chen,
Abstract summary: Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics.<n>Existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect.<n>We propose an automated Coarse-to-Fine Generation framework that produces comprehensive and highly discnative criteria.
Score: 11.664443383764448
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) has driven substantial progress in reasoning-intensive domains like mathematics. However, optimizing open-ended generation remains challenging due to the lack of ground truth. While rubric-based evaluation offers a structured proxy for verification, existing methods suffer from scalability bottlenecks and coarse criteria, resulting in a supervision ceiling effect. To address this, we propose an automated Coarse-to-Fine Rubric Generation framework. By synergizing principle-guided synthesis, multi-model aggregation, and difficulty evolution, our approach produces comprehensive and highly discriminative criteria capable of capturing the subtle nuances. Based on this framework, we introduce RubricHub, a large-scale ($\sim$110k) and multi-domain dataset. We validate its utility through a two-stage post-training pipeline comprising Rubric-based Rejection Sampling Fine-Tuning (RuFT) and Reinforcement Learning (RuRL). Experimental results demonstrate that RubricHub unlocks significant performance gains: our post-trained Qwen3-14B achieves state-of-the-art (SOTA) results on HealthBench (69.3), surpassing proprietary frontier models such as GPT-5. The code and data will be released soon.

Related papers

Benchmarking Few-shot Transferability of Pre-trained Models with Improved Evaluation Protocols [123.73663884421272]
Few-shot transfer has been revolutionized by stronger pre-trained models and improved adaptation algorithms.<n>We establish FEWTRANS, a comprehensive benchmark containing 10 diverse datasets.<n>By releasing FEWTRANS, we aim to provide a rigorous "ruler" to streamline reproducible advances in few-shot transfer learning research.
arXiv Detail & Related papers (2026-02-28T05:41:57Z)
ReSyn: Autonomously Scaling Synthetic Environments for Reasoning Models [18.359969463106644]
Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising approach for training reasoning language models (RLMs)<n>In this work, we scale RLVR by introducing ReSyn, a pipeline that generates diverse reasoning environments equipped with instance generators and verifiers.
arXiv Detail & Related papers (2026-02-23T18:34:29Z)
CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation.<n>Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency.<n>We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z)
Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering [28.35101062722637]
Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs)<n>We propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry.<n>We show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines.
arXiv Detail & Related papers (2026-01-13T10:55:08Z)
Scaf-GRPO: Scaffolded Group Relative Policy Optimization for Enhancing LLM Reasoning [49.290631188365786]
Scaf-GRPO is a training framework that intervenes when a model's independent learning has plateaued.<n>It boosts the pass@1 score of the Qwen2.5-Math-7B model by a relative 44.3% over a vanilla GRPO baseline.<n>This result demonstrates our framework provides a robust and effective methodology for unlocking a model's ability to solve problems previously beyond its reach.
arXiv Detail & Related papers (2025-10-22T17:41:30Z)
Robust and Label-Efficient Deep Waste Detection [29.019461511410515]
Effective waste sorting is critical for sustainable recycling, yet AI research in this domain continues to lag behind commercial systems.<n>In this work, we advance AI-driven waste detection by establishing strong baselines and introducing an ensemble-based semi-supervised learning framework.
arXiv Detail & Related papers (2025-08-26T08:34:04Z)
VERIRL: Boosting the LLM-based Verilog Code Generation via Reinforcement Learning [32.974199255760944]
We introduce a reinforcement learning framework tailored for Verilog code generation.<n>To tackle the problem of sparse and noisy reward signals, we propose a Trace-back based Rescore mechanism.<n>To mitigate catastrophic forgetting and overfitting during RL fine-tuning, we introduce a sample-balanced weighting strategy.
arXiv Detail & Related papers (2025-08-25T20:20:44Z)
Nested-ReFT: Efficient Reinforcement Learning for Large Language Model Fine-Tuning via Off-Policy Rollouts [25.205293698698867]
We introduce Nested-ReFT, where a subset of layers of the target model acts as the behavior model to generate off-policy completions during training.<n>Our theoretical analysis shows that Nested-ReFT yields unbiased gradient estimates with controlled variance.<n>Our empirical analysis demonstrates improved computational efficiency measured as tokens/sec across multiple math reasoning benchmarks and model sizes.
arXiv Detail & Related papers (2025-08-13T18:37:46Z)
Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute scaling framework that leverages increased inference-time instead of larger models.<n>Our framework incorporates two complementary strategies: internal TTC and external TTC.<n>We demonstrate our textbf32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
Noisy Self-Training with Synthetic Queries for Dense Retrieval [49.49928764695172]
We introduce a novel noisy self-training framework combined with synthetic queries. Experimental results show that our method improves consistently over existing methods. Our method is data efficient and outperforms competitive baselines.
arXiv Detail & Related papers (2023-11-27T06:19:50Z)
When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task has applications in safety-sensitive applications such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
Toward Certified Robustness Against Real-World Distribution Shifts [65.66374339500025]
We train a generative model to learn perturbations from data and define specifications with respect to the output of the learned model. A unique challenge arising from this setting is that existing verifiers cannot tightly approximate sigmoid activations. We propose a general meta-algorithm for handling sigmoid activations which leverages classical notions of counter-example-guided abstraction refinement.
arXiv Detail & Related papers (2022-06-08T04:09:13Z)

This list is automatically generated from the titles and abstracts of the papers in this site.