Related papers: Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering

Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering

URL: http://arxiv.org/abs/2601.08427v1
Date: Tue, 13 Jan 2026 10:55:08 GMT
Title: Silence the Judge: Reinforcement Learning with Self-Verifier via Latent Geometric Clustering
Authors: Nonghai Zhang, Weitao Ma, Zhanyu Ma, Jun Xu, Jiuchong Gao, Jinghua Hao, Renqing He, Jingwen Xu,
Abstract summary: Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs)<n>We propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry.<n>We show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines.
Score: 28.35101062722637
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Group Relative Policy Optimization (GRPO) significantly enhances the reasoning performance of Large Language Models (LLMs). However, this success heavily relies on expensive external verifiers or human rules. Such dependency not only leads to significant computational costs and training latency, but also yields sparse rewards that hinder optimization efficiency. To address these challenges, we propose Latent-GRPO, a framework that derives intrinsic rewards directly from latent space geometry. Crucially, our empirical analysis reveals a compelling geometric property: terminal token representations of correct reasoning trajectories form dense clusters with high intra-class similarity, whereas incorrect trajectories remain scattered as outliers. In light of this discovery, we introduce the Iterative Robust Centroid Estimation (IRCE) algorithm, which generates dense, continuous rewards by mitigating magnitude fluctuations via spherical projection and estimating a robust ``truth centroid'' through iterative aggregation. Experimental results on multiple datasets show that our method maintains model performance while achieving a training speedup of over 2x compared to baselines. Furthermore, extensive results demonstrate strong generalization ability and robustness. The code will be released soon.

Related papers

GTS: Inference-Time Scaling of Latent Reasoning with a Learnable Gaussian Thought Sampler [54.10960908347221]
We model latent thought exploration as conditional sampling from learnable densities and instantiate this idea as a Gaussian Thought Sampler (GTS)<n>GTS predicts context-dependent perturbation distributions over continuous reasoning states and is trained with GRPO-style policy optimization while keeping the backbone frozen.
arXiv Detail & Related papers (2026-02-15T09:57:47Z)
Towards Sample-Efficient and Stable Reinforcement Learning for LLM-based Recommendation [56.92367609590823]
Long Chain-of-Thought (Long CoT) reasoning has shown promise in Large Language Models (LLMs)<n>We argue that Long CoT is inherently ill-suited for the sequential recommendation domain.<n>We propose RISER, a novel Reinforced Item Space Exploration framework for Recommendation.
arXiv Detail & Related papers (2026-01-31T10:02:43Z)
Latent-Space Contrastive Reinforcement Learning for Stable and Efficient LLM Reasoning [16.244366307890832]
We propose textbfDeepLatent Reasoning (DLR), a latent-space bidirectional contrastive reinforcement learning framework.<n>This framework shifts the trial-and-error cost from expensive token-level full sequence generation to the continuous latent manifold.<n> Experiments demonstrate that DLR achieves more stable training convergence, supports longer-horizon reasoning chains, and facilitates the sustainable accumulation of reasoning capabilities.
arXiv Detail & Related papers (2026-01-24T03:18:22Z)
ArenaRL: Scaling RL for Open-Ended Agents via Tournament-based Relative Ranking [84.07076200941474]
ArenaRL is a reinforcement learning paradigm that shifts from pointwise scalar scoring to intra-group relative ranking.<n>We construct an intra-group adversarial arena and devise a tournament-based ranking scheme to obtain stable advantage signals.<n>Experiments show that ArenaRL substantially outperforms standard RL baselines.
arXiv Detail & Related papers (2026-01-10T08:43:07Z)
VeRPO: Verifiable Dense Reward Policy Optimization for Code Generation [43.206705536310245]
We introduce textbfVeRPO (textbfVerifiable Dtextbfense textbfReward textbfPolicy textbfOptimization), a novel RL framework for code generation that synthesizes textitrobust and dense rewards fully grounded in verifiable execution feedback.<n>VeRPO consistently outperforms outcome-driven and RM-based baselines, achieving up to +8.83% gain in pass@1 with negligible time cost ( 0.02%) and zero
arXiv Detail & Related papers (2026-01-07T02:29:49Z)
ICPO: Intrinsic Confidence-Driven Group Relative Preference Optimization for Efficient Reinforcement Learning [17.98065634130798]
We propose the Intrinsic Confidence-Driven Group Relative Preference Optimization method (ICPO)<n>ICPO calculates a preference advantage score for each response by comparing the relative generation probabilities of multiple responses under the same input prompt.<n>We have discovered that the preference advantage score not only alleviates the issues of coarse-grained rewards and reward noise but also effectively curbs overconfident errors.
arXiv Detail & Related papers (2025-11-26T03:10:15Z)
Repurposing Synthetic Data for Fine-grained Search Agent Supervision [81.95597592711688]
LLM-based search agents are increasingly trained on entity-centric synthetic data.<n> prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards.<n>We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
arXiv Detail & Related papers (2025-10-28T17:50:40Z)
Rewarding the Journey, Not Just the Destination: A Composite Path and Answer Self-Scoring Reward Mechanism for Test-Time Reinforcement Learning [29.778703252962092]
Reinforcement Learning (RL) has emerged as a powerful paradigm for advancing Large Language Models (LLMs)<n>We develop a novel test-time reward mechanism that operates without external supervision.
arXiv Detail & Related papers (2025-10-20T07:53:51Z)
Distributionally Robust Federated Learning with Outlier Resilience [8.69285602685459]
We study distributionally robust federated learning with explicit outlier resilience.<n>We reformulate the problem as a tractable Lagrangian penalty optimization, which admits robustness certificates.<n>Building on this reformulation, we propose the distributionally outlier-robust federated learning algorithm and establish its convergence guarantees.
arXiv Detail & Related papers (2025-09-29T08:42:12Z)
Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling.<n>We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
GCC: Generative Calibration Clustering [55.44944397168619]
We propose a novel Generative Clustering (GCC) method to incorporate feature learning and augmentation into clustering procedure. First, we develop a discrimirative feature alignment mechanism to discover intrinsic relationship across real and generated samples. Second, we design a self-supervised metric learning to generate more reliable cluster assignment.
arXiv Detail & Related papers (2024-04-14T01:51:11Z)
Policy Optimization with Smooth Guidance Learned from State-Only Demonstrations [2.709826237514737]
The sparsity of reward feedback remains a challenging problem in online deep reinforcement learning. We propose a simple and efficient algorithm called Policy Optimization with Smooth Guidance (POSG) We show POSG's significant advantages in control performance and convergence speed in four sparse-reward environments.
arXiv Detail & Related papers (2023-12-30T07:41:45Z)

This list is automatically generated from the titles and abstracts of the papers in this site.