Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
- URL: http://arxiv.org/abs/2509.13081v1
- Date: Tue, 16 Sep 2025 13:39:29 GMT
- Title: Shaping Explanations: Semantic Reward Modeling with Encoder-Only Transformers for GRPO
- Authors: Francesco Pappone, Ruggero Marino Lazzaroni, Federico Califano, Niccolò Gentile, Roberto Marras,
- Abstract summary: We introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. We apply this method to the task of training a model for the Italian medical-school entrance examinations.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Large Language Models (LLMs) excel at generating human-like text, aligning their outputs with complex, qualitative goals like pedagogical soundness remains a significant challenge. Standard reinforcement learning techniques often rely on slow and expensive LLM-as-a-judge evaluations or on brittle, keyword-based metrics like ROUGE, which fail to capture the semantic essence of a high-quality explanation. In this work, we introduce a novel approach to reward shaping within the Group Relative Policy Optimisation (GRPO) framework. Our central contribution is the use of a small, efficient encoder-only transformer as a semantic reward model. This model provides a dense, semantically rich reward signal based on the cosine similarity between a generated explanation and a ground-truth reference, guiding the policy towards explanations that are not just factually correct but also structurally and conceptually aligned with expert reasoning. We apply this method to the task of training a model for the Italian medical-school entrance examinations, following standard domain-adaptive continued pre-training (CPT) and supervised fine-tuning (SFT). Our results demonstrate that GRPO with our proposed semantic reward significantly improves explanation faithfulness and clarity over a strong SFT baseline, showcasing the power of using lightweight encoder models for nuanced reward shaping in complex generation tasks.
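As a concrete illustration of the mechanism the abstract describes, here is a minimal sketch of an encoder-based semantic reward: score a generated explanation by the cosine similarity between its embedding and that of a ground-truth reference. The specific encoder checkpoint and the `sentence-transformers` API are illustrative assumptions, not details reported by the paper.

```python
# Minimal sketch of an encoder-based semantic reward (illustrative only).
# The checkpoint below is an assumption chosen for multilingual (e.g.
# Italian) coverage, not the paper's reported model.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def semantic_reward(generated: str, reference: str) -> float:
    """Dense reward in [-1, 1]: cosine similarity of the two embeddings."""
    emb = encoder.encode(
        [generated, reference],
        convert_to_tensor=True,
        normalize_embeddings=True,  # unit-norm, so dot product == cosine
    )
    return float(emb[0] @ emb[1])
```

In a GRPO loop, this scalar would be computed for each completion sampled for a prompt and then standardized within the group; a sketch of that step appears after the related-papers list below.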
Related papers
- Owen-Shapley Policy Optimization (OSPO): A Principled RL Algorithm for Generative Search LLMs [12.75200353208858]
Owen-Shapley Policy Optimization (OSPO) is a framework that redistributes sequence-level advantages based on tokens' marginal contributions to outcomes. Unlike value-model-based methods requiring additional computation, OSPO employs potential-based reward shaping via Shapley-Owen attributions to assign segment-level credit. Experiments on Amazon ESCI and H&M Fashion datasets show consistent gains over baselines.
arXiv Detail & Related papers (2026-01-13T10:17:46Z) - MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z) - Reasoning Palette: Modulating Reasoning via Latent Contextualization for Controllable Exploration for (V)LMs [49.66344956133349]
Reasoning capacity shapes both inference-time performance and reinforcement learning (RL) training for large (vision-) language models. This paper proposes Reasoning Palette, a novel latent-modulation framework that endows the model with a latent variable for strategic contextualization.
arXiv Detail & Related papers (2025-12-19T03:32:53Z) - A First-Order Logic-Based Alternative to Reward Models in RLHF [0.0]
Reinforcement Learning from Human Feedback plays a crucial role in aligning large language models with human values and preferences. Existing approaches rely heavily on reward models to guide language models toward human-aligned behaviors. We propose a logic-similarity-based reward mechanism as an alternative to conventional reward modeling.
arXiv Detail & Related papers (2025-12-16T05:15:17Z) - Parent-Guided Semantic Reward Model (PGSRM): Embedding-Based Reward Functions for Reinforcement Learning of Transformer Language Models [51.56484100374058]
We introduce the Parent-Guided Semantic Reward Model (PGSRM). PGSRM replaces binary correctness signals, human preference data, and trained reward models with a simple embedding-based similarity signal. We find that PGSRM produces smoother reward improvement and more stable PPO dynamics than a binary reward baseline.
arXiv Detail & Related papers (2025-12-07T16:58:22Z) - Multi-Reward GRPO for Stable and Prosodic Single-Codebook TTS LLMs at Scale [12.626090218930578]
Single-codebook text-to-speech models often exhibit unstable prosody, speaker drift, and degraded naturalness. We propose a multi-reward Group Relative Policy Optimization framework that directly optimizes the token generation policy of single-codebook TTS LLMs. We show that the proposed method consistently enhances prosodic stability, speaker similarity, and overall speech naturalness in single-codebook TTS LLMs.
arXiv Detail & Related papers (2025-11-26T10:50:17Z) - ENIGMA: The Geometry of Reasoning and Alignment in Large-Language Models [0.0]
We present Entropic Mutual-Information Geometry Large-Language Model Alignment (ENIGMA). It improves reasoning, alignment and robustness by treating an organisation's policies/principles as directions to move on a model's information manifold.
arXiv Detail & Related papers (2025-10-13T11:13:09Z) - Beyond Imitation: Recovering Dense Rewards from Demonstrations [64.05543657441218]
Supervised fine-tuning (SFT) is treated as a simple imitation learning process that only trains a policy to imitate expert behavior on demonstration datasets. We prove that the SFT process does not just learn a policy, but also an implicit, dense, token-level reward model that explains the expert demonstrations. Dense-Path REINFORCE consistently outperforms the original SFT models on instruction-following benchmarks.
arXiv Detail & Related papers (2025-10-02T18:58:26Z) - Better Language Model-Based Judging Reward Modeling through Scaling Comprehension Boundaries [3.930598942647121]
We propose a two-stage LM-based judging reward model that utilizes an explanation-based slot framework for prediction. In both reinforcement learning from human feedback (RLHF) and out-of-distribution (OOD) scenarios, the ESFP-RM framework delivers more stable and generalizable reward signals.
arXiv Detail & Related papers (2025-08-25T17:11:28Z) - G$^2$RPO-A: Guided Group Relative Policy Optimization with Adaptive Guidance [1.0591274452539035]
We investigate Guided GRPO, which injects ground-truth reasoning steps into roll-out trajectories. We find that naively adding guidance delivers limited gains. Experiments on mathematical reasoning and code-generation benchmarks confirm that G$^2$RPO-A substantially outperforms vanilla GRPO.
arXiv Detail & Related papers (2025-08-18T15:41:16Z) - QA-LIGN: Aligning LLMs through Constitutionally Decomposed QA [46.65999744568314]
We introduce QA-LIGN, an automatic symbolic reward decomposition approach. Instead of training a black-box reward model that outputs a monolithic score, QA-LIGN formulates principle-specific evaluation questions. Experiments aligning an uncensored large language model with a set of constitutional principles demonstrate that QA-LIGN offers greater transparency and adaptability.
arXiv Detail & Related papers (2025-06-09T18:24:57Z) - CSE-SFP: Enabling Unsupervised Sentence Representation Learning via a Single Forward Pass [3.0566617373924325]
Recent advances in pre-trained language models (PLMs) have driven remarkable progress in sentence representation learning. We propose CSE-SFP, an innovative method that exploits the structural characteristics of generative models. We show that CSE-SFP not only produces higher-quality embeddings but also significantly reduces both training time and memory consumption.
arXiv Detail & Related papers (2025-05-01T08:27:14Z) - Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
arXiv Detail & Related papers (2024-02-09T07:45:26Z) - Disentanglement via Latent Quantization [60.37109712033694]
In this work, we construct an inductive bias towards encoding to and decoding from an organized latent space.
We demonstrate the broad applicability of this approach by adding it to both basic data-reconstructing (vanilla autoencoder) and latent-reconstructing (InfoGAN) generative models.
arXiv Detail & Related papers (2023-05-28T06:30:29Z)
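Several of the papers above (MAESTRO, the multi-reward TTS framework, G$^2$RPO-A) build on the group-relative advantage that GRPO uses in place of a learned value model. As a point of reference, a minimal sketch of that standard normalization, which standardizes rewards within each group of completions sampled for the same prompt:

```python
import torch

def group_relative_advantages(rewards: torch.Tensor,
                              eps: float = 1e-6) -> torch.Tensor:
    """GRPO-style advantages: standardize rewards within each group.

    rewards: shape (num_prompts, group_size), one scalar reward per
    sampled completion, e.g. the cosine-similarity reward sketched above.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[0.82, 0.65, 0.90, 0.71],
                        [0.40, 0.55, 0.35, 0.60]])
print(group_relative_advantages(rewards))
```

Because the advantage is computed relative to the group mean, a dense reward like the semantic one above only needs to rank completions sensibly within a group, not be calibrated on an absolute scale.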
This list is automatically generated from the titles and abstracts of the papers on this site.