Related papers: Lessons from Training Grounded LLMs with Verifiable Rewards

URL: http://arxiv.org/abs/2506.15522v1
Date: Wed, 18 Jun 2025 14:58:13 GMT
Title: Lessons from Training Grounded LLMs with Verifiable Rewards
Authors: Shang Hong Sim, Tej Deep Pala, Vernon Toh, Hai Leong Chieu, Amir Zadeh, Chuan Li, Navonil Majumder, Soujanya Poria
Abstract summary: Reinforcement learning and internal reasoning can enhance grounding in large language models. We show that reasoning-augmented models significantly outperform instruction-only variants. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding.
Score: 24.35637263339965
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generating grounded and trustworthy responses remains a key challenge for large language models (LLMs). While retrieval-augmented generation (RAG) with citation-based grounding holds promise, instruction-tuned models frequently fail even in straightforward scenarios: missing explicitly stated answers, citing incorrectly, or refusing when evidence is available. In this work, we explore how reinforcement learning (RL) and internal reasoning can enhance grounding in LLMs. We use the GRPO (Group Relative Policy Optimization) method to train models using verifiable outcome-based rewards targeting answer correctness, citation sufficiency, and refusal quality, without requiring gold reasoning traces or expensive annotations. Through comprehensive experiments across ASQA, QAMPARI, ELI5, and ExpertQA, we show that reasoning-augmented models significantly outperform instruction-only variants, especially in handling unanswerable queries and generating well-cited responses. A two-stage training setup, first optimizing answer and citation behavior and then refusal, further improves grounding by stabilizing the learning signal. Additionally, we revisit instruction tuning via GPT-4 distillation and find that combining it with GRPO enhances performance on long-form, generative QA tasks. Overall, our findings highlight the value of reasoning, stage-wise optimization, and outcome-driven RL for building more verifiable and reliable LLMs.
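
The abstract describes GRPO training driven by verifiable outcome rewards for answer correctness, citation sufficiency, and refusal quality. The following is a minimal sketch of the group-relative advantage step only, not the authors' implementation. The function names, the boolean checks, and the equal weights in `composite_reward` are hypothetical; the abstract does not state how the paper combines or scales its reward terms.

```python
from statistics import mean, pstdev


def composite_reward(answer_ok, citations_ok, refusal_ok, weights=(1.0, 1.0, 1.0)):
    """Hypothetical weighted sum of three verifiable outcome checks.

    The equal weights are an assumption made for illustration only.
    """
    checks = (answer_ok, citations_ok, refusal_ok)
    return sum(w * float(c) for w, c in zip(weights, checks))


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples for the same prompt.

    GRPO-style: advantage = (reward - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt, scored by the hypothetical checks.
group = [
    composite_reward(True, True, True),
    composite_reward(True, False, True),
    composite_reward(False, False, True),
    composite_reward(False, False, False),
]
advantages = group_relative_advantages(group)
```

Responses that score above their group's mean receive a positive advantage and the rest a negative one, so no separate value model is needed. If every response in a group earns the same reward, all advantages are zero and that prompt contributes no gradient signal.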

Related papers

ProRAG: Process-Supervised Reinforcement Learning for Retrieval-Augmented Generation [54.071574153853994]
ProRAG is a process-supervised reinforcement learning framework designed to integrate learned step-level supervision into the online optimization loop. Our framework consists of four stages: (1) Supervised Policy Warmup to initialize the model with a structured reasoning format; (2) construction of an MCTS-based Process Reward Model (PRM) to quantify intermediate reasoning quality; (3) PRM-Guided Reasoning Refinement to align the policy with fine-grained process preferences; and (4) Process-Supervised Reinforcement Learning with a dual-granularity advantage mechanism.
arXiv Detail & Related papers (2026-01-29T16:04:59Z)
From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation [52.62655622099456]
We propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., reward chain). In this way, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts, and style, which evaluates adherence to stylistic properties.
arXiv Detail & Related papers (2026-01-26T14:39:58Z)
When Actions Teach You to Think: Reasoning-Action Synergy via Reinforcement Learning in Conversational Agents [2.689316553293938]
Supervised fine-tuning (SFT) has emerged as one of the most effective ways to improve the performance of large language models (LLMs) in downstream tasks. We propose a pipeline in which LLMs generate reasoning steps that guide both the invocation of tools and the final answer generation for conversational agents.
arXiv Detail & Related papers (2025-12-12T04:44:40Z)
Explore Data Left Behind in Reinforcement Learning for Reasoning Language Models [61.78513830395669]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an effective approach for improving the reasoning abilities of large language models (LLMs). As models train longer and scale larger, more training prompts become residual prompts, those with zero variance rewards that provide no training signal. We propose the Explore Residual Prompts in Policy Optimization framework, which encourages exploration on residual prompts and reactivates their training signals.
arXiv Detail & Related papers (2025-11-06T20:40:27Z)
Post-Training Large Language Models via Reinforcement Learning from Self-Feedback [3.73824942136665]
Large Language Models (LLMs) often produce plausible but poorly-calibrated answers. We present Reinforcement Learning from Self-Feedback (RLSF), a post-training stage that uses the model's own confidence as an intrinsic reward.
arXiv Detail & Related papers (2025-07-29T15:46:26Z)
ExPO: Unlocking Hard Reasoning with Self-Explanation-Guided Reinforcement Learning [12.83211408922535]
Reinforcement learning-style post-training improves reasoning by optimizing model outputs based on reward or preference signals. GRPO-style approaches implement this by using self-generated samples labeled by an outcome-based verifier. We propose Self-Explanation Policy Optimization (ExPO), a simple and modular framework that generates such samples by conditioning on the ground-truth answer.
arXiv Detail & Related papers (2025-07-03T17:44:55Z)
SeRL: Self-Play Reinforcement Learning for Large Language Models with Limited Data [65.56911325914582]
We propose Self-play Reinforcement Learning (SeRL) to bootstrap Large Language Models (LLMs) training with limited initial data. The proposed SeRL yields results superior to its counterparts and achieves performance on par with those obtained by high-quality data with verifiable rewards.
arXiv Detail & Related papers (2025-05-25T13:28:04Z)
Bridging Supervised Learning and Reinforcement Learning in Math Reasoning [55.889740979706815]
Reinforcement Learning (RL) has played a central role in the recent surge of math abilities by enabling self-improvement through binary verifier signals. In this work, we propose Negative-aware Fine-Tuning (NFT), a supervised approach that enables LLMs to reflect on their failures and improve autonomously with no external teachers.
arXiv Detail & Related papers (2025-05-23T17:17:40Z)
Trust, But Verify: A Self-Verification Approach to Reinforcement Learning with Verifiable Rewards [67.86091419220816]
Large Language Models (LLMs) show great promise in complex reasoning. A prevalent issue is "superficial self-reflection", where models fail to robustly verify their own outputs. We introduce RISE (Reinforcing Reasoning with Self-Verification), a novel online RL framework designed to tackle this.
arXiv Detail & Related papers (2025-05-19T17:59:31Z)
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components. We find that a simple rejection sampling baseline, RAFT, yields performance competitive with GRPO and PPO. Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
Post-training an LLM for RAG? Train on Self-Generated Demonstrations [19.972220654354494]
Large language models (LLMs) often struggle with knowledge-intensive NLP tasks. Retrieval-augmented generation (RAG) allows the model to leverage in-context information. We propose a recipe for training RAG-enabled LLMs using self-generated demonstrations.
arXiv Detail & Related papers (2025-02-14T23:00:49Z)
Enhancing Large Language Model Performance To Answer Questions and Extract Information More Accurately [2.1715455600756646]
Large Language Models (LLMs) generate responses to questions. Their effectiveness is often hindered by sub-optimal quality of answers and occasional failures to provide accurate responses to questions. To address these challenges, a fine-tuning process is employed, involving feedback and examples to refine models.
arXiv Detail & Related papers (2024-01-27T00:18:07Z)
Effective Large Language Model Adaptation for Improved Grounding and Citation Generation [48.07830615309543]
This paper focuses on improving large language models (LLMs) by grounding their responses in retrieved passages and by providing citations. We propose a new framework, AGREE, that improves the grounding from a holistic perspective. Our framework tunes LLMs to self-ground the claims in their responses and provide accurate citations to retrieved documents.
arXiv Detail & Related papers (2023-11-16T03:22:25Z)
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG). Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection. It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z)

This list is automatically generated from the titles and abstracts of the papers on this site.