Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
- URL: http://arxiv.org/abs/2503.23829v2
- Date: Tue, 01 Apr 2025 14:48:02 GMT
- Title: Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains
- Authors: Yi Su, Dian Yu, Linfeng Song, Juntao Li, Haitao Mi, Zhaopeng Tu, Min Zhang, Dong Yu
- Abstract summary: Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs). We investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education. We utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications.
- Score: 92.36624674516553
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs), especially when structured reference answers are accessible for verification. However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education, where structured reference answers are typically unavailable. We reveal that binary verification judgments on broad-domain tasks exhibit high consistency across various LLMs provided expert-written reference answers exist. Motivated by this finding, we utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications, especially in free-form, unstructured answer scenarios. We further demonstrate the feasibility of training cross-domain generative reward models using relatively small (7B) LLMs without the need for extensive domain-specific annotation. Through comprehensive experiments, our RLVR framework establishes clear performance gains, significantly outperforming state-of-the-art open-source aligned models such as Qwen2.5-72B and DeepSeek-R1-Distill-Qwen-32B across domains in free-form settings. Our approach notably enhances the robustness, flexibility, and scalability of RLVR, representing a substantial step towards practical reinforcement learning applications in complex, noisy-label scenarios.
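As a hedged illustration of the generative scoring technique described in the abstract, the sketch below derives a soft reward from a judge model's probability of answering "Yes" rather than from a binary verdict. The judge model name, prompt template, and token handling are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of a generative soft reward: score a free-form answer against
# an expert reference by reading a judge LM's probability of replying "Yes".
# The model name and prompt template below are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE = "Qwen/Qwen2.5-7B-Instruct"  # assumption: any small instruct LM could serve
tok = AutoTokenizer.from_pretrained(JUDGE)
model = AutoModelForCausalLM.from_pretrained(JUDGE, torch_dtype=torch.bfloat16)

def soft_reward(question: str, reference: str, answer: str) -> float:
    """Return P("Yes") in [0, 1] as a soft, model-based reward."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {answer}\n"
        "Is the candidate answer correct? Reply Yes or No.\nReply:"
    )
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]               # next-token logits
    yes_id = tok(" Yes", add_special_tokens=False).input_ids[0]
    no_id = tok(" No", add_special_tokens=False).input_ids[0]
    probs = torch.softmax(logits[[yes_id, no_id]], dim=-1)   # renormalize over {Yes, No}
    return probs[0].item()
```

Because the reward is the renormalized probability mass on "Yes", partially correct free-form answers can receive intermediate scores instead of a hard 0/1 judgment.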
Related papers
- SRPO: A Cross-Domain Implementation of Large-Scale Reinforcement Learning on LLM [18.275547804539016]
Two-Staged History-Resampling Policy Optimization (SRPO) surpasses the performance of DeepSeek-R1-Zero-32B on the AIME24 and LiveCodeBench benchmarks.
We introduce two key methodological innovations: (1) a two-stage cross-domain training paradigm designed to balance the development of mathematical reasoning and coding proficiency, and (2) History Resampling (HR), a technique that filters out ineffective training samples (see the sketch below).
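A minimal sketch of one plausible reading of History Resampling, under the assumption that prompt groups whose rollouts are uniformly correct or uniformly wrong yield zero group-relative advantage and can be dropped; the data layout is hypothetical.

```python
# Hedged sketch of a History-Resampling-style filter: drop prompt groups whose
# rollout rewards are uniform, since group-relative advantages there vanish and
# the sample contributes no gradient signal. Not SRPO's exact procedure.
from typing import List, Tuple

def filter_effective(groups: List[Tuple[str, List[float]]]) -> List[Tuple[str, List[float]]]:
    """Keep only prompts whose rollout rewards are not all identical."""
    return [(p, r) for p, r in groups if max(r) > min(r)]

batch = [("p1", [1.0, 1.0, 1.0]),   # uniformly correct -> filtered out
         ("p2", [0.0, 1.0, 0.0])]   # mixed outcomes   -> kept
print(filter_effective(batch))      # [('p2', [0.0, 1.0, 0.0])]
```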
arXiv Detail & Related papers (2025-04-19T13:06:03Z)
- Rethinking RL Scaling for Vision Language Models: A Transparent, From-Scratch Framework and Comprehensive Evaluation Scheme [36.34443944082215]
This work introduces a transparent, from-scratch framework for reinforcement learning (RL) in vision-language models (VLMs).
It offers a minimal yet functional four-step pipeline validated across multiple models and datasets.
In addition, a standardized evaluation scheme is proposed to assess training dynamics and reflective behaviors.
arXiv Detail & Related papers (2025-04-03T13:53:28Z)
- Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding.
It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions.
Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT).
Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
- MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning [56.97799347091435]
We present MM-Eureka, a multimodal reasoning model that successfully extends large-scale rule-based reinforcement learning (RL) to multimodal reasoning.
Our work reproduces key characteristics of text-based RL systems like DeepSeek-R1 in the multimodal space.
We demonstrate that both instruction-tuned and pre-trained models can develop strong multimodal reasoning capabilities through rule-based RL without supervised fine-tuning.
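A hedged sketch of the kind of rule-based reward such pipelines build on: a format check plus an exact-match accuracy check, with no learned reward model. The tag names and weights are illustrative assumptions, not MM-Eureka's exact rules.

```python
# Illustrative rule-based reward: format bonus for explicit reasoning tags plus
# an exact-match accuracy bonus on the final answer. Weights are assumptions.
import re

def rule_based_reward(response: str, gold_answer: str) -> float:
    reward = 0.0
    # Format rule: reasoning must be wrapped in <think>...</think>.
    if re.search(r"<think>.*?</think>", response, flags=re.DOTALL):
        reward += 0.1
    # Accuracy rule: the text after the reasoning block must match exactly.
    final = response.split("</think>")[-1].strip()
    if final == gold_answer.strip():
        reward += 1.0
    return reward

print(rule_based_reward("<think>2 + 2 = 4</think>4", "4"))  # 1.1
```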
arXiv Detail & Related papers (2025-03-10T14:23:12Z)
- R1-Searcher: Incentivizing the Search Capability in LLMs via Reinforcement Learning [87.30285670315334]
R1-Searcher is a novel two-stage outcome-based RL approach designed to enhance the search capabilities of Large Language Models.
Our framework relies exclusively on RL, without requiring process rewards or distillation for a cold start.
Our experiments demonstrate that our method significantly outperforms previous strong RAG methods, even when compared to the closed-source GPT-4o-mini.
arXiv Detail & Related papers (2025-03-07T17:14:44Z)
- Where Do We Stand with Implicit Neural Representations? A Technical and Performance Survey [16.89460694470542]
Implicit Neural Representations (INRs) have emerged as a paradigm in knowledge representation.
INRs leverage multilayer perceptrons (MLPs) to model data as continuous implicit functions.
This survey introduces a clear taxonomy that categorises them into four key areas: activation functions, position encoding, combined strategies, and network structure.
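As a minimal sketch of an INR under these definitions, the MLP below maps 2-D coordinates to a scalar signal value using SIREN-style sinusoidal activations, one option on the survey's activation-function axis; the layer sizes and frequency scale are illustrative.

```python
# Minimal implicit neural representation: an MLP from (x, y) coordinates to a
# continuous signal value, with periodic (SIREN-style) activations.
import torch
import torch.nn as nn

class SirenINR(nn.Module):
    def __init__(self, hidden: int = 256, omega: float = 30.0):
        super().__init__()
        self.omega = omega
        self.layers = nn.ModuleList(
            [nn.Linear(2, hidden), nn.Linear(hidden, hidden), nn.Linear(hidden, 1)]
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        x = coords
        for layer in self.layers[:-1]:
            x = torch.sin(self.omega * layer(x))  # periodic activation
        return self.layers[-1](x)                 # continuous signal value

inr = SirenINR()
xy = torch.rand(1024, 2) * 2 - 1    # query coordinates in [-1, 1]^2
values = inr(xy)                    # one scalar per queried coordinate
```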
arXiv Detail & Related papers (2024-11-06T06:14:24Z)
- RoRA-VLM: Robust Retrieval-Augmented Vision Language Models [41.09545760534495]
RoRA-VLM is a novel and robust retrieval augmentation framework specifically tailored for vision-language models.
We conduct extensive experiments to validate the effectiveness and robustness of our proposed methods on three widely adopted benchmark datasets.
arXiv Detail & Related papers (2024-10-11T14:51:00Z)
- Exploring Language Model Generalization in Low-Resource Extractive QA [57.14068405860034]
We investigate Extractive Question Answering (EQA) with Large Language Models (LLMs) under domain drift.
We devise a series of experiments to explain the performance gap empirically.
arXiv Detail & Related papers (2024-09-27T05:06:43Z)
- Retrieval Meets Reasoning: Even High-school Textbook Knowledge Benefits Multimodal Reasoning [49.3242278912771]
We introduce a novel multimodal RAG framework named RMR (Retrieval Meets Reasoning).
The RMR framework employs a bi-modal retrieval module to identify the most relevant question-answer pairs.
It significantly boosts the performance of various vision-language models across a spectrum of benchmark datasets.
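A hedged sketch of a bi-modal retrieval step in this spirit: rank stored question-answer pairs by a weighted fusion of text and image embedding similarities. The fusion weight and the use of precomputed embeddings are assumptions, not RMR's exact module.

```python
# Illustrative bi-modal retrieval: fuse text and image cosine similarities and
# return the indices of the top-k most relevant stored QA pairs.
import numpy as np

def cosine(query: np.ndarray, bank: np.ndarray) -> np.ndarray:
    return (bank @ query) / (np.linalg.norm(bank, axis=1) * np.linalg.norm(query) + 1e-8)

def retrieve(q_text, q_img, bank_text, bank_img, alpha: float = 0.5, k: int = 3):
    """Top-k indices under alpha * text similarity + (1 - alpha) * image similarity."""
    score = alpha * cosine(q_text, bank_text) + (1 - alpha) * cosine(q_img, bank_img)
    return np.argsort(-score)[:k]

bank_t, bank_i = np.random.rand(10, 64), np.random.rand(10, 64)  # toy embedding bank
print(retrieve(np.random.rand(64), np.random.rand(64), bank_t, bank_i))
```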
arXiv Detail & Related papers (2024-05-31T14:23:49Z)
- RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning [50.55776190278426]
Extrinsic rewards can effectively guide reinforcement learning (RL) agents in specific tasks.
We introduce RLeXplore, a unified, highly modularized, and plug-and-play framework offering reliable implementations of eight state-of-the-art intrinsic reward methods.
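As a hedged illustration of one method such toolkits implement, the sketch below shows Random Network Distillation (RND), where a predictor's error against a frozen random target network acts as a novelty bonus; this is a generic rendering, not RLeXplore's actual API.

```python
# Illustrative RND intrinsic reward: novel observations are poorly predicted by
# the trainable network, so their prediction error yields a larger bonus.
import torch
import torch.nn as nn

class RND(nn.Module):
    def __init__(self, obs_dim: int, embed_dim: int = 64):
        super().__init__()
        self.target = nn.Linear(obs_dim, embed_dim)     # frozen, randomly initialized
        self.predictor = nn.Linear(obs_dim, embed_dim)  # trained to imitate the target
        for p in self.target.parameters():
            p.requires_grad_(False)

    def intrinsic_reward(self, obs: torch.Tensor) -> torch.Tensor:
        return (self.predictor(obs) - self.target(obs)).pow(2).mean(dim=-1)

rnd = RND(obs_dim=8)
bonus = rnd.intrinsic_reward(torch.randn(32, 8))  # one novelty bonus per observation
```

In training, the predictor is optimized on the same squared error, so the bonus shrinks for frequently visited states.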
arXiv Detail & Related papers (2024-05-29T22:23:20Z)
- REBEL: Reinforcement Learning via Regressing Relative Rewards [59.68420022466047]
We propose REBEL, a minimalist RL algorithm for the era of generative models.
In theory, we prove that fundamental RL algorithms like Natural Policy Gradient can be seen as variants of REBEL.
We find that REBEL provides a unified approach to language modeling and image generation, with performance comparable to or stronger than PPO and DPO.
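A hedged sketch of the regression step the title and summary suggest: fit the change in a response pair's relative log-likelihoods to their reward gap. The eta scale and tensor shapes are illustrative assumptions rather than the paper's exact formulation.

```python
# Illustrative REBEL-style square loss: the scaled difference of log-likelihood
# ratios between two responses is regressed onto their reward difference.
import torch

def rebel_loss(logp_new_a, logp_old_a, logp_new_b, logp_old_b,
               reward_a, reward_b, eta: float = 1.0) -> torch.Tensor:
    model_diff = (logp_new_a - logp_old_a) - (logp_new_b - logp_old_b)
    return ((model_diff / eta - (reward_a - reward_b)) ** 2).mean()

# Toy usage: batched log-probs of paired responses under new and old policies.
la, la0, lb, lb0 = (torch.randn(16) for _ in range(4))
loss = rebel_loss(la, la0, lb, lb0, torch.rand(16), torch.rand(16))
```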
arXiv Detail & Related papers (2024-04-25T17:20:45Z)
- An Empirical Study of Automated Vulnerability Localization with Large Language Models [21.84971967029474]
Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in vulnerability localization remains underexplored.
Our investigation encompasses 10+ leading LLMs suitable for code analysis, including ChatGPT and various open-source models.
We explore the efficacy of these LLMs using 4 distinct paradigms: zero-shot learning, one-shot learning, discriminative fine-tuning, and generative fine-tuning.
arXiv Detail & Related papers (2024-03-30T08:42:10Z)
- PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models [13.313186665410486]
Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives.
Existing reward finetuning methods are limited by their instability in large-scale prompt datasets.
We propose Proximal Reward Difference Prediction (PRDP) to enable stable black-box reward finetuning for diffusion models.
arXiv Detail & Related papers (2024-02-13T18:58:16Z)
- Deep Reinforcement Learning from Hierarchical Preference Design [99.46415116087259]
This paper shows that, by exploiting certain structures in the feedback signals, one can ease the reward design process.
We propose HERON, a hierarchical reward modeling framework for two scenarios: (I) the feedback signals naturally form a hierarchy of importance; (II) the reward is sparse, but less important surrogate feedback is available to aid policy learning.
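A hedged sketch of a lexicographic trajectory comparison consistent with scenario (I): rank by the most important feedback signal first and break ties with less important ones. The tolerance and signal ordering are illustrative assumptions.

```python
# Illustrative hierarchical preference: feedback values are ordered from most
# to least important, and lower levels only matter when higher levels tie.
from typing import Sequence

def lexicographic_preference(traj_a: Sequence[float], traj_b: Sequence[float],
                             tol: float = 1e-3) -> int:
    """Return +1 if traj_a is preferred, -1 if traj_b is, 0 for a tie."""
    for fa, fb in zip(traj_a, traj_b):
        if abs(fa - fb) > tol:
            return 1 if fa > fb else -1
    return 0

# Signals ordered as (task success, safety margin, energy cost).
print(lexicographic_preference([1.0, 0.8, -2.0], [1.0, 0.5, -1.0]))  # +1
```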
arXiv Detail & Related papers (2023-09-06T00:44:29Z)
- A Neuromorphic Architecture for Reinforcement Learning from Real-Valued Observations [0.34410212782758043]
Reinforcement Learning (RL) provides a powerful framework for decision-making in complex environments.
This paper presents a novel Spiking Neural Network (SNN) architecture for solving RL problems with real-valued observations.
arXiv Detail & Related papers (2023-07-06T12:33:34Z)
- Information Directed Reward Learning for Reinforcement Learning [64.33774245655401]
We propose Information Directed Reward Learning (IDRL), which learns a model of the reward function that allows standard RL algorithms to achieve high expected return with as few expert queries as possible.
In contrast to prior active reward learning methods designed for specific types of queries, IDRL naturally accommodates different query types.
We support our findings with extensive evaluations in multiple environments and with different types of queries.
arXiv Detail & Related papers (2021-02-24T18:46:42Z)