Overton Pluralistic Reinforcement Learning for Large Language Models
- URL: http://arxiv.org/abs/2602.20759v1
- Date: Tue, 24 Feb 2026 10:39:27 GMT
- Title: Overton Pluralistic Reinforcement Learning for Large Language Models
- Authors: Yu Fu, Seongho Son, Ilija Bogunovic
- Abstract summary: This paper introduces OP-GRPO, a reinforcement learning framework for implicit Overton Pluralism. It produces pluralistic responses without explicit prompting or modular orchestration. Empirical results demonstrate a "small models, big perspective coverage" effect.
- Score: 15.401087861313547
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Existing alignment paradigms remain limited in capturing the pluralistic nature of human values. Overton Pluralism addresses this gap by generating responses with diverse perspectives from a single query. This paper introduces OP-GRPO (Overton Pluralistic Group Relative Policy Optimization), a reinforcement learning framework for implicit Overton Pluralism that enables a single large language model to produce pluralistic responses without explicit prompting or modular orchestration. Our workflow consists of two main steps. First, similarity estimator training fine-tunes a Sentence Transformer for Overton Pluralism tasks to provide more accurate coverage evaluation of generated responses. Second, OP-GRPO training incorporates this similarity estimator into a dual-reward system designed to ensure both broad coverage of genuine human perspectives and the uniqueness of each perspective, thereby promoting diversity. Empirical results demonstrate a "small models, big perspective coverage" effect. The trained Qwen2.5-3B-Instruct model surpasses a 20B GPT-OSS baseline with a 37.4 percent relative accuracy gain on a Natural Language Inference benchmark, and also outperforms a modular architecture baseline with a 19.1 percent relative improvement. Additional evaluations using GPT-4.1 as a large language model judge further confirm the robustness of the approach.
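The abstract describes the dual-reward design only at a high level, so the following is a minimal sketch of how a coverage-plus-uniqueness reward signal could be computed with a Sentence Transformer. The checkpoint name, the equal 0.5/0.5 weights, and the mean-of-maxima aggregation are illustrative assumptions, and an off-the-shelf encoder stands in for the paper's fine-tuned similarity estimator.

```python
# Hedged sketch of a dual reward in the spirit of OP-GRPO: reward broad coverage
# of reference human perspectives and penalize redundancy among generated ones.
# Checkpoint, weights, and aggregation below are assumptions, not the paper's values.
from sentence_transformers import SentenceTransformer, util

# Stand-in for the paper's fine-tuned similarity estimator.
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def dual_reward(generated_perspectives, reference_perspectives,
                coverage_weight=0.5, uniqueness_weight=0.5):
    """Return a scalar reward balancing perspective coverage and uniqueness."""
    gen_emb = encoder.encode(generated_perspectives, convert_to_tensor=True)
    ref_emb = encoder.encode(reference_perspectives, convert_to_tensor=True)

    # Coverage: how well each reference perspective is matched by the closest
    # generated perspective (mean of row-wise maxima).
    sim_ref_gen = util.cos_sim(ref_emb, gen_emb)          # [num_ref, num_gen]
    coverage = sim_ref_gen.max(dim=1).values.mean().item()

    # Uniqueness: 1 minus the mean off-diagonal pairwise similarity among
    # generated perspectives, so near-duplicates lower the reward.
    sim_gen_gen = util.cos_sim(gen_emb, gen_emb)          # [num_gen, num_gen]
    n = sim_gen_gen.shape[0]
    off_diag = (sim_gen_gen.sum() - sim_gen_gen.diagonal().sum()) / max(n * (n - 1), 1)
    uniqueness = 1.0 - off_diag.item()

    return coverage_weight * coverage + uniqueness_weight * uniqueness


# Example: score one candidate completion against reference human perspectives.
reward = dual_reward(
    ["Some argue remote work boosts productivity.",
     "Others worry it erodes team cohesion."],
    ["Remote work can improve focus and output.",
     "Remote work may weaken workplace relationships.",
     "Commuting time savings benefit employees."],
)
print(f"dual reward: {reward:.3f}")
```

In the full method, such per-completion rewards would presumably feed a GRPO-style group-relative advantage and policy update; that optimization loop is omitted here.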
Related papers
- UniG2U-Bench: Do Unified Models Advance Multimodal Understanding? [50.92401586025528]
Unified multimodal models have recently demonstrated strong generative capabilities, yet whether and when generation improves understanding remains unclear. We introduce UniG2U-Bench, a comprehensive benchmark categorizing generation-to-understanding (G2U) evaluation into 7 regimes and 30 subtasks.
arXiv Detail & Related papers (2026-03-03T18:36:16Z) - Towards Low-Resource Alignment to Diverse Perspectives with Sparse Feedback [13.065059683491958]
We aim to enhance pluralistic alignment of language models in a low-resource setting with two methods: pluralistic decoding and model steering. Our proposed methods decrease false positives in several high-stakes tasks such as hate speech detection and misinformation detection. We hope our work highlights the importance of diversity and how language models can be adapted to consider nuanced perspectives.
arXiv Detail & Related papers (2025-10-17T23:06:21Z) - Aligning Multilingual Reasoning with Verifiable Semantics from a High-Resource Expert Model [13.788758077632432]
We introduce Pivot-Based Reinforcement Learning with Semantically Verifiable Rewards. This framework enhances multilingual reasoning by circumventing the need for human-annotated data in target languages. We show that our method significantly narrows the performance gap between English and other languages.
arXiv Detail & Related papers (2025-09-29T22:03:11Z) - Perception-Aware Policy Optimization for Multimodal Reasoning [79.56070395437898]
A major source of error in current multimodal reasoning lies in the perception of visual inputs. We propose PAPO, a novel policy gradient algorithm that encourages the model to learn to perceive while learning to reason. We observe a substantial reduction of 30.5% in perception errors, indicating improved perceptual capabilities with PAPO.
arXiv Detail & Related papers (2025-07-08T23:22:34Z) - Optimizing Safe and Aligned Language Generation: A Multi-Objective GRPO Approach [2.8626097661711394]
Reinforcement Learning from Human Feedback has achieved notable success in steering models, but is complex and can be unstable. Recent approaches such as Direct Preference Optimization (DPO) simplify preference-based fine-tuning but may introduce bias or trade off certain objectives. We propose a Group Relative Policy Optimization framework with a multi-label reward regression model to achieve safe and aligned language generation.
arXiv Detail & Related papers (2025-03-26T05:50:33Z) - UniBERT: Adversarial Training for Language-Universal Representations [2.294953003828613]
UniBERT is a compact multilingual language model that uses an innovative training framework that integrates three components: masked language modeling, adversarial training, and knowledge distillation. UniBERT is designed to reduce the computational demands of large-scale models while maintaining competitive performance across various natural language processing tasks.
arXiv Detail & Related papers (2025-03-16T18:44:06Z) - Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks [112.6716697906318]
We present Dynamic-SUPERB Phase-2, an open benchmark for the comprehensive evaluation of instruction-based universal speech models. Building upon the first generation, this second version incorporates 125 new tasks, expanding the benchmark to a total of 180 tasks. Evaluation results show that no model performed well universally.
arXiv Detail & Related papers (2024-11-08T06:33:22Z) - Conditional Language Policy: A General Framework for Steerable Multi-Objective Finetuning [72.46388818127105]
Conditional Language Policy (CLP) is a framework for finetuning language models on multiple objectives.
We show that CLP learns steerable models that effectively trade off conflicting objectives at inference time.
arXiv Detail & Related papers (2024-07-22T16:13:38Z) - Aligning Large Language Models by On-Policy Self-Judgment [49.31895979525054]
Existing approaches for aligning large language models with human preferences face a trade-off that requires a separate reward model (RM) for on-policy learning.
We present a novel alignment framework, SELF-JUDGE, that performs on-policy learning and is parameter-efficient.
We show that rejection sampling by itself can further improve performance without an additional evaluator.
arXiv Detail & Related papers (2024-02-17T11:25:26Z) - MaxMin-RLHF: Alignment with Diverse Human Preferences [101.57443597426374]
Reinforcement Learning from Human Feedback (RLHF) aligns language models to human preferences by employing a singular reward model derived from preference data. We learn a mixture of preference distributions via an expectation-maximization algorithm to better represent diverse human preferences. Our algorithm achieves an average improvement of more than 16% in win-rates over conventional RLHF algorithms.
arXiv Detail & Related papers (2024-02-14T03:56:27Z) - BatGPT: A Bidirectional Autoregressive Talker from Generative Pre-trained Transformer [77.28871523946418]
BatGPT is a large-scale language model designed and trained jointly by Wuhan University and Shanghai Jiao Tong University.
It is capable of generating highly natural and fluent text in response to various types of input, including text prompts, images, and audio.
arXiv Detail & Related papers (2023-07-01T15:10:01Z) - Unifying Language Learning Paradigms [96.35981503087567]
We present a unified framework for pre-training models that are universally effective across datasets and setups.
We show how different pre-training objectives can be cast as one another and how interpolating between different objectives can be effective.
Our model also achieves strong results at in-context learning, outperforming 175B GPT-3 on zero-shot SuperGLUE and tripling the performance of T5-XXL on one-shot summarization.
arXiv Detail & Related papers (2022-05-10T19:32:20Z) - Multi-Task and Multi-Corpora Training Strategies to Enhance Argumentative Sentence Linking Performance [4.374417345150659]
We improve a state-of-the-art linking model by using multi-task and multi-corpora training strategies.
Our auxiliary tasks help the model to learn the role of each sentence in the argumentative structure.
Experiments on essays written by English-as-a-foreign-language learners show that both strategies significantly improve the model's performance.
arXiv Detail & Related papers (2021-09-27T14:17:40Z)