Preference-Guided Learning for Sparse-Reward Multi-Agent Reinforcement Learning
- URL: http://arxiv.org/abs/2509.21828v1
- Date: Fri, 26 Sep 2025 03:41:40 GMT
- Title: Preference-Guided Learning for Sparse-Reward Multi-Agent Reinforcement Learning
- Authors: The Viet Bui, Tien Mai, Hong Thanh Nguyen
- Abstract summary: We study the problem of online multi-agent reinforcement learning (MARL) in environments with sparse rewards. The lack of intermediate rewards hinders standard MARL algorithms from effectively guiding policy learning. We propose a novel framework that integrates online inverse preference learning with multi-agent on-policy optimization.
- Score: 15.034714081414691
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study the problem of online multi-agent reinforcement learning (MARL) in environments with sparse rewards, where reward feedback is not provided at each interaction but only revealed at the end of a trajectory. This setting, though realistic, presents a fundamental challenge: the lack of intermediate rewards hinders standard MARL algorithms from effectively guiding policy learning. To address this issue, we propose a novel framework that integrates online inverse preference learning with multi-agent on-policy optimization into a unified architecture. At its core, our approach introduces an implicit multi-agent reward learning model, built upon a preference-based value-decomposition network, which produces both global and local reward signals. These signals are further used to construct dual advantage streams, enabling differentiated learning targets for the centralized critic and decentralized actors. In addition, we demonstrate how large language models (LLMs) can be leveraged to provide preference labels that enhance the quality of the learned reward model. Empirical evaluations on state-of-the-art benchmarks, including MAMuJoCo and SMACv2, show that our method achieves superior performance compared to existing baselines, highlighting its effectiveness in addressing sparse-reward challenges in online MARL.
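The core idea of the abstract, an implicit reward model built on a preference-based value-decomposition network, can be illustrated with a minimal sketch. This is not the authors' code: the additive mixing of per-agent local rewards into a global reward, the network sizes, and all names (`PreferenceRewardModel`, `preference_loss`) are illustrative assumptions; the trajectory-level Bradley-Terry preference loss is a standard choice for learning rewards from preference labels.

```python
# Illustrative sketch (not the paper's implementation): a reward model
# that emits per-agent local rewards, mixes them additively into a
# global reward, and is trained from trajectory-level preference labels
# with a Bradley-Terry (logistic) loss.
import torch
import torch.nn as nn

class PreferenceRewardModel(nn.Module):
    def __init__(self, obs_dim, act_dim, n_agents, hidden=64):
        super().__init__()
        self.n_agents = n_agents
        # Shared head producing one local reward per agent per step.
        self.local_head = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, act):
        # obs: (batch, time, n_agents, obs_dim); act: (batch, time, n_agents, act_dim)
        local = self.local_head(torch.cat([obs, act], dim=-1)).squeeze(-1)
        global_r = local.sum(dim=-1)  # simple additive mixing across agents
        return local, global_r        # shapes: (B, T, N) and (B, T)

def preference_loss(model, traj_a, traj_b, label):
    # label[i] = 1.0 if trajectory A is preferred over B in pair i, else 0.0.
    _, r_a = model(*traj_a)
    _, r_b = model(*traj_b)
    # Bradley-Terry: preference probability from the return difference.
    logits = r_a.sum(dim=-1) - r_b.sum(dim=-1)
    return nn.functional.binary_cross_entropy_with_logits(logits, label)
```

The learned local rewards could then feed decentralized actor advantages while the global reward feeds the centralized critic, mirroring the "dual advantage streams" described above; how the paper actually constructs those streams is not specified here.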
Related papers
- From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z) - SoliReward: Mitigating Susceptibility to Reward Hacking and Annotation Noise in Video Generation Reward Models [53.19726629537694]
Post-training alignment of video generation models with human preferences is a critical goal. Current data collection paradigms, reliant on in-prompt pairwise annotations, suffer from labeling noise. We propose SoliReward, a systematic framework for video RM training.
arXiv Detail & Related papers (2025-12-17T14:28:23Z) - PROF: An LLM-based Reward Code Preference Optimization Framework for Offline Imitation Learning [29.373324685358753]
We propose PROF, a framework to generate and improve executable reward function code from natural language descriptions and a single expert trajectory. We also propose Reward Preference Ranking (RPR), a novel reward function quality assessment and ranking strategy without requiring environment interactions or RL training.
arXiv Detail & Related papers (2025-11-14T14:38:02Z) - Online Process Reward Learning for Agentic Reinforcement Learning [92.26560379363492]
Large language models (LLMs) are increasingly trained with reinforcement learning (RL) as autonomous agents. Recent work attempts to integrate process supervision into agent learning but suffers from biased annotation. We introduce Online Process Reward Learning (OPRL), a general credit-assignment strategy for agentic RL.
arXiv Detail & Related papers (2025-09-23T16:15:42Z) - LLM-Driven Intrinsic Motivation for Sparse Reward Reinforcement Learning [0.27528170226206433]
This paper explores the combination of two intrinsic motivation strategies to improve the efficiency of learning agents in environments with extremely sparse rewards. We propose integrating Variational State as Intrinsic Reward (VSIMR), which uses Variational AutoEncoders (VAEs) to reward state novelty, with an intrinsic reward approach derived from Large Language Models (LLMs). Our empirical results show that this combined strategy significantly increases agent performance and efficiency compared to using each strategy individually.
arXiv Detail & Related papers (2025-08-25T19:10:58Z) - Ensemble-MIX: Enhancing Sample Efficiency in Multi-Agent RL Using Ensemble Methods [0.0]
Multi-agent reinforcement learning (MARL) methods have achieved state-of-the-art results on a range of multi-agent tasks. Yet, MARL algorithms require significantly more environment interactions than their single-agent counterparts to converge. We propose a novel algorithm that combines a decomposed centralized critic with decentralized ensemble learning.
arXiv Detail & Related papers (2025-06-03T13:13:15Z) - MisoDICE: Multi-Agent Imitation from Unlabeled Mixed-Quality Demonstrations [5.4482836906033585]
We study offline imitation learning (IL) in cooperative multi-agent settings, where demonstrations have unlabeled mixed quality. Our proposed solution is structured in two stages: trajectory labeling and multi-agent imitation learning. We introduce MisoDICE, a novel multi-agent IL algorithm that leverages these labels to learn robust policies.
arXiv Detail & Related papers (2025-05-24T08:43:42Z) - Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques [65.55451717632317]
We study Preference-Based Multi-Agent Reinforcement Learning (PbMARL). We identify the Nash equilibrium from a preference-only offline dataset in general-sum games. Our findings underscore the multifaceted approach required for PbMARL.
arXiv Detail & Related papers (2024-09-01T13:14:41Z) - Self-Exploring Language Models: Active Preference Elicitation for Online Alignment [88.56809269990625]
We propose a bilevel objective optimistically biased towards potentially high-reward responses to actively explore out-of-distribution regions.
Our experimental results demonstrate that when fine-tuned on Zephyr-7B-SFT and Llama-3-8B-Instruct models, Self-Exploring Language Models (SELM) significantly boosts the performance on instruction-following benchmarks.
arXiv Detail & Related papers (2024-05-29T17:59:07Z) - Multi-turn Reinforcement Learning from Preference Human Feedback [41.327438095745315]
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach for aligning Large Language Models with human preferences. Existing methods work by emulating the preferences at the single decision (turn) level. We develop novel methods for Reinforcement Learning from preference feedback between two full multi-turn conversations.
arXiv Detail & Related papers (2024-05-23T14:53:54Z) - SALMON: Self-Alignment with Instructable Reward Models [80.83323636730341]
This paper presents a novel approach, namely SALMON, to align base language models with minimal human supervision.
We develop an AI assistant named Dromedary-2 with only 6 exemplars for in-context learning and 31 human-defined principles.
arXiv Detail & Related papers (2023-10-09T17:56:53Z) - MA2CL:Masked Attentive Contrastive Learning for Multi-Agent
Reinforcement Learning [128.19212716007794]
We propose an effective framework called textbfMulti-textbfAgent textbfMasked textbfAttentive textbfContrastive textbfLearning (MA2CL)
MA2CL encourages learning representation to be both temporal and agent-level predictive by reconstructing the masked agent observation in latent space.
Our method significantly improves the performance and sample efficiency of different MARL algorithms and outperforms other methods in various vision-based and state-based scenarios.
arXiv Detail & Related papers (2023-06-03T05:32:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.