Fugu-MT Paper Translation (Abstract): OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration

Paper overview: OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration

arxiv url: http://arxiv.org/abs/2604.02349v1
Date: Thu, 19 Feb 2026 02:11:01 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.323905
Title: OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration
Title (reference translation): OPRIDE: Offline Preference-based Reinforcement Learning via In-Dataset Exploration
Authors: Yiqin Yang, Hao Hu, Yihuan Mao, Jin Zhang, Chengjie Wu, Yuhua Jiang, Xu Yang, Runpeng Xie, Yi Fan, Bo Liu, Yang Gao, Bo Xu, Chongjie Zhang
Abstract summary: Preference-based reinforcement learning (PbRL) avoids sophisticated reward design and aligns with human intentions. To improve the query efficiency of offline PbRL, we propose a novel algorithm, Offline PbRL via In-Dataset Exploration (OPRIDE).
Reference score (independently computed attention score): 42.70370800703202
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Preference-based reinforcement learning (PbRL) can help avoid sophisticated reward designs and align better with human intentions, showing great promise in various real-world applications. However, obtaining human feedback for preferences can be expensive and time-consuming, which forms a strong barrier for PbRL. In this work, we address the problem of low query efficiency in offline PbRL, pinpointing two primary reasons: inefficient exploration and overoptimization of learned reward functions. In response to these challenges, we propose a novel algorithm, Offline PbRL via In-Dataset Exploration (OPRIDE), designed to enhance the query efficiency of offline PbRL. OPRIDE consists of two key features: a principled exploration strategy that maximizes the informativeness of the queries and a discount scheduling mechanism aimed at mitigating overoptimization of the learned reward functions. Through empirical evaluations, we demonstrate that OPRIDE significantly outperforms prior methods, achieving strong performance with notably fewer queries. Moreover, we provide theoretical guarantees of the algorithm's efficiency. Experimental results across various locomotion, manipulation, and navigation tasks underscore the efficacy and versatility of our approach.
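Abstract (reference translation): Preference-based reinforcement learning (PbRL) avoids sophisticated reward design, aligns with human intentions, and holds great promise for a variety of real-world applications. However, obtaining feedback on human preferences is expensive and time-consuming, which is a strong barrier for PbRL. In this work, we address the problem of low query efficiency in offline PbRL and identify two main causes: inefficient exploration and overoptimization of learned reward functions. To address these challenges, we propose a novel algorithm, Offline PbRL via In-Dataset Exploration (OPRIDE), to improve the query efficiency of offline PbRL. OPRIDE has two key features: an exploration strategy that maximizes the informativeness of queries, and a discount scheduling mechanism aimed at mitigating overoptimization of the learned reward functions. Empirical evaluations show that OPRIDE significantly outperforms prior methods, achieving high performance with notably fewer queries. We also provide theoretical guarantees of the algorithm's efficiency. Experimental results across various locomotion, manipulation, and navigation tasks demonstrate the effectiveness and versatility of our approach.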

