Fugu-MT Paper Translation (Abstract): Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous Datasets

Paper overview: Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous Datasets

arxiv url: http://arxiv.org/abs/2605.24862v1
Date: Sun, 24 May 2026 04:44:44 GMT
Status: Translation complete
Last updated in system: 2026-05-26 19:50:18.4828
Title: Unifying Value Alignment and Assignment in Cross-Domain Offline Reinforcement Learning with Heterogeneous Datasets
Authors: Zhongjian Qiao, Jiafei Lyu, Chenjia Bai, Peisong Wang, Siyang Gao, Shuang Qiu
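Abstract summary: Cross-domain offline reinforcement learning (RL) aims to learn a policy in the target domain using a limited target-domain dataset and a source-domain dataset that exhibits a dynamics shift. Recent studies perform data filtering from the perspective of dynamics alignment or value alignment to enable efficient policy transfer. The paper shows that value misassignment can undermine value alignment, mislead data filtering toward selecting suboptimal samples, and loosen the suboptimality gap. It proposes V2A, which integrates dynamics alignment, value alignment, and value assignment.
Reference score (independently computed attention score): 41.41933463623304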
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Cross-domain offline reinforcement learning (RL) aims to learn a policy in the target domain with a limited target domain dataset and a source domain dataset that exhibits a dynamics shift. Training directly on the original source dataset typically leads to performance collapse. Recent studies perform data filtering from the perspective of dynamics alignment or value alignment to enable efficient policy transfer. However, these studies are typically validated on single-domain or single-behavior-policy source datasets. In this work, we explore a more general heterogeneous cross-domain offline RL setting, where the source datasets may be collected from multiple source domains by diverse behavior policies. We first uncover a critical yet overlooked issue in this setting: value misassignment. Empirically and theoretically, we demonstrate that value misassignment can undermine value alignment, mislead data filtering toward selecting suboptimal samples, and loosen the suboptimality gap, thereby degrading the agent's performance. To address this issue, we propose V2A, which integrates dynamics alignment, value alignment, and value assignment. V2A first employs temporally-consistent modality representation learning to extract dynamics modalities from the source dataset, followed by modality-aware advantage learning to rectify value alignment. Finally, it adopts a data filtering paradigm to selectively share source data for policy learning. Empirical results show that V2A significantly outperforms strong baseline methods under general heterogeneous cross-domain offline RL settings.
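The abstract describes a data filtering paradigm that selectively shares source data for policy learning. The following is a minimal Python sketch of that general idea only, not V2A itself: it assumes each source transition already has a scalar score (for example, some advantage estimate) and keeps the highest-scoring fraction alongside the target data. The function names, the `keep_fraction` parameter, and the top-fraction rule are all hypothetical choices made for illustration; the abstract does not specify how V2A scores or thresholds samples.

```python
from typing import List, Sequence


def filter_source_data(scores: Sequence[float], keep_fraction: float) -> List[int]:
    """Return indices of the highest-scoring source transitions.

    `scores` stands in for whatever per-sample criterion a method uses
    (an assumption here; the abstract does not define it).
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if len(scores) == 0:
        return []
    # Keep at least one sample so the filtered set is never empty.
    n_keep = max(1, int(len(scores) * keep_fraction))
    # Rank indices by score, highest first.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Return the kept indices in their original dataset order.
    return sorted(ranked[:n_keep])


def build_training_set(target: Sequence, source: Sequence,
                       scores: Sequence[float], keep_fraction: float) -> list:
    """Combine all target samples with the filtered source samples."""
    kept = filter_source_data(scores, keep_fraction)
    return list(target) + [source[i] for i in kept]


# Hypothetical toy data: one target sample, four scored source samples.
example_scores = [0.1, 0.9, -0.3, 0.5]
example_set = build_training_set(["t0"], ["s0", "s1", "s2", "s3"],
                                 example_scores, keep_fraction=0.5)
```

With these toy scores, the two highest-scoring source samples (`s1` and `s3`) are kept. In the heterogeneous setting the abstract describes, the quality of such a filter depends on how the scores are computed, which is the part the paper addresses through value assignment.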


Related papers