Fugu-MT Paper Translation (Abstract): Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Paper overview: Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

arxiv url: http://arxiv.org/abs/2605.29028v1
Date: Wed, 27 May 2026 19:24:35 GMT
Status: Translation complete
Last updated in system: 2026-05-30 02:45:55.345754
Title: Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning
Title (reference translation): Return-to-Go: Q-Guided Alignment for Return-Conditioned Supervised Learning
Authors: Yuxiao Yang, Weitong Zhang
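Abstract summary: Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. The paper proposes Q-ALIGN DT, a framework that enforces alignment between the input RTG and the performance of the output policy. It shows that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one.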
Reference score (independently computed attention score): 18.76637029534068
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.
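Abstract (reference translation): Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring that the $Q$-value of the output policy is consistent with the input RTG. By using a $Q$ function to provide dense guidance to the CSM, and further fine-tuning it with an RTG-perturbation technique together with the CSM, the method consistently maps higher RTGs to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal policy when the RTG is sufficiently high. Experiments show that Q-ALIGN DT achieves superior controllability and performance on the D4RL benchmark. Notably, the model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks such as velocity tracking, where prior methods fail.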
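
Illustrative sketch (not from the paper): the abstract describes alignment as keeping the $Q$-value of the output policy consistent with the input RTG. The toy Python example below shows one way such a gap could be measured. Everything in it is an assumption made for illustration: the one-dimensional state and action, the hand-written critic `q_value`, the single-parameter `policy`, and the squared-gap `alignment_loss` are hypothetical and are not the Q-ALIGN DT objective, architecture, or training procedure.

```python
import math

# Hypothetical toy setup: these names and formulas are assumptions for
# illustration only and do not come from the paper.
STATES = [0.0, 1.0]
RTGS = [2.0, 6.0, 10.0]
MAX_RETURN = 10.0


def q_value(state: float, action: float) -> float:
    """Toy critic: the return is highest (10) when the action equals the state."""
    return MAX_RETURN - (action - state) ** 2


def policy(state: float, rtg: float, theta: float) -> float:
    """Toy RTG-conditioned policy with a single scalar parameter theta.

    theta = 1 makes the achieved Q-value equal the requested RTG;
    theta = 0 ignores the RTG and always picks the best action.
    """
    return state + theta * math.sqrt(max(MAX_RETURN - rtg, 0.0))


def alignment_loss(theta: float) -> float:
    """Mean squared gap between Q(s, policy(s, g)) and the requested RTG g."""
    gaps = [
        (q_value(s, policy(s, g, theta)) - g) ** 2
        for s in STATES
        for g in RTGS
    ]
    return sum(gaps) / len(gaps)


def q_by_rtg(state: float, theta: float) -> list:
    """Q-values achieved at one state as the requested RTG increases."""
    return [q_value(state, policy(state, g, theta)) for g in RTGS]


if __name__ == "__main__":
    print("loss when the policy ignores the RTG:", alignment_loss(0.0))
    print("loss when the policy is aligned:     ", alignment_loss(1.0))
    print("achieved Q per RTG (aligned):        ", q_by_rtg(0.0, 1.0))
```

In this toy setting, a policy that ignores the RTG has a large gap for low requested returns, while the aligned policy yields Q-values that rise with the requested RTG, which is the monotone behavior the abstract describes.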
