Fugu-MT Paper Translation (Abstract): Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Paper overview: Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

arxiv url: http://arxiv.org/abs/2603.17051v1
Date: Tue, 17 Mar 2026 18:32:18 GMT
Status: Translation complete
System update date: 2026-03-19 18:32:57.350052
Title: Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models
Authors: Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao
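Abstract summary: Distilled autoregressive (AR) video models enable efficient streaming generation but often misalign with human visual preferences. The paper proposes Astrolabe, an efficient online reinforcement learning framework tailored for distilled AR models.
Reference score (independently computed attention measure): 58.3184497327891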
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
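Illustrative note: the streaming scheme mentioned in the abstract (progressive generation with a rolling KV-cache, with updates applied only to a local clip window) can be pictured with a toy schedule. This is a minimal sketch under stated assumptions, not the authors' implementation: `rolling_schedule`, `num_clips`, and `cache_size` are hypothetical names, clips are plain integer ids, a bounded deque stands in for the KV-cache, and treating each newly generated clip as the update window is an assumption.

```python
from collections import deque


def rolling_schedule(num_clips, cache_size):
    """Return (context, clip) pairs for a toy streaming schedule.

    Hypothetical helper for illustration only:
    - `context` holds the ids of earlier clips still in the bounded cache;
      in a real system these would be cached keys/values used only for
      conditioning.
    - `clip` is the id of the clip generated at this step, assumed here to
      be the only window that would receive an RL update.
    """
    cache = deque(maxlen=cache_size)  # oldest entries are evicted automatically
    steps = []
    for clip in range(num_clips):
        steps.append((tuple(cache), clip))  # snapshot context before generating
        cache.append(clip)                  # the new clip becomes context for later steps
    return steps


if __name__ == "__main__":
    for context, clip in rolling_schedule(num_clips=5, cache_size=2):
        print(f"update clip {clip} conditioned on cached clips {context}")
```

In this sketch the context never grows beyond `cache_size` clips, so the per-step cost stays bounded however long the sequence becomes.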


Related papers