Fugu-MT Paper Translation (Abstract): Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

Paper overview: Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning

arxiv url: http://arxiv.org/abs/2605.09640v1
Date: Sun, 10 May 2026 16:36:12 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:50.347746
Title: Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
Title (reference translation): Overcoming Catastrophic Forgetting in Visual Continual Learning with Reinforcement Fine-Tuning
Authors: Meng Lou, Hanzhong Guo, Linwei Chen, Yizhou Yu
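Abstract summary: Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). The paper proposes Retention-aware Policy Optimization (RaPO), a simple RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. RaPO substantially reduces catastrophic forgetting while preserving strong plasticity.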
Reference score (independently computed attention score): 44.7099384060866
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resilient to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in challenging visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open problem. Through a pilot study, we confirm that while RFT consistently outperforms SFT, it still suffers from non-negligible forgetting. We empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts achieving identical task rewards, the KL divergence from the preceding-task policy varies substantially, which strongly correlates with catastrophic forgetting across sequential tasks. Motivated by this insight, we propose Retention-aware Policy Optimization (RaPO), a simple yet effective RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. Specifically, RaPO comprises two core components: (1) Retention Reward that converts trajectory-level distribution drift into a continuous reward signal, preferentially reinforcing knowledge-preserving rollouts within each group; (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize the optimization progress during continual learning. Leveraging the free-form textual generalization of MLLMs, we comprehensively evaluate RaPO across five visual continual learning settings. Extensive experiments demonstrate that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of our knowledge, this work represents the first systematic exploration of RFT in visual continual learning, offering insights that we hope will inspire future research.
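Abstract (reference translation): Recent studies suggest that Reinforcement Fine-Tuning (RFT) is inherently more resistant to catastrophic forgetting than Supervised Fine-Tuning (SFT). However, whether RFT (e.g., GRPO) can effectively overcome forgetting in difficult visual continual learning settings, such as class-incremental learning (CIL) and domain-incremental learning (DIL), remains an open question. A pilot study confirms that RFT consistently outperforms SFT but still suffers from non-negligible forgetting. The authors empirically trace this bottleneck to Trajectory-level Drift Agnosticism: among candidate rollouts that achieve identical task rewards, the KL divergence from the preceding-task policy varies substantially, and this variation correlates strongly with catastrophic forgetting across sequential tasks. Motivated by this finding, they propose Retention-aware Policy Optimization (RaPO), a simple RFT method that explicitly mitigates forgetting through trajectory-level reward shaping. RaPO consists of two core components: (1) a Retention Reward, which converts trajectory-level distribution drift into a continuous reward signal and preferentially reinforces knowledge-preserving rollouts within each group; and (2) Cross-Task Advantage Normalization (CTAN), which maintains a persistent exponential moving average of reward statistics across task boundaries to stabilize optimization during continual learning. Using the free-form textual generalization of MLLMs, RaPO is evaluated across five visual continual learning settings. Extensive experiments show that RaPO achieves leading performance, substantially reducing catastrophic forgetting while preserving strong plasticity. To the best of the authors' knowledge, this is the first systematic exploration of RFT in visual continual learning, and they hope it will inspire future research.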
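The abstract names RaPO's two components but gives no formulas, so the following is only a minimal sketch of how such a scheme might look, not the authors' implementation. Every function name, the exponential mapping from drift to reward, the additive shaping with `weight`, the `momentum` value, and the KL direction (current policy relative to the preceding-task policy, estimated from sampled-token log-probabilities) are assumptions made for illustration.

```python
import math
from dataclasses import dataclass


def trajectory_drift(logp_current, logp_previous):
    """Mean per-token log-ratio between current and previous-task policy.

    Both arguments hold log-probabilities of the same sampled tokens, so the
    mean of (current - previous) is a single-sample estimate of the KL
    divergence from the previous-task policy along this rollout (assumed form).
    """
    if len(logp_current) != len(logp_previous) or not logp_current:
        raise ValueError("expected two non-empty sequences of equal length")
    total = sum(c - p for c, p in zip(logp_current, logp_previous))
    return total / len(logp_current)


def retention_reward(drift, temperature=1.0):
    """Assumed mapping: squash drift into (0, 1], where 1 means no drift."""
    return math.exp(-max(drift, 0.0) / temperature)


def shaped_reward(task_reward, drift, weight=0.5):
    """Assumed additive shaping of the task reward with the retention term."""
    return task_reward + weight * retention_reward(drift)


@dataclass
class CrossTaskStats:
    """EMA of reward mean and variance, kept alive across task boundaries."""
    momentum: float = 0.9
    mean: float = 0.0
    var: float = 1.0
    seen: bool = False

    def update(self, rewards):
        batch_mean = sum(rewards) / len(rewards)
        batch_var = sum((r - batch_mean) ** 2 for r in rewards) / len(rewards)
        if not self.seen:
            # First batch initializes the running statistics directly.
            self.mean, self.var, self.seen = batch_mean, batch_var, True
        else:
            m = self.momentum
            self.mean = m * self.mean + (1 - m) * batch_mean
            self.var = m * self.var + (1 - m) * batch_var

    def advantages(self, rewards, eps=1e-6):
        # Normalize with the persistent statistics, not per-group ones.
        std = math.sqrt(self.var)
        return [(r - self.mean) / (std + eps) for r in rewards]


# Two rollouts with the same task reward (1.0) but different drift.
stats = CrossTaskStats()
drifts = [
    trajectory_drift([-0.2, -0.3], [-0.25, -0.35]),  # small drift
    trajectory_drift([-0.2, -0.3], [-1.2, -1.3]),    # large drift
]
rewards = [shaped_reward(1.0, d) for d in drifts]
stats.update(rewards)
advantages = stats.advantages(rewards)
```

In this toy group the low-drift rollout receives the larger shaped reward and a positive advantage even though both rollouts earn the same task reward, which is the behavior the abstract attributes to the Retention Reward. Because `stats` is not reset when a new task starts, later groups are normalized against statistics that still reflect earlier tasks.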
