Fugu-MT Paper Translation (Summary): Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Paper summary: Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

arxiv url: http://arxiv.org/abs/2606.18974v1
Date: Wed, 17 Jun 2026 11:59:15 GMT
Status: Translation complete
Last updated in system: 2026-06-18 17:16:51.150213
Title: Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning
Title (reference translation): Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning
Authors: Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu
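Abstract summary: Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. To address this, the paper proposes Visual On-Policy Self-Distillation (Visual-OPSD).
Reference score (independently computed attention metric): 30.851590309402436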
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by $+3.40$pp with $14.3\times$ speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by $+63.83$pp on VSP. A Gaussian-noise control ($+0.40$pp vs. $+10.28$pp for real VTs) and $58.4\%$ closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
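Abstract (reference translation): Unified multimodal models (UMMs) generate "visual thoughts" (VTs) interleaved with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. The direct benefit of this cost is limited. On ThinkMorph, removing or adding noise to VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of its content. However, a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This indicates that the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, the authors propose Visual On-Policy Self-Distillation (Visual-OPSD). The teacher and student share the same weights but differ in context: the teacher sees privileged VTs, while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by $+3.40$pp with a $14.3\times$ speedup (10.0s vs. 142.8s per sample) and outperforms same-scale VLMs by $+63.83$pp on VSP. A Gaussian-noise control ($+0.40$pp vs. $+10.28$pp for real VTs) and $58.4\%$ closure of the KL gap confirm that the gains come from the semantic content of the generation pathway.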
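The token-level JSD distillation named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes the standard symmetric Jensen-Shannon divergence with equal weights, averaged over the positions of one student-sampled trajectory, and the function names (`softmax`, `kl`, `jsd`, `token_level_jsd_loss`) and the toy logits are hypothetical. The abstract does not state the exact weighting, how gradients flow to the teacher branch, or how privileged VTs enter the teacher's context, so none of that is modeled here.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Symmetric Jensen-Shannon divergence with equal weights (an assumption)."""
    mix = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)


def token_level_jsd_loss(teacher_logits, student_logits):
    """Mean per-token JSD along one trajectory.

    Both arguments are lists of per-position logit vectors scored on the SAME
    student-sampled tokens: the teacher pass would additionally condition on
    the privileged visual thoughts, the student pass only on the question.
    """
    per_token = [
        jsd(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_token) / len(per_token)


# Toy example: 3 positions over a 4-token vocabulary (made-up numbers).
teacher = [[2.0, 0.5, 0.1, -1.0], [0.0, 1.5, 0.2, 0.3], [1.0, 1.0, 1.0, 1.0]]
student = [[1.0, 1.0, 0.1, -1.0], [0.0, 1.5, 0.2, 0.3], [3.0, 0.0, 0.0, 0.0]]
loss = token_level_jsd_loss(teacher, student)
```

JSD is symmetric and bounded above by ln 2 in nats, so positions where teacher and student disagree sharply cannot dominate the loss the way an unbounded KL term could. The middle position in the toy data, where both passes agree, contributes zero.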

Related papers