Fugu-MT Paper Translation (Summary): Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Paper summary: Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

arxiv url: http://arxiv.org/abs/2605.11651v2
Date: Wed, 13 May 2026 01:49:55 GMT
Status: Translation complete
Last updated in system: 2026-05-14 17:13:58.890414
Title: Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
Title (reference translation): Hide to See: Reasoning-Prefix Masking for Visually Anchored Thinking in VLM Distillation
Authors: Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son
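Abstract summary: This paper proposes a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information. In the distillation phase, the student is guided by a salient reasoning-prefix mask that blocks both future tokens and salient reasoning cues. Experimental results show that the approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks.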
Reference score (independently computed attention score): 16.537720911494066
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for such masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which masks high-influence reasoning prefixes selectively for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student thinking process.
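Abstract (reference translation): Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, improve reasoning performance by using intermediate thinking steps before the final answer, but their high computational cost limits real-world deployment. To distill these capabilities into a compact think-answer VLM, the primary objective is to improve the student's ability to use visual evidence throughout its reasoning trace. The paper therefore proposes a novel think-answer distillation framework that masks the student's salient reasoning prefixes so that the student anchors its thinking on visual information. To compensate for the masked textual cues, the student is encouraged to rely on visual evidence as an alternative source of information during distillation. The masking strategies are: 1) token-wise salient reasoning-prefix masking, which selectively masks high-influence reasoning prefixes for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. In the distillation phase, the student is guided by the salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that the approach outperforms recent open-source VLMs, VLM distillation methods, and self-distillation methods on multimodal reasoning benchmarks, and further analyses confirm improved visual utilization along the student's thinking process.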
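Illustrative sketch (not from the paper): the two masking strategies named in the abstract could be read roughly as follows. This is a minimal pure-Python sketch under stated assumptions, not the authors' implementation. The abstract does not say how salience is computed, so `saliency` is taken as a given input; the function names, the `reasoning_start` index, the `max_budget` cap, and the linear schedule in `masking_budget` are all hypothetical choices made for illustration.

```python
import math


def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over a shared vocabulary.

    Used here as one possible teacher-student discrepancy measure (assumption).
    """
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )


def masking_budget(discrepancy, reference, max_budget=0.5):
    """Hypothetical self-paced schedule (assumption, not the paper's formula).

    The lower the discrepancy relative to `reference`, the larger the
    fraction of reasoning-prefix tokens that gets hidden.
    """
    ease = 1.0 - min(discrepancy / reference, 1.0)
    return max_budget * ease


def salient_prefix_mask(saliency, reasoning_start, budget):
    """Build allowed[t][j]: True if query position t may attend to key j.

    saliency[t][j] is an assumed influence score of key j on the prediction
    made at position t. Positions before `reasoning_start` (image and prompt
    tokens) are never hidden, and each position can always see itself.
    """
    n = len(saliency)
    # Start from the standard causal mask: no access to future tokens.
    allowed = [[j <= t for j in range(n)] for t in range(n)]
    for t in range(reasoning_start, n):
        prefix = list(range(reasoning_start, t))  # earlier reasoning tokens
        k = int(budget * len(prefix))  # number of prefix tokens to hide
        most_salient = sorted(prefix, key=lambda j: saliency[t][j], reverse=True)[:k]
        for j in most_salient:
            allowed[t][j] = False
    return allowed
```

In this reading, the resulting boolean matrix would replace the causal attention mask during distillation, so that each next-token prediction loses its most influential textual cues but keeps full access to the visual tokens.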

Related papers