Fugu-MT Paper Translation (Abstract): Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs

Paper overview: Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs

arxiv url: http://arxiv.org/abs/2510.21807v1
Date: Tue, 21 Oct 2025 08:50:11 GMT
Status: Translation complete
Last updated in system: 2025-10-28 17:41:21.902496
Title: Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs
Title (reference translation): Activating Visual Context and Commonsense Reasoning via Masked Prediction in VLMs
Authors: Jiaao Yu, Shenwei Li, Mingjie Han, Yifei Yin, Wenzheng Song, Chenghao Jia, Man Lan
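Abstract summary: This paper introduces Masked Prediction via Context and Commonsense, a new fine-tuning task. The task forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images. The paper also introduces an innovative training method, Reinforcement Fine-tuning with Prior Sampling.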
Reference score (independently computed attention score): 9.953258838113
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet a significant gap persists in their adaptation to real-world multimodal scenarios, most notably vision-language tasks, due to a heavy focus on single-modal language settings. While efforts to transplant reinforcement learning techniques from NLP to VLMs have emerged, these approaches often remain confined to perception-centric tasks or reduce images to textual summaries, failing to fully exploit visual context and commonsense knowledge, ultimately constraining the generalization of reasoning capabilities across diverse multimodal environments. To address this limitation, we introduce a novel fine-tuning task, Masked Prediction via Context and Commonsense, which forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images, thereby laying the foundation for generalized reasoning. To systematically evaluate the model performance in generalized reasoning, we developed a specialized evaluation benchmark, MPCC Eval, and employed various fine-tuning strategies to guide reasoning. Among these, we introduced an innovative training method, Reinforcement Fine-tuning with Prior Sampling, which not only enhances model performance but also improves its generalized reasoning capabilities in OOD and cross-task scenarios.
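Abstract (reference translation): Recent breakthroughs in reasoning models have significantly improved the reasoning ability of large language models, especially through training on tasks with verifiable rewards. However, a large gap persists in adapting them to real-world multimodal scenarios, especially vision-language tasks, because of the heavy emphasis on single-modal language settings. Attempts to transplant reinforcement learning techniques from NLP to VLMs have appeared, but these approaches are often limited to perception-centric tasks or to textual summaries of images. They fail to make full use of visual context and commonsense knowledge, which ultimately limits how well reasoning ability generalizes across diverse multimodal environments. To address this limitation, we introduce a new fine-tuning task, Masked Prediction via Context and Commonsense, which forces the model to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images, laying the foundation for generalized reasoning. To systematically evaluate model performance in generalized reasoning, we developed a specialized evaluation benchmark, MPCC Eval. We also introduce an innovative training method, Reinforcement Fine-tuning with Prior Sampling, which not only improves model performance but also strengthens generalized reasoning in OOD and cross-task scenarios.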
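Illustrative note: the abstract describes occluding part of an image and rewarding a model for recovering the hidden content in a verifiable way. The following is only a minimal sketch of that general idea, not the authors' implementation. The nested-list image representation, the function names `mask_region` and `exact_match_reward`, and the exact-match reward rule are all assumptions made for illustration; the paper's actual masking and reward design are not given on this page.

```python
from typing import List

# Hypothetical grayscale image: a list of rows, each a list of pixel values.
Image = List[List[int]]


def mask_region(image: Image, top: int, left: int,
                height: int, width: int, fill: int = 0) -> Image:
    """Return a copy of `image` with a rectangle overwritten by `fill`.

    The rectangle is clipped to the image bounds; the input is not modified.
    """
    masked = [row[:] for row in image]
    for r in range(max(top, 0), min(top + height, len(masked))):
        for c in range(max(left, 0), min(left + width, len(masked[r]))):
            masked[r][c] = fill
    return masked


def exact_match_reward(prediction: str, target: str) -> float:
    """Assumed verifiable reward: 1.0 if the normalized strings match, else 0.0."""
    def normalize(text: str) -> str:
        # Lowercase and collapse runs of whitespace.
        return " ".join(text.lower().split())

    return 1.0 if normalize(prediction) == normalize(target) else 0.0


if __name__ == "__main__":
    image = [[9] * 4 for _ in range(3)]
    occluded = mask_region(image, top=1, left=1, height=1, width=2)
    print(occluded)
    # A model would see `occluded` and predict what was hidden;
    # here the prediction is a hard-coded placeholder string.
    print(exact_match_reward("  Stop Sign ", "stop sign"))
```

In this sketch the reward can be checked mechanically against a label, which is the property the abstract refers to as a verifiable reward; any real setup would need its own answer format and matching rule.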


Related papers