Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs
- URL: http://arxiv.org/abs/2510.21807v1
- Date: Tue, 21 Oct 2025 08:50:11 GMT
- Title: Activating Visual Context and Commonsense Reasoning through Masked Prediction in VLMs
- Authors: Jiaao Yu, Shenwei Li, Mingjie Han, Yifei Yin, Wenzheng Song, Chenghao Jia, Man Lan,
- Abstract summary: We introduce a novel fine-tuning task, Masked Prediction via Context and Commonsense. This task forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images. We also introduce an innovative training method, Reinforcement Fine-tuning with Prior Sampling.
- Score: 9.953258838113
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent breakthroughs in reasoning models have markedly advanced the reasoning capabilities of large language models, particularly via training on tasks with verifiable rewards. Yet, a significant gap persists in their adaptation to real-world multimodal scenarios, most notably, vision-language tasks, due to a heavy focus on single-modal language settings. While efforts to transplant reinforcement learning techniques from NLP to VLMs have emerged, these approaches often remain confined to perception-centric tasks or reduce images to textual summaries, failing to fully exploit visual context and commonsense knowledge, ultimately constraining the generalization of reasoning capabilities across diverse multimodal environments. To address this limitation, we introduce a novel fine-tuning task, Masked Prediction via Context and Commonsense, which forces models to integrate visual context and commonsense reasoning by reconstructing semantically meaningful content from occluded images, thereby laying the foundation for generalized reasoning. To systematically evaluate model performance in generalized reasoning, we developed a specialized evaluation benchmark, MPCC Eval, and employed various fine-tuning strategies to guide reasoning. Among these, we introduced an innovative training method, Reinforcement Fine-tuning with Prior Sampling, which not only enhances model performance but also improves its generalized reasoning capabilities in OOD and cross-task scenarios.
Related papers
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z)
- Youtu-VL: Unleashing Visual Potential via Unified Vision-Language Supervision [79.06371915084833]
We introduce Youtu-VL, a framework leveraging the Vision-Language Unified Autoregressive Supervision (VLUAS) paradigm. Youtu-VL applies unified autoregressive supervision to both visual details and linguistic content. We extend this paradigm to encompass vision-centric tasks, enabling a standard VLM to perform vision-centric tasks without task-specific additions.
arXiv Detail & Related papers (2026-01-27T17:01:16Z)
- Latent Implicit Visual Reasoning [59.39913238320798]
We propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks.
arXiv Detail & Related papers (2025-12-24T14:59:49Z)
- MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning [20.14427952871989]
We introduce MMRPT, a multimodal reinforcement pre-training framework that strengthens visual reasoning in MLLMs. We are the first to incorporate reinforcement learning directly into the pre-training of large vision-language models. Experiments show consistent zero-shot gains across diverse benchmarks and substantially improved robustness under supervised fine-tuning.
arXiv Detail & Related papers (2025-12-08T06:26:13Z)
- Reinforcing Spatial Reasoning in Vision-Language Models with Interwoven Thinking and Visual Drawing [62.447497430479174]
Drawing to reason in space is a novel paradigm that enables LVLMs to reason through elementary drawing operations in the visual space. Our model, named VILASR, consistently outperforms existing methods across diverse spatial reasoning benchmarks.
arXiv Detail & Related papers (2025-06-11T17:41:50Z)
- Unveiling the Compositional Ability Gap in Vision-Language Reasoning Model [39.58344147240552]
We investigate whether large vision-language models (VLMs) can compose capabilities across modalities or tasks under out-of-distribution conditions. Our findings shed light on the current limitations of RL-based reasoning VLM training and provide actionable insights toward building models that reason compositionally across modalities and tasks.
arXiv Detail & Related papers (2025-05-26T01:42:38Z)
- Integrating Visual Interpretation and Linguistic Reasoning for Math Problem Solving [61.992824291296444]
Current large vision-language models (LVLMs) typically employ a connector module to link visual features with text embeddings of large language models (LLMs). This paper proposes a paradigm shift: instead of training end-to-end vision-language reasoning models, we advocate for developing a decoupled reasoning framework.
arXiv Detail & Related papers (2025-05-23T08:18:00Z)
- Praxis-VLM: Vision-Grounded Decision Making via Text-Driven Reinforcement Learning [16.938301925105097]
This paper shows that Vision Language Models can achieve surprisingly strong decision-making performance when visual scenes are replaced by textual descriptions. We propose Praxis-VLM, a reasoning VLM for vision-grounded decision-making.
arXiv Detail & Related papers (2025-03-21T09:25:23Z)
- Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [12.728451197053321]
We propose Curriculum Reinforcement Finetuning (Curr-ReFT), a novel post-training paradigm specifically designed for small-scale vision-language models (VLMs). Curr-ReFT comprises two sequential stages: Curriculum Reinforcement Learning and Rejected Sampling-based Self-improvement. Our experiments demonstrate that models trained with the Curr-ReFT paradigm achieve state-of-the-art performance across various visual tasks.
arXiv Detail & Related papers (2025-03-10T08:48:50Z)
- Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality [69.76121008898677]
Fine-grained Selective Calibrated CLIP integrates local hard negative loss and selective calibrated regularization.
Our evaluations show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities.
arXiv Detail & Related papers (2024-10-07T17:16:20Z)
- PALM: Predicting Actions through Language Models [74.10147822693791]
We introduce PALM, an approach that tackles the task of long-term action anticipation.
Our method incorporates an action recognition model to track previous action sequences and a vision-language model to articulate relevant environmental details.
Our experimental results demonstrate that PALM surpasses the state-of-the-art methods in the task of long-term action anticipation.
arXiv Detail & Related papers (2023-11-29T02:17:27Z)