Related papers: RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding

URL: http://arxiv.org/abs/2602.00504v1
Date: Sat, 31 Jan 2026 04:13:57 GMT
Title: RGBX-R1: Visual Modality Chain-of-Thought Guided Reinforcement Learning for Multimodal Grounding
Authors: Jiahe Wu, Bing Cao, Qilong Wang, Qinghua Hu, Dongdong Li, Pengfei Zhu
Abstract summary: Multimodal Large Language Models (MLLM) are primarily pre-trained on the RGB modality. We propose RGBX-R1, a framework to enhance MLLM's perception and reasoning capacities across various X visual modalities.
Score: 69.98331019544166
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multimodal Large Language Models (MLLM) are primarily pre-trained on the RGB modality, thereby limiting their performance on other modalities, such as infrared, depth, and event data, which are crucial for complex scenarios. To address this, we propose RGBX-R1, a framework to enhance MLLM's perception and reasoning capacities across various X visual modalities. Specifically, we employ an Understand-Associate-Validate (UAV) prompting strategy to construct the Visual Modality Chain-of-Thought (VM-CoT), which aims to expand the MLLMs' RGB understanding capability into X modalities. To progressively enhance reasoning capabilities, we introduce a two-stage training paradigm: Cold-Start Supervised Fine-Tuning (CS-SFT) and Spatio-Temporal Reinforcement Fine-Tuning (ST-RFT). CS-SFT supervises the reasoning process with the guidance of VM-CoT, equipping the MLLM with fundamental modality cognition. Building upon GRPO, ST-RFT employs a Modality-understanding Spatio-Temporal (MuST) reward to reinforce modality reasoning. Notably, we construct the first RGBX-Grounding benchmark, and extensive experiments verify our superiority in multimodal understanding and spatial perception, outperforming baselines by 22.71% on three RGBX grounding tasks.
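The abstract says ST-RFT builds on GRPO and scores outputs with a MuST reward, but it does not define that reward. As a rough illustration only, the sketch below uses plain bounding-box IoU as a stand-in grounding reward and applies the group-relative normalization commonly associated with GRPO (each sampled response's reward is centered and scaled by the statistics of its own group). The names `box_iou` and `group_relative_advantages`, the `(x1, y1, x2, y2)` box format, the use of the population standard deviation, and the `eps` term are all assumptions for this example, not details taken from the paper.

```python
from statistics import mean, pstdev


def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Center and scale rewards within one group of sampled responses.

    Assumption: population standard deviation plus a small eps to avoid
    division by zero when every response gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical ground-truth box and four sampled predictions for one query.
    target = (10.0, 10.0, 50.0, 50.0)
    predictions = [
        (10.0, 10.0, 50.0, 50.0),   # exact match
        (20.0, 20.0, 60.0, 60.0),   # partial overlap
        (0.0, 0.0, 30.0, 30.0),     # partial overlap
        (60.0, 60.0, 90.0, 90.0),   # no overlap
    ]
    rewards = [box_iou(p, target) for p in predictions]
    for p, r, adv in zip(predictions, rewards, group_relative_advantages(rewards)):
        print(p, round(r, 3), round(adv, 3))
```

Responses with above-average reward in their group receive positive advantages and the rest negative ones, which is the signal a policy-gradient update would then use. A spatio-temporal or modality-aware reward such as the one the abstract names would replace `box_iou` here.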

Related papers

Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning [15.894854593567963]
Reinforcement learning can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks. We build an updated vision-language model based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability. We find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results.
arXiv Detail & Related papers (2025-12-11T14:36:14Z)
Co-Training Vision Language Models for Remote Sensing Multi-task Learning [68.15604397741753]
Vision language models (VLMs) have achieved promising results in RS image understanding, grounding, and ultra-high-resolution (UHR) image reasoning. We present RSCoVLM, a simple yet flexible VLM baseline for RS MTL. We propose a unified dynamic-resolution strategy to address the diverse image scales inherent in RS imagery.
arXiv Detail & Related papers (2025-11-26T10:55:07Z)
Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation [52.11339614452127]
Vision-and-Language Navigation (VLN) requires an agent to dynamically explore complex 3D environments following human instructions. Recent research underscores the potential of harnessing large language models (LLMs) for VLN, given their commonsense knowledge and general reasoning capabilities. We propose a novel dual-process thinking framework dubbed R3, integrating LLMs' generalization capabilities with VLN-specific expertise in a zero-shot manner.
arXiv Detail & Related papers (2025-11-18T04:32:00Z)
Actial: Activate Spatial Reasoning Ability of Multimodal Large Language Models [75.45940282834327]
We introduce Viewpoint Learning, a task designed to evaluate and improve the spatial reasoning capabilities of MLLMs. We present the Viewpoint-100K dataset, consisting of 100K object-centric image pairs with diverse viewpoints and corresponding question-answer pairs. Our approach employs a two-stage fine-tuning strategy, resulting in significant improvements across multiple tasks.
arXiv Detail & Related papers (2025-11-03T14:27:00Z)
HOID-R1: Reinforcement Learning for Open-World Human-Object Interaction Detection Reasoning with Multimodal Large Language Model [13.82578761807402]
We introduce HOID-R1, the first HOI detection framework that integrates chain-of-thought (CoT) guided fine-tuning with group relative policy optimization. To mitigate hallucinations in the CoT reasoning, we introduce an "MLLM-as-a-judge" mechanism that supervises the CoT outputs. Experiments show that HOID-R1 achieves state-of-the-art performance on HOI detection benchmarks and outperforms existing methods in open-world generalization to novel scenarios.
arXiv Detail & Related papers (2025-08-15T09:28:57Z)
Reinforcement Learning Tuning for VideoLLMs: Reward Design and Data Efficiency [56.475612147721264]
We propose a dual-reward formulation that supervises both semantic and temporal reasoning through discrete and continuous reward signals. We evaluate our approach across eight representative video understanding tasks, including VideoQA, Temporal Video Grounding, and Grounded VideoQA. Results underscore the importance of reward design and data selection in advancing reasoning-centric video understanding with MLLMs.
arXiv Detail & Related papers (2025-06-02T17:28:26Z)
EarthGPT-X: A Spatial MLLM for Multi-level Multi-Source Remote Sensing Imagery Understanding with Visual Prompting [46.44805092655782]
EarthGPT-X is proposed, the first flexible spatial MLLM that unifies multi-source RS imagery comprehension. It accomplishes both coarse-grained and fine-grained visual tasks under diverse visual prompts in a single framework.
arXiv Detail & Related papers (2025-04-17T09:56:35Z)
Exploring the Effect of Reinforcement Learning on Video Understanding: Insights from SEED-Bench-R1 [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark designed to evaluate post-training methods for MLLMs in video understanding. It includes intricate real-world videos and complex everyday planning tasks in the format of multiple-choice questions. Using Qwen2-VL-Instruct-7B as a base model, we compare RL with supervised fine-tuning (SFT). Our detailed analysis reveals that RL enhances visual perception but often produces less coherent reasoning chains.
arXiv Detail & Related papers (2025-03-31T17:55:23Z)
Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning of Vision Language Models [42.75418134743927]
Reason-RFT is a two-stage reinforcement fine-tuning framework for visual reasoning. First, Supervised Fine-Tuning (SFT) with curated CoT data activates the reasoning potential of Vision-Language Models (VLMs). Second, reinforcement learning based on Group Relative Policy Optimization (GRPO) generates multiple reasoning-response pairs to enhance adaptability to domain shifts.
arXiv Detail & Related papers (2025-03-26T17:38:06Z)
UniRGB-IR: A Unified Framework for Visible-Infrared Semantic Tasks via Adapter Tuning [34.727262809777095]
We propose UniRGB-IR, a scalable and efficient framework for RGB-IR semantic tasks. Our framework comprises three key components: a vision transformer (ViT) foundation model, a Multi-modal Feature Pool (MFP) module, and a Supplementary Feature Injector (SFI) module. Experimental results on various RGB-IR semantic tasks demonstrate that our method can achieve state-of-the-art performance.
arXiv Detail & Related papers (2024-04-26T12:21:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.