GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
- URL: http://arxiv.org/abs/2509.25026v3
- Date: Tue, 14 Oct 2025 08:30:43 GMT
- Title: GeoVLM-R1: Reinforcement Fine-Tuning for Improved Remote Sensing Reasoning
- Authors: Mustansar Fiaz, Hiyam Debary, Paolo Fraccaro, Danda Paudel, Luc Van Gool, Fahad Khan, Salman Khan
- Abstract summary: We propose a novel post-training framework that incorporates task-aware rewards to enable effective adaptation of reasoning-based RL models to diverse Earth Observation tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness.
- Score: 47.13305707860122
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent advances in reinforcement learning (RL) have delivered strong reasoning capabilities in natural image domains, yet their potential for Earth Observation (EO) remains largely unexplored. EO tasks introduce unique challenges, spanning referred object detection, image or region captioning, change detection, grounding, and temporal analysis, that demand task-aware reasoning. We propose a novel post-training framework that incorporates task-aware rewards to enable effective adaptation of reasoning-based RL models to diverse EO tasks. This training strategy enhances reasoning capabilities for remote sensing images, stabilizes optimization, and improves robustness. Extensive experiments across multiple EO benchmarks show consistent performance gains over state-of-the-art generic and specialized vision-language models. Code and models will be released publicly at https://mustansarfiaz.github.io/GeoVLM-R1/.
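The abstract does not spell out how the task-aware rewards are computed, but the core idea of dispatching a different verifiable reward per EO task can be illustrated with a short sketch. The function names, task keys, and reward choices below are assumptions for illustration only, not the released GeoVLM-R1 code.

```python
# Illustrative sketch of a task-aware reward dispatcher (assumed, not GeoVLM-R1's code).
from difflib import SequenceMatcher

def box_iou(pred, gt):
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(pred[0], gt[0]), max(pred[1], gt[1])
    x2, y2 = min(pred[2], gt[2]), min(pred[3], gt[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda b: max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area(pred) + area(gt) - inter
    return inter / union if union > 0 else 0.0

def task_aware_reward(task, prediction, reference):
    """Return a scalar reward in [0, 1], chosen per EO task type (hypothetical keys)."""
    if task in ("referred_detection", "grounding"):
        return box_iou(prediction, reference)  # localization quality
    if task in ("captioning", "region_captioning"):
        # crude text-overlap proxy; a real system would use a caption metric
        return SequenceMatcher(None, prediction, reference).ratio()
    if task in ("change_detection", "temporal_analysis"):
        return 1.0 if prediction.strip().lower() == reference.strip().lower() else 0.0
    return 0.0
```

In a GRPO-style post-training loop, such a scalar would score each sampled completion for the task at hand before group-relative advantages are computed; the exact reward definitions used by GeoVLM-R1 should be taken from the released code once available.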
Related papers
- More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models [17.431298099935344]
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Recent research has sought to extend reasoning to Vision-Language Models (VLMs). Our study uncovers the dual nature of multimodal reasoning, which can lead to recognition failures on otherwise basic visual questions. We propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories.
arXiv Detail & Related papers (2025-09-30T06:37:47Z)
- ThinkFake: Reasoning in Multimodal Large Language Models for AI-Generated Image Detection [51.93101033997245]
The increasing realism of AI-generated images has raised serious concerns about misinformation and privacy violations. We propose ThinkFake, a novel reasoning-based and generalizable framework for AI-generated image detection. We show that ThinkFake outperforms state-of-the-art methods on the GenImage benchmark and demonstrates strong zero-shot generalization on the challenging LOKI benchmark.
arXiv Detail & Related papers (2025-09-24T07:34:09Z)
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains (a minimal sketch of these components appears after this list). Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
- PeRL: Permutation-Enhanced Reinforcement Learning for Interleaved Vision-Language Reasoning [50.21619363035618]
We propose PeRL, a general reinforcement learning approach tailored for interleaved multimodal tasks. We introduce permutations of image sequences to simulate varied positional relationships and explore greater spatial and positional diversity. Our experiments confirm that the PeRL-trained model consistently surpasses R1-related and interleaved VLM baselines by a large margin.
arXiv Detail & Related papers (2025-06-17T18:25:56Z)
- Agentic Episodic Control [16.94652073521156]
Reinforcement learning (RL) has driven breakthroughs in AI, from game-play to scientific discovery and AI alignment. Recent advances suggest that large language models, with their rich world knowledge and reasoning capabilities, could complement RL by enabling semantic state modeling and task-agnostic planning. We propose Agentic Episodic Control (AEC), a novel architecture that integrates RL with large language models to enhance decision-making.
arXiv Detail & Related papers (2025-06-02T08:57:37Z)
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [67.87579664988199]
TON is a two-stage training strategy for vision-language models (VLMs). It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce the completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z)
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model [29.524164786422368]
Recently, DeepSeek R1 has shown that reinforcement learning can substantially improve the reasoning capabilities of Large Language Models (LLMs). We investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs). We develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks.
arXiv Detail & Related papers (2025-04-10T10:05:15Z)
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- GTR: Guided Thought Reinforcement Prevents Thought Collapse in RL-based VLM Agent Training [62.536191233049614]
Reinforcement learning with verifiable outcome rewards (RLVR) has effectively scaled up chain-of-thought (CoT) reasoning in large language models (LLMs). This work investigates whether the same holds for VLM agents through extensive experiments on complex card games, such as 24 points, and embodied tasks from ALFWorld. We find that when rewards are based solely on action outcomes, RL fails to incentivize CoT reasoning in VLMs, instead leading to a phenomenon we term thought collapse.
arXiv Detail & Related papers (2025-03-11T15:17:02Z)
- Reflexive Guidance: Improving OoDD in Vision-Language Models via Self-Guided Image-Adaptive Concept Generation [4.506099292980221]
We evaluate and analyze the OoDD capabilities of various proprietary and open-source LVLMs. We propose a self-guided prompting approach, termed Reflexive Guidance (ReGuide), aimed at enhancing the OoDD capability of LVLMs. Experimental results demonstrate that our ReGuide enhances the performance of current LVLMs in both image classification and OoDD tasks.
arXiv Detail & Related papers (2024-10-19T04:46:51Z)
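As flagged in the "Scaling Up RL" entry above, the three training ingredients that summary names, KL regularization, a clipping ratio, and periodic reference policy resets, can be sketched in a few lines. This is a minimal illustrative sketch under assumed hyperparameters and loss form, not that paper's implementation.

```python
# Minimal sketch (assumed form): PPO-style clipped objective plus a KL penalty
# to a frozen reference policy, with a periodic reference reset.
import copy
import torch

def clipped_kl_loss(logp_new, logp_old, logp_ref, advantages,
                    clip_eps=0.2, kl_coef=0.01):
    """Clipped policy-gradient loss with a KL penalty toward the reference policy."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # k3-style estimator of KL(policy || reference)
    kl = (torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1).mean()
    return policy_loss + kl_coef * kl

def maybe_reset_reference(policy, reference, step, reset_every=2000):
    """Periodically re-anchor the frozen reference to the current policy weights."""
    if step > 0 and step % reset_every == 0:
        reference.load_state_dict(copy.deepcopy(policy.state_dict()))
    return reference
```

The reset keeps the KL anchor from drifting too far behind the improving policy during prolonged training; the specific reset interval and KL coefficient here are placeholders.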