UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning
- URL: http://arxiv.org/abs/2508.11196v1
- Date: Fri, 15 Aug 2025 04:06:40 GMT
- Title: UAV-VL-R1: Generalizing Vision-Language Models via Supervised Fine-Tuning and Multi-Stage GRPO for UAV Visual Reasoning
- Authors: Jiajin Guan, Haibo Mei, Bonan Zhang, Dan Liu, Yuanshuang Fu, Yue Zhang,
- Abstract summary: We propose UAV-VL-R1, a lightweight vision-language model specifically designed for aerial visual reasoning tasks.<n>It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL)<n>We show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant.
- Score: 11.872945853854628
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in vision-language models (VLMs) have demonstrated strong generalization in natural image tasks. However, their performance often degrades on unmanned aerial vehicle (UAV)-based aerial imagery, which features high resolution, complex spatial semantics, and strict real-time constraints. These challenges limit the applicability of general-purpose VLMs to structured aerial reasoning tasks. To address these challenges, we propose UAV-VL-R1, a lightweight VLM explicitly designed for aerial visual reasoning. It is trained using a hybrid method that combines supervised fine-tuning (SFT) and multi-stage reinforcement learning (RL). We leverage the group relative policy optimization (GRPO) algorithm to promote structured and interpretable reasoning through rule-guided rewards and intra-group policy alignment. To support model training and evaluation, we introduce a high-resolution visual question answering dataset named HRVQA-VL, which consists of 50,019 annotated samples covering eight UAV-relevant reasoning tasks, including object counting, transportation recognition, and spatial scene inference. Experimental results show that UAV-VL-R1 achieves a 48.17% higher zero-shot accuracy than the Qwen2-VL-2B-Instruct baseline and even outperforms its 72B-scale variant, which is 36x larger, on multiple tasks. Ablation studies reveal that while SFT improves semantic alignment, it may reduce reasoning diversity in mathematical tasks. GRPO-based RL compensates for this limitation by enhancing logical flexibility and the robustness of inference. Additionally, UAV-VL-R1 requires only 3.9GB of memory under FP16 inference and can be quantized to 2.5GB with INT8, supporting real-time deployment on resource-constrained UAV platforms.
Related papers
- STVG-R1: Incentivizing Instance-Level Reasoning and Grounding in Videos via Reinforcement Learning [65.36458157092207]
In vision-language models (VLMs), misalignment between textual descriptions and visual coordinates often induces hallucinations.<n>We propose a novel visual prompting paradigm that avoids the difficult problem of aligning coordinates across modalities.<n>We introduce STVG-R1, the first reinforcement learning framework for STVG, which employs a task-driven reward to jointly optimize temporal accuracy, spatial consistency, and structural format regularization.
arXiv Detail & Related papers (2026-02-12T08:53:32Z) - Enhancing Radiology Report Generation and Visual Grounding using Reinforcement Learning [15.894854593567963]
Reinforcement learning can incorporate task-specific feedback, and its combination with explicit intermediate reasoning ("thinking") has demonstrated substantial gains on verifiable math and coding tasks.<n>We build an updated vision-language model based on Qwen3-VL, followed by a cold-start SFT stage that equips the model with basic thinking ability.<n>We find that while strong SFT remains crucial for high base performance, RL provides additional gains on both tasks, whereas explicit thinking does not appear to further improve results.
arXiv Detail & Related papers (2025-12-11T14:36:14Z) - BasicAVSR: Arbitrary-Scale Video Super-Resolution via Image Priors and Enhanced Motion Compensation [70.27358326228399]
We propose a BasicAVSR for Arbitrary-scale video super-resolution (AVSR)<n>AVSR aims to enhance the resolution of video frames, potentially various scaling factors.<n>We show that BasicAVSR significantly outperforms existing methods in terms of super-resolution quality, generalization ability, and inference speed.
arXiv Detail & Related papers (2025-10-30T05:08:45Z) - Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization [61.55616421408666]
Low-Altitude Economy Networks (LAENets) have enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection.<n> onboard vision (VLMs) offer inference for real-time inference but limited onboard dynamic network conditions.<n>We propose a UAV-enabled LAENet system that improves communication efficiency under dynamic LAENet conditions.
arXiv Detail & Related papers (2025-10-11T05:11:21Z) - Light-Weight Cross-Modal Enhancement Method with Benchmark Construction for UAV-based Open-Vocabulary Object Detection [6.443926939309045]
We propose a complete UAV-oriented solution that combines both dataset construction and model innovation.<n>First, we design a refined UAV-Label Engine, which efficiently resolves annotation redundancy, inconsistency, and ambiguity.<n>Second, we introduce the Cross-Attention Gated Enhancement (CAGE) module, a lightweight dual-path fusion design that integrates cross-attention, adaptive gating, and global FiLM modulation for robust textvision alignment.
arXiv Detail & Related papers (2025-09-07T10:59:02Z) - LLM Meets the Sky: Heuristic Multi-Agent Reinforcement Learning for Secure Heterogeneous UAV Networks [57.27815890269697]
This work focuses on maximizing the secrecy rate in heterogeneous UAV networks (HetUAVNs) under energy constraints.<n>We introduce a Large Language Model (LLM)-guided multi-agent learning approach.<n>Results show that our method outperforms existing baselines in secrecy and energy efficiency.
arXiv Detail & Related papers (2025-07-23T04:22:57Z) - VL-GenRM: Enhancing Vision-Language Verification via Vision Experts and Iterative Training [23.391643634478587]
Vision-Language Reward Model (VL-RM) is key to aligning VL models by providing structured feedback.<n> bootstrapping dilemma arises as high-quality training data depends on already strong VL models.<n>We propose an iterative training framework leveraging vision experts, Chain-of-Thought rationales, and Margin-based Rejection Sampling.
arXiv Detail & Related papers (2025-06-16T18:10:51Z) - SVQA-R1: Reinforcing Spatial Reasoning in MLLMs via View-Consistent Reward Optimization [57.484274282231226]
We propose SVQA-R1, the first framework to extend R1-style training to spatial VQA.<n>In particular, we introduce Spatial-GRPO, a novel group-wise RL strategy that constructs view-consistent rewards by perturbing spatial relations between objects.<n>Our model, SVQA-R1, not only dramatically improved accuracy on spatial VQA benchmarks but also exhibits interpretable reasoning paths even without using supervised fine-tuning data.
arXiv Detail & Related papers (2025-06-02T06:58:43Z) - FlightGPT: Towards Generalizable and Interpretable UAV Vision-and-Language Navigation with Vision-Language Models [11.286340789648813]
Unmanned Aerial Vehicle (UAV) Vision-and-Language Navigation (VLN) is vital for applications such as disaster response, logistics delivery, and urban inspection.<n>We propose FlightGPT, a novel UAV VLN framework built upon Vision-Language Models (VLMs) with powerful multimodal perception capabilities.<n>We show that FlightGPT achieves state-of-the-art performance across all scenarios, with a 9.22% higher success rate than the strongest baseline in unseen environments.
arXiv Detail & Related papers (2025-05-19T08:21:20Z) - VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models [63.27511432647797]
We propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes.<n>We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V.
arXiv Detail & Related papers (2024-12-02T18:58:25Z) - VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models [66.56298924208319]
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems.<n>Current assessment methods primarily rely on AI-annotated preference labels from traditional tasks.<n>We introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks.
arXiv Detail & Related papers (2024-11-26T14:08:34Z) - An Empirical Study of Training End-to-End Vision-and-Language
Transformers [50.23532518166621]
We present METER(textbfMultimodal textbfEnd-to-end textbfTransformtextbfER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), multimodal fusion (e.g., merged attention vs. co-
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.