GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
- URL: http://arxiv.org/abs/2602.09701v1
- Date: Tue, 10 Feb 2026 11:59:14 GMT
- Title: GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
- Authors: Sandesh Hegde, Jaison Saji Chacko, Debarshi Banerjee, Uma Mahesh,
- Abstract summary: We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.
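The abstract reports results in both cIoU and mIoU, which aggregate per-mask overlap differently. As a minimal illustrative sketch (not the paper's evaluation code), the two metrics might be computed over binary masks like this:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def ciou(preds, gts) -> float:
    """Cumulative IoU: pool intersections and unions over the whole
    dataset, so large objects contribute more than small ones."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(inter / union) if union > 0 else 0.0

def miou(preds, gts) -> float:
    """Mean IoU: average per-sample IoUs, weighting every sample equally."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```

Because cIoU pools pixels while mIoU averages per sample, a model that is accurate on small objects can score noticeably higher on mIoU than cIoU, which is consistent with the two numbers reported above differing.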
Related papers
- How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks [3.099103925863002]
We study the scaling behavior and deployment efficiency of compact language models for network-level semantic reasoning in AI-native 6G systems. We evaluate models ranging from 135M (SmolLM2-135M) to 7B parameters (Qwen2.5-7B), including mid-scale architectures such as Llama-3.2-1B, Granite-1B, and Qwen2.5-3B.
arXiv Detail & Related papers (2026-03-02T18:19:49Z)
- RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations [52.752467948588816]
We propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance.
arXiv Detail & Related papers (2025-12-30T06:50:11Z)
- LENS: Learning to Segment Anything with Unified Reinforced Reasoning [38.582392908238866]
We introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%.
arXiv Detail & Related papers (2025-08-19T17:59:53Z)
- Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning [38.375639439367255]
Seg-R1 is a preliminary exploration of using reinforcement learning to enhance the pixel-level understanding and reasoning capabilities of large multimodal models. We introduce Group Relative Policy Optimization into the segmentation domain, equipping the LMM with pixel-level comprehension. Seg-R1 achieves remarkable performance with purely RL-based training, reaching a 0.873 S-measure on COD10K without complex model modifications.
arXiv Detail & Related papers (2025-06-27T20:40:45Z)
- GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
- Understanding R1-Zero-Like Training: A Critical Perspective [73.25430192337235]
We critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. We present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model.
arXiv Detail & Related papers (2025-03-26T17:59:14Z)
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [52.66700314820547]
Seg-Zero is a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero is trained exclusively via reinforcement learning with GRPO and without explicit reasoning data. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%.
arXiv Detail & Related papers (2025-03-09T08:48:51Z)
- SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation [11.243400478302771]
Referring Expression Segmentation (RES) aims to produce a segmentation mask of the target object in an image referred to by the text.
We propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations.
arXiv Detail & Related papers (2024-07-02T16:02:25Z)
- Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training [79.27663870280038]
We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts.
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp.
arXiv Detail & Related papers (2024-03-04T18:55:30Z)
- PVG: Progressive Vision Graph for Vision Recognition [48.11440886492801]
We propose a Progressive Vision Graph (PVG) architecture for vision recognition tasks. PVG contains three main components: 1) Progressively Separated Graph Construction (PSGC), 2) a neighbor-node information aggregation and update module, and 3) Graph error Linear Unit (GraphLU).
arXiv Detail & Related papers (2023-08-01T14:35:29Z)
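Several of the papers above (GenSeg-R1, Seg-R1, Seg-Zero, GRPO-CARE) train with Group Relative Policy Optimization, whose defining step is to score each response relative to the other responses sampled for the same prompt rather than against a learned value function. As an illustrative sketch of that one step only (not any of these papers' implementations), the group-relative advantage can be written as a within-group standardization of the rewards:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardize each sampled response's
    reward against the mean and std of its own group, so responses are
    ranked only against their siblings for the same prompt."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Example: a group of 4 responses for one prompt, scored by an
# IoU-style reward (hypothetical values for illustration).
rewards = np.array([0.2, 0.4, 0.6, 0.8])
adv = grpo_advantages(rewards)
```

Responses scoring above the group mean get positive advantages and are reinforced; those below get negative ones, with no critic network required.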
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.