GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
- URL: http://arxiv.org/abs/2602.09701v1
- Date: Tue, 10 Feb 2026 11:59:14 GMT
- Title: GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation
- Authors: Sandesh Hegde, Jaison Saji Chacko, Debarshi Banerjee, Uma Mahesh,
- Abstract summary: We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts: a bounding box plus two interior keypoints for every referred instance. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks. Within our GenSeg-R1 framework we finetune Qwen3-VL models (4B and 8B parameters) using Group Relative Policy Optimization (GRPO), requiring no supervised reasoning-chain annotations. On RefCOCOg validation our best model (GenSeg-R1-8B) achieves 0.7127 cIoU and 0.7382 mIoU, substantially outperforming the corresponding Qwen3-VL Instruct baselines (+15.3 and +21.9 points, respectively) and surpassing Seg-Zero-7B [3] by +3.3 cIoU under identical evaluation. We further introduce GenSeg-R1-G, a variant trained on GRefCOCO [9] with a SAM 2 in-the-loop reward that directly optimizes mask quality. On GRefCOCO validation GenSeg-R1-G achieves 76.69% target mIoU with 82.40% accuracy on negative (no-target) prompts, substantially outperforming Seg-R1-7B and Seg-Zero-7B, which lack no-target detection capability. On ReasonSeg test, GenSeg-R1-4B reaches 68.40% mIoU, surpassing Seg-Zero-7B by +7.0 and Seg-R1-7B by +10.7 points.
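The abstract reports results in both cIoU and mIoU, which aggregate per-mask overlap differently. As a minimal illustrative sketch (not the paper's evaluation code), the two metrics might be computed over binary masks like this:

```python
import numpy as np

def iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two binary masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter / union) if union > 0 else 0.0

def ciou(preds, gts) -> float:
    """Cumulative IoU: pool intersections and unions over the whole
    dataset, so large objects contribute more than small ones."""
    inter = sum(np.logical_and(p, g).sum() for p, g in zip(preds, gts))
    union = sum(np.logical_or(p, g).sum() for p, g in zip(preds, gts))
    return float(inter / union) if union > 0 else 0.0

def miou(preds, gts) -> float:
    """Mean IoU: average per-sample IoUs, weighting every sample equally."""
    return float(np.mean([iou(p, g) for p, g in zip(preds, gts)]))
```

Because cIoU pools pixels while mIoU averages per sample, a model that is accurate on small objects can score noticeably higher on mIoU than cIoU, which is consistent with the two numbers reported above differing.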
Related papers
- How Small Can 6G Reason? Scaling Tiny Language Models for AI-Native Networks [3.099103925863002]
We study the scaling behavior and deployment efficiency of compact language models for network-level semantic reasoning in AI-native 6G systems. We evaluate models ranging from 135M (SmolLM2-135M) to 7B parameters (Qwen2.5-7B), including mid-scale architectures such as Llama-3.2-1B, Granite-1B, and Qwen2.5-3B.
arXiv Detail & Related papers (2026-03-02T18:19:49Z)
- RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations [52.752467948588816]
We propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance.
arXiv Detail & Related papers (2025-12-30T06:50:11Z)
- LENS: Learning to Segment Anything with Unified Reinforced Reasoning [38.582392908238866]
We introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%.
arXiv Detail & Related papers (2025-08-19T17:59:53Z)
- Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning [38.375639439367255]
Seg-R1 is a preliminary exploration of using reinforcement learning to enhance the pixel-level understanding and reasoning capabilities of large multimodal models. We introduce Group Relative Policy Optimization into the segmentation domain, equipping the LMM with pixel-level comprehension. Seg-R1 achieves remarkable performance with purely RL-based training, reaching a 0.873 S-measure on COD10K without complex model modifications.
arXiv Detail & Related papers (2025-06-27T20:40:45Z)
- GRPO-CARE: Consistency-Aware Reinforcement Learning for Multimodal Reasoning [53.894789613838654]
We introduce SEED-Bench-R1, a benchmark with complex real-world videos requiring balanced perception and reasoning. Using SEED-Bench-R1, we find that standard GRPO, while improving answer accuracy, often reduces logical coherence between reasoning steps and answers, with only a 57.9% consistency rate. We propose GRPO-CARE, a consistency-aware RL framework optimizing both answer correctness and reasoning coherence without explicit supervision.
arXiv Detail & Related papers (2025-06-19T08:49:13Z)
- Understanding R1-Zero-Like Training: A Critical Perspective [73.25430192337235]
We critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. We present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model.
arXiv Detail & Related papers (2025-03-26T17:59:14Z)
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [52.66700314820547]
Seg-Zero is a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero is trained exclusively via reinforcement learning with GRPO and without explicit reasoning data. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%.
arXiv Detail & Related papers (2025-03-09T08:48:51Z)
- SafaRi: Adaptive Sequence Transformer for Weakly Supervised Referring Expression Segmentation [11.243400478302771]
Referring Expression Segmentation (RES) aims to produce a segmentation mask of the target object in an image referred to by the text.
We propose a weakly-supervised bootstrapping architecture for RES with several new algorithmic innovations.
arXiv Detail & Related papers (2024-07-02T16:02:25Z)
- Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training [79.27663870280038]
We introduce Contrastive Region Guidance (CRG), a training-free guidance method that enables open-source vision-language models to respond to visual prompts.
When region annotations are provided, CRG increases absolute accuracy by up to 11.1% on ViP-Bench.
We also show CRG's applicability to spatial reasoning, with 10% improvement on What'sUp.
arXiv Detail & Related papers (2024-03-04T18:55:30Z)
- PVG: Progressive Vision Graph for Vision Recognition [48.11440886492801]
We propose a Progressive Vision Graph (PVG) architecture for vision recognition tasks. PVG contains three main components: 1) Progressively Separated Graph Construction (PSGC), 2) a neighbor-node information aggregation and update module, and 3) Graph error Linear Unit (GraphLU).
arXiv Detail & Related papers (2023-08-01T14:35:29Z)
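Several of the papers above (GenSeg-R1, Seg-R1, Seg-Zero, GRPO-CARE) train with Group Relative Policy Optimization, whose defining step is to score each response relative to the other responses sampled for the same prompt rather than against a learned value function. As an illustrative sketch of that one step only (not any of these papers' implementations), the group-relative advantage can be written as a within-group standardization of the rewards:

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages: standardize each sampled response's
    reward against the mean and std of its own group, so responses are
    ranked only against their siblings for the same prompt."""
    mean = rewards.mean()
    std = rewards.std()
    return (rewards - mean) / (std + eps)

# Example: a group of 4 responses for one prompt, scored by an
# IoU-style reward (hypothetical values for illustration).
rewards = np.array([0.2, 0.4, 0.6, 0.8])
adv = grpo_advantages(rewards)
```

Responses scoring above the group mean get positive advantages and are reinforced; those below get negative ones, with no critic network required.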
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.