Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning
- URL: http://arxiv.org/abs/2506.22624v1
- Date: Fri, 27 Jun 2025 20:40:45 GMT
- Title: Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning
- Authors: Zuyao You, Zuxuan Wu
- Abstract summary: Seg-R1 is a preliminary exploration of using reinforcement learning to enhance the pixel-level understanding and reasoning capabilities of large multimodal models. We introduce Group Relative Policy Optimization into the segmentation domain, equipping the LMM with pixel-level comprehension. Seg-R1 achieves remarkable performance with purely RL-based training, reaching a 0.873 S-measure on COD10K without complex model modification.
- Score: 38.375639439367255
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present Seg-R1, a preliminary exploration of using reinforcement learning (RL) to enhance the pixel-level understanding and reasoning capabilities of large multimodal models (LMMs). Starting with foreground segmentation tasks, specifically camouflaged object detection (COD) and salient object detection (SOD), our approach enables the LMM to generate point and bounding box prompts in a next-token fashion, which are then used to guide SAM2 in producing segmentation masks. We introduce Group Relative Policy Optimization (GRPO) into the segmentation domain, equipping the LMM with pixel-level comprehension through a carefully designed training strategy. Notably, Seg-R1 achieves remarkable performance with purely RL-based training, achieving a 0.873 S-measure on COD10K without complex model modification. Moreover, we found that pure RL training demonstrates strong open-world generalization. Despite being trained solely on foreground segmentation image-mask pairs without text supervision, Seg-R1 achieves impressive zero-shot performance on referring segmentation and reasoning segmentation tasks, with 71.4 cIoU on RefCOCOg test and 56.7 gIoU on ReasonSeg test, outperforming models fully supervised on these datasets.
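The abstract describes two pieces that can be sketched concretely: a GRPO-style group-relative advantage (each sampled rollout's reward normalized against its group's statistics) and a mask-overlap reward of the kind such training could use. The sketch below is illustrative only; `group_relative_advantage` and `segmentation_reward` are hypothetical names, not the authors' implementation, and the IoU reward is an assumption about the reward design.

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize each rollout's reward against
    the mean and standard deviation of its sampling group, so no
    learned value function (critic) is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def segmentation_reward(pred_mask, gt_mask):
    """A simple IoU reward between a predicted mask (e.g. produced by
    SAM2 from the LMM's point/box prompts) and the ground-truth mask."""
    pred = np.asarray(pred_mask).astype(bool)
    gt = np.asarray(gt_mask).astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(pred, gt).sum() / union)
```

For a group of rollouts on the same image, one would score each candidate mask with `segmentation_reward` and feed the resulting rewards to `group_relative_advantage` to weight the policy-gradient update.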
Related papers
- RSAgent: Learning to Reason and Act for Text-Guided Segmentation via Multi-Turn Tool Invocations [52.752467948588816]
We propose RSAgent, an agentic Multimodal Large Language Model (MLLM) which interleaves reasoning and action for segmentation via multi-turn tool invocations. RSAgent queries a segmentation toolbox, observes visual feedback, and revises its spatial hypothesis using historical observations to re-localize targets and iteratively refine masks. Extensive experiments show that RSAgent achieves a zero-shot performance of 66.5% gIoU on ReasonSeg test, improving over Seg-Zero-7B by 9%, and reaches 81.5% cIoU on RefCOCOg, demonstrating state-of-the-art performance.
arXiv Detail & Related papers (2025-12-30T06:50:11Z)
- FineRS: Fine-grained Reasoning and Segmentation of Small Objects with Reinforcement Learning [62.11389260206383]
FineRS is a two-stage MLLM-based reinforcement learning framework for segmenting extremely small objects. We present FineRS-4k, a new dataset for evaluating MLLMs on attribute-level reasoning and pixel-level segmentation on subtle, small-scale targets.
arXiv Detail & Related papers (2025-10-24T10:14:17Z)
- First RAG, Second SEG: A Training-Free Paradigm for Camouflaged Object Detection [14.070196423996045]
Existing approaches often rely on heavy training and large computational resources. We propose RAG-SEG, a training-free paradigm that decouples COD into two stages: Retrieval-Augmented Generation (RAG) for generating coarse masks as prompts, followed by SAM-based segmentation (SEG) for refinement. RAG-SEG constructs a compact retrieval database via unsupervised clustering, enabling fast and effective feature retrieval. Experiments on benchmark COD datasets demonstrate that RAG-SEG performs on par with or surpasses state-of-the-art methods.
arXiv Detail & Related papers (2025-08-21T07:14:18Z)
- LENS: Learning to Segment Anything with Unified Reinforced Reasoning [38.582392908238866]
We introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method GLaMM by up to 5.6%.
arXiv Detail & Related papers (2025-08-19T17:59:53Z)
- Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z)
- Segment Concealed Objects with Incomplete Supervision [63.637733655439334]
Incompletely-Supervised Concealed Object Segmentation (ISCOS) involves segmenting objects that seamlessly blend into their surrounding environments. This task remains highly challenging due to the limited supervision provided by the incompletely annotated training data. In this paper, we introduce the first unified method for ISCOS to address these challenges.
arXiv Detail & Related papers (2025-06-10T16:25:15Z)
- SAM-R1: Leveraging SAM for Reward Feedback in Multimodal Segmentation via Reinforcement Learning [26.167394979565454]
We propose SAM-R1, a novel framework that enables multimodal large models to perform fine-grained reasoning in image understanding tasks. Our approach is the first to incorporate fine-grained segmentation settings during the training of multimodal reasoning models. With only 3k training samples, SAM-R1 achieves strong performance across multiple benchmarks.
arXiv Detail & Related papers (2025-05-28T17:08:28Z)
- Cross-Modal Consistency Learning for Sign Language Recognition [92.44927164283641]
Existing pre-training methods solely focus on the compact pose data. We propose a Cross-modal Consistency Learning framework (CCL-SLR). CCL-SLR learns from both RGB and pose modalities based on self-supervised pre-training.
arXiv Detail & Related papers (2025-03-16T12:34:07Z)
- CLISC: Bridging CLIP and SAM by Enhanced CAM for Unsupervised Brain Tumor Segmentation [6.438259303569066]
A vision-language model (i.e., CLIP) is employed to obtain image-level pseudo-labels for training a classification network. A 3D segmentation network is trained with the SAM-derived pseudo-labels, where low-quality pseudo-labels are filtered out in a self-learning process. Our approach obtained an average Dice Similarity Score (DSC) of 85.60%, outperforming five state-of-the-art unsupervised segmentation methods by more than 10 percentage points.
arXiv Detail & Related papers (2025-01-27T17:43:51Z)
- PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model [49.80313655590392]
PSALM is a powerful extension of the Large Multi-modal Model (LMM) to address the segmentation task challenges.
It incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks.
The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization.
arXiv Detail & Related papers (2024-03-21T17:50:47Z)
- CLIP Is Also a Good Teacher: A New Learning Framework for Inductive Zero-shot Semantic Segmentation [6.181169909576527]
Generalized Zero-shot Semantic Segmentation aims to segment both seen and unseen categories under the supervision of the seen ones only.
Existing methods adopt the large-scale Vision Language Models (VLMs) which obtain outstanding zero-shot performance.
We propose CLIP-ZSS (Zero-shot Semantic Segmentation), a training framework that enables any image encoder designed for closed-set segmentation to be applied to zero-shot and open-vocabulary tasks.
arXiv Detail & Related papers (2023-10-03T09:33:47Z)
- USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality.
Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
- IDEAL: Improved DEnse locAL Contrastive Learning for Semi-Supervised Medical Image Segmentation [3.6748639131154315]
We extend the concept of metric learning to the segmentation task.
We propose a simple convolutional projection head for obtaining dense pixel-level features.
A bidirectional regularization mechanism involving two-stream regularization training is devised for the downstream task.
arXiv Detail & Related papers (2022-10-26T23:11:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.