Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
- URL: http://arxiv.org/abs/2503.06520v2
- Date: Sat, 28 Jun 2025 11:01:08 GMT
- Title: Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
- Authors: Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia,
- Abstract summary: Seg-Zero is a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero is trained exclusively via reinforcement learning with GRPO and without explicit reasoning data. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%.
- Score: 52.66700314820547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at https://github.com/dvlab-research/Seg-Zero.
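The abstract describes a reward mechanism that combines a format reward (does the model emit a well-structured reasoning chain and positional prompt?) with an accuracy reward (does the predicted position match the ground truth?). The following is a minimal sketch of that idea; the tag layout, function names, and weights are illustrative assumptions, not the authors' implementation.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output follows an assumed <think>...</think><answer>...</answer> layout, else 0.0."""
    pattern = r"<think>.+</think>\s*<answer>.+</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), flags=re.DOTALL) else 0.0

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, used as a stand-in accuracy signal."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def total_reward(output: str, pred_box, gt_box, w_format=0.5, w_acc=1.0) -> float:
    """Weighted sum of format and accuracy rewards; weights are hypothetical."""
    return w_format * format_reward(output) + w_acc * iou(pred_box, gt_box)
```

In a GRPO setup, a scalar reward of this shape would score each sampled completion in a group, and policy updates follow the group-relative advantages.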
Related papers
- GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation [0.0]
We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks.
arXiv Detail & Related papers (2026-02-10T11:59:14Z) - How Does Prefix Matter in Reasoning Model Tuning? [57.69882799751655]
We fine-tune three R1 series models across four core capabilities: reasoning (mathematics), coding, safety, and factuality. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy.
arXiv Detail & Related papers (2026-01-04T18:04:23Z) - A Reasoning Paradigm for Named Entity Recognition [16.86833034216367]
A reasoning framework is proposed for Named Entity Recognition. The framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. Experiments show ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance.
arXiv Detail & Related papers (2025-11-15T01:31:43Z) - In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z) - LENS: Learning to Segment Anything with Unified Reinforced Reasoning [38.582392908238866]
We introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method GLaMM by up to 5.6%.
arXiv Detail & Related papers (2025-08-19T17:59:53Z) - Reinforcing Video Reasoning Segmentation to Think Before It Segments [67.5703457389657]
We introduce Veason-R1, a specialized LVLM for video reasoning segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought trajectories. We incorporate a holistic reward mechanism that enhances spatial alignment and temporal consistency. Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins.
arXiv Detail & Related papers (2025-08-15T15:34:56Z) - Open-world Point Cloud Semantic Segmentation: A Human-in-the-loop Framework [8.451270206964534]
Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels of both base and novel classes in real-world scenarios. We propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. By leveraging sparse human annotations as guidance, HOW-Seg enables prototype-based segmentation for both base and novel classes.
arXiv Detail & Related papers (2025-08-07T01:20:41Z) - Abstract, Align, Predict: Zero-Shot Stance Detection via Cognitive Inductive Reasoning [6.709126599208497]
Zero-shot stance detection (ZSSD) aims to identify the stance of text toward previously unseen targets. Inspired by human cognitive reasoning, we propose the Cognitive Inductive Reasoning Framework (CIRF). Experiments on SemEval-2016, VAST, and COVID-19-Stance benchmarks show that CIRF establishes new state-of-the-art results.
arXiv Detail & Related papers (2025-06-16T13:28:37Z) - Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards [11.149294285483782]
We propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning.
arXiv Detail & Related papers (2025-05-30T14:34:57Z) - PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z) - SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism.
Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance.
We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z) - Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models [64.67721492968941]
We propose a Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) framework.
Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness.
Our method yields a 9.58% enhancement in zero-shot robust accuracy over the current state-of-the-art techniques.
arXiv Detail & Related papers (2024-10-29T07:15:09Z) - AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation [123.88875931128342]
A serious issue that harms the performance of zero-shot visual recognition is named objective misalignment.
We propose a novel architecture named AlignZeg, which embodies a comprehensive improvement of the segmentation pipeline.
Experiments demonstrate that AlignZeg markedly enhances zero-shot semantic segmentation.
arXiv Detail & Related papers (2024-04-08T16:51:33Z) - Beyond Pixels: Enhancing LIME with Hierarchical Features and Segmentation Foundation Models [2.355460994057843]
LIME is a popular XAI framework for unraveling decision-making processes in vision machine-learning models. We introduce the DSEG-LIME (Data-Driven LIME) framework, featuring a data-driven segmentation for human-recognized feature generation. Our findings demonstrate that DSEG outperforms on several XAI metrics on pre-trained ImageNet models.
arXiv Detail & Related papers (2024-03-12T15:13:12Z) - HierarchicalContrast: A Coarse-to-Fine Contrastive Learning Framework for Cross-Domain Zero-Shot Slot Filling [4.1940152307593515]
Cross-domain zero-shot slot filling plays a vital role in leveraging source domain knowledge to learn a model.
Existing state-of-the-art zero-shot slot filling methods have limited generalization ability in the target domain.
We present a novel Hierarchical Contrastive Learning Framework (HiCL) for zero-shot slot filling.
arXiv Detail & Related papers (2023-10-13T14:23:33Z) - Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models.
ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z) - Self-Ensembling GAN for Cross-Domain Semantic Segmentation [107.27377745720243]
This paper proposes a self-ensembling generative adversarial network (SE-GAN) exploiting cross-domain data for semantic segmentation.
In SE-GAN, a teacher network and a student network constitute a self-ensembling model for generating semantic segmentation maps, which together with a discriminator, forms a GAN.
Despite its simplicity, we find SE-GAN can significantly boost the performance of adversarial training and enhance the stability of the model.
arXiv Detail & Related papers (2021-12-15T09:50:25Z) - Zero-Shot Semantic Segmentation via Spatial and Multi-Scale Aware Visual Class Embedding [0.0]
We propose a language-model-free zero-shot semantic segmentation framework, the Spatial and Multi-scale aware Visual Class Embedding Network (SM-VCENet).
In experiments, our SM-VCENet outperforms zero-shot semantic segmentation state-of-the-art by a relative margin.
arXiv Detail & Related papers (2021-11-30T07:39:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.