Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
- URL: http://arxiv.org/abs/2503.06520v2
- Date: Sat, 28 Jun 2025 11:01:08 GMT
- Title: Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement
- Authors: Yuqi Liu, Bohao Peng, Zhisheng Zhong, Zihao Yue, Fanbin Lu, Bei Yu, Jiaya Jia,
- Abstract summary: Seg-Zero is a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero is trained exclusively via reinforcement learning with GRPO and without explicit reasoning data. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%.
- Score: 52.66700314820547
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional methods for reasoning segmentation rely on supervised fine-tuning with categorical labels and simple descriptions, limiting their out-of-domain generalization and lacking explicit reasoning processes. To address these limitations, we propose Seg-Zero, a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero introduces a decoupled architecture consisting of a reasoning model and a segmentation model. The reasoning model interprets user intentions, generates explicit reasoning chains, and produces positional prompts, which are subsequently used by the segmentation model to generate precise pixel-level masks. We design a sophisticated reward mechanism that integrates both format and accuracy rewards to effectively guide optimization directions. Trained exclusively via reinforcement learning with GRPO and without explicit reasoning data, Seg-Zero achieves robust zero-shot generalization and exhibits emergent test-time reasoning capabilities. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%. This significant improvement highlights Seg-Zero's ability to generalize across domains while presenting an explicit reasoning process. Code is available at https://github.com/dvlab-research/Seg-Zero.
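The abstract describes a reward mechanism that combines a format reward (does the model emit a well-structured reasoning chain and positional prompt?) with an accuracy reward (does the predicted position match the ground truth?). The following is a minimal sketch of that idea; the tag layout, function names, and weights are illustrative assumptions, not the authors' implementation.

```python
import re

def format_reward(output: str) -> float:
    """1.0 if the output follows an assumed <think>...</think><answer>...</answer> layout, else 0.0."""
    pattern = r"<think>.+</think>\s*<answer>.+</answer>"
    return 1.0 if re.fullmatch(pattern, output.strip(), flags=re.DOTALL) else 0.0

def iou(box_a, box_b) -> float:
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, used as a stand-in accuracy signal."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def total_reward(output: str, pred_box, gt_box, w_format=0.5, w_acc=1.0) -> float:
    """Weighted sum of format and accuracy rewards; weights are hypothetical."""
    return w_format * format_reward(output) + w_acc * iou(pred_box, gt_box)
```

In a GRPO setup, a scalar reward of this shape would score each sampled completion in a group, and policy updates follow the group-relative advantages.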
Related papers
- GenSeg-R1: RL-Driven Vision-Language Grounding for Fine-Grained Referring Segmentation [0.0]
We study fine-grained referring image segmentation via a decoupled reason-then-segment pipeline. A vision-language model (VLM) receives an image and a natural-language query, reasons about the scene, and emits structured spatial prompts. A frozen promptable segmenter (SAM 2) converts these prompts into high-quality masks.
arXiv Detail & Related papers (2026-02-10T11:59:14Z) - How Does Prefix Matter in Reasoning Model Tuning? [57.69882799751655]
We fine-tune three R1 series models across four core capabilities: reasoning (mathematics), coding, safety, and factuality. Results show that prefix-conditioned SFT improves both safety and reasoning performance, yielding up to +6% higher Safe@1 accuracy.
arXiv Detail & Related papers (2026-01-04T18:04:23Z) - A Reasoning Paradigm for Named Entity Recognition [16.86833034216367]
A reasoning framework is proposed for Named Entity Recognition. The framework consists of three stages: Chain of Thought (CoT) generation, CoT tuning, and reasoning enhancement. Experiments show ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance.
arXiv Detail & Related papers (2025-11-15T01:31:43Z) - In-Token Rationality Optimization: Towards Accurate and Concise LLM Reasoning via Self-Feedback [38.915062716409686]
InTRO is a new framework that enables both token-level exploration and self-feedback for accurate and concise reasoning. InTRO consistently outperforms other baselines, raising solution accuracy by up to 20% relative to the base model. Its chains of thought are notably more concise, exhibiting reduced verbosity.
arXiv Detail & Related papers (2025-11-13T01:47:06Z) - LENS: Learning to Segment Anything with Unified Reinforced Reasoning [38.582392908238866]
We introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method GLaMM by up to 5.6%.
arXiv Detail & Related papers (2025-08-19T17:59:53Z) - Reinforcing Video Reasoning Segmentation to Think Before It Segments [67.5703457389657]
We introduce Veason-R1, a specialized LVLM for video reasoning segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought trajectories. We incorporate a holistic reward mechanism that enhances spatial alignment and temporal consistency. Veason-R1 achieves state-of-the-art performance on multiple benchmarks, surpassing prior art by significant margins.
arXiv Detail & Related papers (2025-08-15T15:34:56Z) - Open-world Point Cloud Semantic Segmentation: A Human-in-the-loop Framework [8.451270206964534]
Open-world point cloud semantic segmentation (OW-Seg) aims to predict point labels of both base and novel classes in real-world scenarios. We propose HOW-Seg, the first human-in-the-loop framework for OW-Seg. By leveraging sparse human annotations as guidance, HOW-Seg enables prototype-based segmentation for both base and novel classes.
arXiv Detail & Related papers (2025-08-07T01:20:41Z) - Abstract, Align, Predict: Zero-Shot Stance Detection via Cognitive Inductive Reasoning [6.709126599208497]
Zero-shot stance detection (ZSSD) aims to identify the stance of text toward previously unseen targets. Inspired by human cognitive reasoning, we propose the Cognitive Inductive Reasoning Framework (CIRF). Experiments on SemEval-2016, VAST, and COVID-19-Stance benchmarks show that CIRF establishes new state-of-the-art results.
arXiv Detail & Related papers (2025-06-16T13:28:37Z) - Writing-Zero: Bridge the Gap Between Non-verifiable Tasks and Verifiable Rewards [11.149294285483782]
We propose a unified RLVR-based training paradigm that bridges the gap between non-verifiable tasks and verifiable rewards. We introduce a writing-principle-based pairwise Generative Reward Model (GenRM) and a novel Bootstrapped Relative Policy Optimization (BRPO) algorithm. Our approach empowers LLMs to develop robust writing capabilities without supervised fine-tuning.
arXiv Detail & Related papers (2025-05-30T14:34:57Z) - PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z) - SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism.
Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance.
We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z) - Text-Guided Attention is All You Need for Zero-Shot Robustness in Vision-Language Models [64.67721492968941]
We propose a Text-Guided Attention for Zero-Shot Robustness (TGA-ZSR) framework.
Our goal is to maintain the generalization of the CLIP model and enhance its adversarial robustness.
Our method yields a 9.58% enhancement in zero-shot robust accuracy over the current state-of-the-art techniques.
arXiv Detail & Related papers (2024-10-29T07:15:09Z) - AlignZeg: Mitigating Objective Misalignment for Zero-shot Semantic Segmentation [123.88875931128342]
A serious issue that harms the performance of zero-shot visual recognition is named objective misalignment.
We propose a novel architecture named AlignZeg, which embodies a comprehensive improvement of the segmentation pipeline.
Experiments demonstrate that AlignZeg markedly enhances zero-shot semantic segmentation.
arXiv Detail & Related papers (2024-04-08T16:51:33Z) - Beyond Pixels: Enhancing LIME with Hierarchical Features and Segmentation Foundation Models [2.355460994057843]
LIME is a popular XAI framework for unraveling decision-making processes in vision machine-learning models. We introduce the DSEG-LIME (Data-Driven LIME) framework, featuring a data-driven segmentation for human-recognized feature generation. Our findings demonstrate that DSEG outperforms on several XAI metrics on pre-trained ImageNet models.
arXiv Detail & Related papers (2024-03-12T15:13:12Z) - HierarchicalContrast: A Coarse-to-Fine Contrastive Learning Framework for Cross-Domain Zero-Shot Slot Filling [4.1940152307593515]
Cross-domain zero-shot slot filling plays a vital role in leveraging source domain knowledge to learn a model.
Existing state-of-the-art zero-shot slot filling methods have limited generalization ability in the target domain.
We present a novel Hierarchical Contrastive Learning Framework (HiCL) for zero-shot slot filling.
arXiv Detail & Related papers (2023-10-13T14:23:33Z) - Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models.
ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods under the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z) - Self-Ensembling GAN for Cross-Domain Semantic Segmentation [107.27377745720243]
This paper proposes a self-ensembling generative adversarial network (SE-GAN) exploiting cross-domain data for semantic segmentation.
In SE-GAN, a teacher network and a student network constitute a self-ensembling model for generating semantic segmentation maps, which together with a discriminator, forms a GAN.
Despite its simplicity, we find SE-GAN can significantly boost the performance of adversarial training and enhance the stability of the model.
arXiv Detail & Related papers (2021-12-15T09:50:25Z) - Zero-Shot Semantic Segmentation via Spatial and Multi-Scale Aware Visual Class Embedding [0.0]
We propose a language-model-free zero-shot semantic segmentation framework, the Spatial and Multi-scale aware Visual Class Embedding Network (SM-VCENet).
In experiments, our SM-VCENet outperforms zero-shot semantic segmentation state-of-the-art by a relative margin.
arXiv Detail & Related papers (2021-11-30T07:39:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.