LENS: Learning to Segment Anything with Unified Reinforced Reasoning
- URL: http://arxiv.org/abs/2508.14153v1
- Date: Tue, 19 Aug 2025 17:59:53 GMT
- Title: LENS: Learning to Segment Anything with Unified Reinforced Reasoning
- Authors: Lianghui Zhu, Bin Ouyang, Yuxuan Zhang, Tianheng Cheng, Rui Hu, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Li Yu, Wenyu Liu, Xinggang Wang,
- Abstract summary: We introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method GLaMM by up to 5.6%.
- Score: 38.582392908238866
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically ignore explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning framework that jointly optimizes the reasoning process and segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Using a publicly available 3-billion-parameter vision-language model, i.e., Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming the strong fine-tuned method, i.e., GLaMM, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning serves as a robust prior for text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models. Code is available at https://github.com/hustvl/LENS.
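The unified rewards described above span sentence-, box-, and segment-level cues. As a minimal sketch of how such a combined reward could look, the snippet below mixes a format check (standing in for the sentence-level cue), a box IoU, and a mask IoU; all function names, weights, and the exact reward terms are our own assumptions, not the paper's implementation:

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def mask_iou(pred, gt):
    """IoU of two boolean segmentation masks."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / union if union > 0 else 0.0

def unified_reward(has_rationale, pred_box, gt_box, pred_mask, gt_mask,
                   w_sent=0.2, w_box=0.3, w_seg=0.5):
    """Weighted sum of sentence-, box-, and segment-level rewards."""
    r_sent = 1.0 if has_rationale else 0.0   # e.g. a CoT format/presence check
    r_box = box_iou(pred_box, gt_box)        # box-level localization cue
    r_seg = mask_iou(pred_mask, gt_mask)     # segment-level mask-quality cue
    return w_sent * r_sent + w_box * r_box + w_seg * r_seg
```

In an RL loop, a scalar like this would score each sampled (rationale, box, mask) triple before the policy update.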
Related papers
- ECCO: Evidence-Driven Causal Reasoning for Compiler Optimization [9.85275171877854]
We introduce ECCO, a framework that bridges interpretable reasoning with search. We first propose a reverse-engineering methodology to construct a Chain-of-Thought dataset. We then design a collaborative inference mechanism in which the Large Language Model functions as a strategist.
arXiv Detail & Related papers (2026-01-23T01:23:20Z)
- Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification? [18.16727716373833]
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC). We propose ReFine-RFT, a framework that combines ensemble rewards with alg to constrain reasoning length while providing dense accuracy-oriented feedback.
arXiv Detail & Related papers (2026-01-11T17:07:47Z)
- A Reasoning Paradigm for Named Entity Recognition [16.86833034216367]
A reasoning framework is proposed for Named Entity Recognition. The framework consists of three stages: Chain-of-Thought (CoT) generation, CoT tuning, and reasoning enhancement. Experiments show that ReasoningNER demonstrates impressive cognitive ability in the NER task, achieving competitive performance.
arXiv Detail & Related papers (2025-11-15T01:31:43Z)
- Teaching Language Models to Reason with Tools [73.21700643314917]
We present Hint-Engineering, a new data-synthesis strategy that strategically injects diverse hints at optimal points within reasoning paths. CoRT significantly enhances efficiency, reducing token usage by approximately 30% for the 32B model and 50% for the 1.5B model.
arXiv Detail & Related papers (2025-10-23T08:41:44Z)
- CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text. CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z)
- SegDAC: Segmentation-Driven Actor-Critic for Visual Reinforcement Learning [56.73588655252369]
We propose SegDAC, an RL-driven actor-critic method for visual generalization and improved sample efficiency. SegDAC uses Segment Anything (SAM) for object-centric decomposition and YOLO-World to ground segments semantically via text prompts. Evaluating SegDAC on a challenging visual generalization benchmark using ManiSkill3, we demonstrate that it achieves significantly better visual generalization.
arXiv Detail & Related papers (2025-08-12T20:16:54Z)
- MSGCoOp: Multiple Semantic-Guided Context Optimization for Few-Shot Learning [0.8249694498830561]
We propose a Multiple Semantic-Guided Context Optimization (MSGCoOp) framework to enhance few-shot generalization. Our approach leverages an ensemble of parallel learnable context vectors to capture diverse semantic aspects. Experiments on 11 benchmark datasets show that MSGCoOp significantly improves base-to-novel generalization.
arXiv Detail & Related papers (2025-07-29T13:15:09Z)
- Enhancing Spatial Reasoning in Vision-Language Models via Chain-of-Thought Prompting and Reinforcement Learning [0.42855555838080844]
This study investigates the spatial reasoning capabilities of vision-language models (VLMs) through Chain-of-Thought prompting and reinforcement learning. We find that simple CoT formats, where the model generates a reasoning step before the answer, can harm the model's original performance. In contrast, structured multi-stage prompting based on scene graphs (SceneGraph CoT) significantly improves spatial reasoning accuracy.
arXiv Detail & Related papers (2025-07-06T10:51:12Z)
- Seg-R1: Segmentation Can Be Surprisingly Simple with Reinforcement Learning [38.375639439367255]
Seg-R1 is a preliminary exploration of using reinforcement learning to enhance the pixel-level understanding and reasoning capabilities of large multimodal models. We introduce Group Relative Policy Optimization into the segmentation domain, equipping the LMM with pixel-level comprehension. Seg-R1 achieves remarkable performance with purely RL-based training, reaching a 0.873 S-measure on COD10K without complex model modifications.
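Group Relative Policy Optimization estimates advantages without a learned value critic by normalizing each sampled completion's reward against its own sampling group. A minimal sketch of that normalization step (the function name is our own; this is not Seg-R1's code):

```python
import numpy as np

def grpo_advantages(group_rewards, eps=1e-8):
    """Group-relative advantages: normalize each sampled completion's
    reward by the mean and std of its group (no learned critic)."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)
```

Completions scoring above their group's mean get positive advantages and are reinforced; those below are suppressed.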
arXiv Detail & Related papers (2025-06-27T20:40:45Z)
- Reinforced Latent Reasoning for LLM-based Recommendation [92.56166822197919]
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities in complex problem-solving tasks. Existing methods typically rely on fine-tuning with explicit chain-of-thought (CoT) data. In this work, we explore an alternative approach that shifts from explicit CoT reasoning to compact, information-dense latent reasoning.
arXiv Detail & Related papers (2025-05-25T11:03:45Z)
- Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection [11.497620257835964]
We propose CCKT-Det, trained without any extra supervision. The proposed framework constructs a cyclic and dynamic knowledge transfer from language queries and visual region features extracted from vision-language models (VLMs). CCKT-Det consistently improves performance as the scale of VLMs increases, while adding only moderate overhead to the detector.
arXiv Detail & Related papers (2025-03-14T02:04:28Z)
- Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement [52.66700314820547]
Seg-Zero is a novel framework that demonstrates remarkable generalizability and derives explicit chain-of-thought reasoning through cognitive reinforcement. Seg-Zero is trained exclusively via reinforcement learning with GRPO and without explicit reasoning data. Experiments show that Seg-Zero-7B achieves a zero-shot performance of 57.5 on the ReasonSeg benchmark, surpassing the prior LISA-7B by 18%.
arXiv Detail & Related papers (2025-03-09T08:48:51Z)
- In-context Demonstration Matters: On Prompt Optimization for Pseudo-Supervision Refinement [71.60563181678323]
Large language models (LLMs) have achieved great success across diverse tasks, and fine-tuning is sometimes needed to further enhance generation quality. To handle these challenges, a direct solution is to generate "high-confidence" data from unsupervised downstream tasks. We propose a novel approach, the pseudo-supervised demonstrations aligned prompt optimization (PAPO) algorithm, which jointly refines both the prompt and the overall pseudo-supervision.
arXiv Detail & Related papers (2024-10-04T03:39:28Z)
- Gramian Attention Heads are Strong yet Efficient Vision Learners [26.79263390835444]
We introduce a novel architecture design that enhances expressiveness by incorporating multiple head classifiers (i.e., classification heads).
Our approach employs attention-based aggregation, utilizing pairwise feature similarity to enhance multiple lightweight heads with minimal resource overhead.
Our models eventually surpass state-of-the-art CNNs and ViTs in the accuracy-efficiency trade-off on ImageNet-1K.
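The attention-based aggregation via pairwise feature similarity can be sketched roughly as follows. This is an illustrative reconstruction under our own assumptions (row-wise softmax over a Gram matrix, then mean pooling), not the paper's exact head design:

```python
import numpy as np

def gramian_head(features):
    """Aggregate token features with weights derived from their pairwise
    similarity (Gram) matrix, then pool into one head descriptor."""
    # features: (n_tokens, dim)
    gram = features @ features.T                  # pairwise similarities
    w = np.exp(gram - gram.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)             # row-wise softmax
    attended = w @ features                       # similarity-weighted mixing
    return attended.mean(axis=0)                  # pooled descriptor
```

Several such lightweight heads could then feed separate classifiers, matching the multi-head design the summary describes.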
arXiv Detail & Related papers (2023-10-25T09:08:58Z)
- Learning What Not to Segment: A New Perspective on Few-Shot Segmentation [63.910211095033596]
Recently, few-shot segmentation (FSS) has been extensively developed.
This paper proposes a fresh and straightforward insight to alleviate the problem.
In light of the unique nature of the proposed approach, we also extend it to a more realistic but challenging setting.
arXiv Detail & Related papers (2022-03-15T03:08:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.