CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation
- URL: http://arxiv.org/abs/2511.17755v1
- Date: Fri, 21 Nov 2025 20:14:55 GMT
- Title: CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation
- Authors: Prantik Howlader, Hoang Nguyen-Canh, Srijan Das, Jingyi Xu, Hieu Le, Dimitris Samaras
- Abstract summary: Reasoning segmentation seeks pixel-accurate masks for targets referenced by complex, often implicit instructions. We present CORA, a semi-supervised reasoning segmentation framework that jointly learns from limited labeled data and a large corpus of unlabeled images. CORA achieves state-of-the-art results, requiring as few as 100 labeled images on Cityscapes, a benchmark dataset for urban scene understanding.
- Score: 54.53371540755023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning segmentation seeks pixel-accurate masks for targets referenced by complex, often implicit instructions, requiring context-dependent reasoning over the scene. Recent multimodal language models have advanced instruction-following segmentation, yet generalization remains limited. The key bottleneck is the high cost of curating diverse, high-quality pixel annotations paired with rich linguistic supervision, which leads to brittle performance under distribution shift. Therefore, we present CORA, a semi-supervised reasoning segmentation framework that jointly learns from limited labeled data and a large corpus of unlabeled images. CORA introduces three main components: 1) conditional visual instructions that encode spatial and contextual relationships between objects; 2) a noisy pseudo-label filter based on the consistency of a multimodal LLM's outputs across semantically equivalent queries; and 3) a token-level contrastive alignment between labeled and pseudo-labeled samples to enhance feature consistency. These components enable CORA to perform robust reasoning segmentation with minimal supervision, outperforming existing baselines under constrained annotation settings. CORA achieves state-of-the-art results, requiring as few as 100 labeled images on Cityscapes, a benchmark dataset for urban scene understanding, surpassing the baseline by $+2.3\%$. Similarly, CORA improves performance by $+2.4\%$ with only 180 labeled images on PanNuke, a histopathology dataset.
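The abstract's second component, filtering noisy pseudo-labels by checking agreement across semantically equivalent queries, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`iou`, `filter_pseudo_label`), the pairwise mean-IoU criterion, the majority-vote aggregation, and the threshold `tau` are all hypothetical choices made here for clarity.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

def filter_pseudo_label(masks: list, tau: float = 0.8):
    """Keep a pseudo-label only if the masks predicted for semantically
    equivalent query paraphrases agree (mean pairwise IoU >= tau).
    Returns a majority-vote mask, or None if the sample is filtered out.
    (Illustrative criterion; the paper's actual filter may differ.)"""
    n = len(masks)
    scores = [iou(masks[i], masks[j])
              for i in range(n) for j in range(i + 1, n)]
    if scores and np.mean(scores) < tau:
        return None  # inconsistent across paraphrases -> discard
    # aggregate the paraphrase masks by per-pixel majority vote
    return np.mean(np.stack(masks).astype(float), axis=0) >= 0.5
```

In this sketch, a pseudo-label survives only when the model segments essentially the same region no matter how the instruction is phrased; disagreement is treated as a signal that the model is guessing and the sample is dropped from training.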
Related papers
- PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation [5.862480696321742]
Weakly supervised learning with scribble annotations uses sparse user-drawn strokes to indicate segmentation labels on a small subset of pixels. Recent scribble-based approaches in medical image segmentation address this limitation using pseudo-label-based training. We propose PLESS, a generic pseudo-label enhancement strategy which improves reliability and spatial consistency.
arXiv Detail & Related papers (2026-02-12T06:24:05Z) - PANC: Prior-Aware Normalized Cut for Object Segmentation [0.0]
We propose a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens. We report strong results on homogeneous, fine-grained, and texture-limited domains. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
arXiv Detail & Related papers (2026-02-06T18:07:20Z) - DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation [12.044632781901088]
Weakly supervised 3D instance segmentation is essential for 3D scene understanding. Existing methods rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations. We propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework.
arXiv Detail & Related papers (2025-11-13T06:12:13Z) - LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [54.683384204063934]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension. We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation. LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z) - HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment [16.926158907882012]
We propose a unified Vision-Language framework that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network. Our results show that language-guided segmentation bridges the label efficiency gap and enables new levels of fine-grained generalization.
arXiv Detail & Related papers (2025-06-16T19:05:33Z) - Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts [64.93416171745693]
ThinkFirst is a training-free reasoning segmentation framework. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process.
arXiv Detail & Related papers (2025-03-10T16:26:11Z) - Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation [15.941958367737408]
Seg-TTO is a framework for zero-shot, open-vocabulary semantic segmentation. We focus on segmentation-specific test-time optimization to address this gap. Seg-TTO demonstrates clear performance improvements (up to a 27% mIoU increase on some datasets), establishing a new state of the art.
arXiv Detail & Related papers (2025-01-08T18:58:24Z) - From Few to More: Scribble-based Medical Image Segmentation via Masked Context Modeling and Continuous Pseudo Labels [46.949484720513674]
We propose MaCo, a weakly supervised model designed for medical image segmentation. We evaluate MaCo on three public datasets, comparing it with other weakly supervised methods.
arXiv Detail & Related papers (2024-08-23T03:19:20Z) - African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification [53.89380284760555]
FOCI (Fine-grained Object ClassIfication) is a difficult multiple-choice benchmark for fine-grained object classification.
FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k.
arXiv Detail & Related papers (2024-06-20T16:59:39Z) - Pointly-Supervised Panoptic Segmentation [106.68888377104886]
We propose a new approach to applying point-level annotations for weakly-supervised panoptic segmentation.
Instead of the dense pixel-level labels used by fully supervised methods, point-level labels only provide a single point for each target as supervision.
We formulate the problem in an end-to-end framework by simultaneously generating panoptic pseudo-masks from point-level labels and learning from them.
arXiv Detail & Related papers (2022-10-25T12:03:51Z) - Unsupervised Semantic Segmentation by Distilling Feature Correspondences [94.73675308961944]
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation.
We present STEGO, a novel framework that distills unsupervised features into high-quality discrete semantic labels.
STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff and Cityscapes challenges.
arXiv Detail & Related papers (2022-03-16T06:08:47Z) - Towards Single Stage Weakly Supervised Semantic Segmentation [2.28438857884398]
We present a single-stage approach to weakly supervised semantic segmentation.
We use point annotations to generate reliable, on-the-fly pseudo-masks.
We significantly outperform other SOTA WSSS methods on recent real-world datasets.
arXiv Detail & Related papers (2021-06-18T18:34:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.