CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation
- URL: http://arxiv.org/abs/2511.17755v1
- Date: Fri, 21 Nov 2025 20:14:55 GMT
- Title: CORA: Consistency-Guided Semi-Supervised Framework for Reasoning Segmentation
- Authors: Prantik Howlader, Hoang Nguyen-Canh, Srijan Das, Jingyi Xu, Hieu Le, Dimitris Samaras
- Abstract summary: Reasoning segmentation seeks pixel-accurate masks for targets referenced by complex, often implicit instructions. We present CORA, a semi-supervised reasoning segmentation framework that jointly learns from limited labeled data and a large corpus of unlabeled images. CORA achieves state-of-the-art results, requiring as few as 100 labeled images on Cityscapes, a benchmark dataset for urban scene understanding.
- Score: 54.53371540755023
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reasoning segmentation seeks pixel-accurate masks for targets referenced by complex, often implicit instructions, requiring context-dependent reasoning over the scene. Recent multimodal language models have advanced instruction-following segmentation, yet generalization remains limited. The key bottleneck is the high cost of curating diverse, high-quality pixel annotations paired with rich linguistic supervision, which leads to brittle performance under distribution shift. Therefore, we present CORA, a semi-supervised reasoning segmentation framework that jointly learns from limited labeled data and a large corpus of unlabeled images. CORA introduces three main components: 1) conditional visual instructions that encode spatial and contextual relationships between objects; 2) a noisy pseudo-label filter based on the consistency of a multimodal LLM's outputs across semantically equivalent queries; and 3) a token-level contrastive alignment between labeled and pseudo-labeled samples to enhance feature consistency. These components enable CORA to perform robust reasoning segmentation with minimal supervision, outperforming existing baselines under constrained annotation settings. CORA achieves state-of-the-art results, requiring as few as 100 labeled images on Cityscapes, a benchmark dataset for urban scene understanding, surpassing the baseline by $+2.3\%$. Similarly, CORA improves performance by $+2.4\%$ with only 180 labeled images on PanNuke, a histopathology dataset.
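The abstract's second component, filtering noisy pseudo-labels by checking agreement across semantically equivalent queries, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names (`iou`, `filter_pseudo_label`), the pairwise mean-IoU criterion, the majority-vote aggregation, and the threshold `tau` are all hypothetical choices made here for clarity.

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union else 1.0

def filter_pseudo_label(masks: list, tau: float = 0.8):
    """Keep a pseudo-label only if the masks predicted for semantically
    equivalent query paraphrases agree (mean pairwise IoU >= tau).
    Returns a majority-vote mask, or None if the sample is filtered out.
    (Illustrative criterion; the paper's actual filter may differ.)"""
    n = len(masks)
    scores = [iou(masks[i], masks[j])
              for i in range(n) for j in range(i + 1, n)]
    if scores and np.mean(scores) < tau:
        return None  # inconsistent across paraphrases -> discard
    # aggregate the paraphrase masks by per-pixel majority vote
    return np.mean(np.stack(masks).astype(float), axis=0) >= 0.5
```

In this sketch, a pseudo-label survives only when the model segments essentially the same region no matter how the instruction is phrased; disagreement is treated as a signal that the model is guessing and the sample is dropped from training.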
Related papers
- PLESS: Pseudo-Label Enhancement with Spreading Scribbles for Weakly Supervised Segmentation [5.862480696321742]
Weakly supervised learning with scribble annotations uses sparse user-drawn strokes to indicate segmentation labels on a small subset of pixels. Recent scribble-based approaches in medical image segmentation address this limitation using pseudo-label-based training. We propose PLESS, a generic pseudo-label enhancement strategy which improves reliability and spatial consistency.
arXiv Detail & Related papers (2026-02-12T06:24:05Z) - PANC: Prior-Aware Normalized Cut for Object Segmentation [0.0]
We propose a weakly supervised spectral segmentation framework that uses a minimal set of annotated visual tokens. We report strong results on homogeneous, fine-grained, and texture-limited domains. For multi-object benchmarks, the framework showcases explicit, user-controllable semantic segmentation.
arXiv Detail & Related papers (2026-02-06T18:07:20Z) - DBGroup: Dual-Branch Point Grouping for Weakly Supervised 3D Instance Segmentation [12.044632781901088]
Weakly supervised 3D instance segmentation is essential for 3D scene understanding. Existing methods rely on two forms of weak supervision: one-thing-one-click annotations and bounding box annotations. We propose DBGroup, a two-stage weakly supervised 3D instance segmentation framework.
arXiv Detail & Related papers (2025-11-13T06:12:13Z) - LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance [54.683384204063934]
Large multi-modal models (LMMs) struggle with inaccurate segmentation and hallucinated comprehension. We propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation. LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks.
arXiv Detail & Related papers (2025-07-08T07:46:26Z) - HVL: Semi-Supervised Segmentation leveraging Hierarchical Vision-Language Synergy with Dynamic Text-Spatial Query Alignment [16.926158907882012]
We propose a unified Vision-Language framework that integrates domain-invariant text embeddings as object queries in a transformer-based segmentation network. Our results show that language-guided segmentation bridges the label efficiency gap and enables new levels of fine-grained generalization.
arXiv Detail & Related papers (2025-06-16T19:05:33Z) - Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts [64.93416171745693]
ThinkFirst is a training-free reasoning segmentation framework. Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image. This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process.
arXiv Detail & Related papers (2025-03-10T16:26:11Z) - Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation [15.941958367737408]
Seg-TTO is a framework for zero-shot, open-vocabulary semantic segmentation. We focus on segmentation-specific test-time optimization to address this gap. Seg-TTO demonstrates clear performance improvements (up to a 27% mIoU increase on some datasets), establishing a new state of the art.
arXiv Detail & Related papers (2025-01-08T18:58:24Z) - From Few to More: Scribble-based Medical Image Segmentation via Masked Context Modeling and Continuous Pseudo Labels [46.949484720513674]
We propose MaCo, a weakly supervised model designed for medical image segmentation. We evaluate MaCo on three public datasets, comparing it with other weakly supervised methods.
arXiv Detail & Related papers (2024-08-23T03:19:20Z) - African or European Swallow? Benchmarking Large Vision-Language Models for Fine-Grained Object Classification [53.89380284760555]
FOCI (Fine-grained Object ClassIfication) is a difficult multiple-choice benchmark for fine-grained object classification.
FOCI complements five popular classification datasets with four domain-specific subsets from ImageNet-21k.
arXiv Detail & Related papers (2024-06-20T16:59:39Z) - Pointly-Supervised Panoptic Segmentation [106.68888377104886]
We propose a new approach to applying point-level annotations for weakly-supervised panoptic segmentation.
Instead of the dense pixel-level labels used by fully supervised methods, point-level labels only provide a single point for each target as supervision.
We formulate the problem in an end-to-end framework by simultaneously generating panoptic pseudo-masks from point-level labels and learning from them.
arXiv Detail & Related papers (2022-10-25T12:03:51Z) - Unsupervised Semantic Segmentation by Distilling Feature Correspondences [94.73675308961944]
Unsupervised semantic segmentation aims to discover and localize semantically meaningful categories within image corpora without any form of annotation.
We present STEGO, a novel framework that distills unsupervised features into high-quality discrete semantic labels.
STEGO yields a significant improvement over the prior state of the art, on both the CocoStuff and Cityscapes challenges.
arXiv Detail & Related papers (2022-03-16T06:08:47Z) - Towards Single Stage Weakly Supervised Semantic Segmentation [2.28438857884398]
We present a single-stage approach to weakly supervised semantic segmentation.
We use point annotations to generate reliable, on-the-fly pseudo-masks.
We significantly outperform other SOTA WSSS methods on recent real-world datasets.
arXiv Detail & Related papers (2021-06-18T18:34:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.