Through the Looking Glass: A Dual Perspective on Weakly-Supervised Few-Shot Segmentation
- URL: http://arxiv.org/abs/2508.16159v1
- Date: Fri, 22 Aug 2025 07:29:30 GMT
- Authors: Jiaqi Ma, Guo-Sen Xie, Fang Zhao, Zechao Li
- Abstract summary: Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes. This identical network design results in over-semantic homogenization. We propose a novel homologous but heterogeneous network to enhance complementarity and preserve semantic commonality. In the weakly-supervised few-shot semantic segmentation (WFSS) task, TLG achieves a 13.2% improvement on Pascal-5^i and a 9.7% improvement on COCO-20^i.
- Score: 46.635612270422655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Meta-learning aims to uniformly sample homogeneous support-query pairs, characterized by the same categories and similar attributes, and extract useful inductive biases through identical network architectures. However, this identical network design results in over-semantic homogenization. To address this, we propose a novel homologous but heterogeneous network. By treating support-query pairs as dual perspectives, we introduce heterogeneous visual aggregation (HA) modules to enhance complementarity while preserving semantic commonality. To further reduce semantic noise and amplify the uniqueness of heterogeneous semantics, we design a heterogeneous transfer (HT) module. Finally, we propose heterogeneous CLIP (HC) textual information to enhance the generalization capability of multimodal models. In the weakly-supervised few-shot semantic segmentation (WFSS) task, with only 1/24 of the parameters of existing state-of-the-art models, TLG achieves a 13.2% improvement on Pascal-5^i and a 9.7% improvement on COCO-20^i. To the best of our knowledge, TLG is also the first weakly supervised (image-level) model that outperforms fully supervised (pixel-level) models under the same backbone architectures. The code is available at https://github.com/jarch-ma/TLG.
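The abstract does not specify the internals of the HA/HT/HC modules, so no faithful implementation can be given here. As background, however, most few-shot segmentation pipelines (including WFSS variants) rest on support-query matching: a class prototype is pooled from support features under the (possibly weak) support mask, and each query location is scored against it. A minimal NumPy sketch of that generic mechanism, with hypothetical shapes and not TLG's actual method, might look like:

```python
import numpy as np

def masked_prototype(support_feats, support_mask):
    """Average support features over the foreground mask to get a class prototype.

    support_feats: (C, H, W) feature map from the support image.
    support_mask:  (H, W) binary mask (weak, e.g. derived from an image-level label).
    Returns a (C,) prototype vector.
    """
    w = support_mask.astype(float)
    denom = w.sum() + 1e-8  # avoid division by zero for empty masks
    return (support_feats * w).sum(axis=(1, 2)) / denom

def cosine_score_map(query_feats, proto):
    """Cosine similarity between every query location and the prototype.

    query_feats: (C, H, W) feature map from the query image.
    proto:       (C,) class prototype.
    Returns an (H, W) coarse segmentation score map.
    """
    q = query_feats / (np.linalg.norm(query_feats, axis=0, keepdims=True) + 1e-8)
    p = proto / (np.linalg.norm(proto) + 1e-8)
    # Contract the channel axis of both tensors -> (H, W) score map.
    return np.tensordot(p, q, axes=([0], [0]))
```

High scores in the resulting map mark query pixels that resemble the support-defined class; methods like TLG can be read as replacing this symmetric matching with heterogeneous, complementary branches for the two perspectives.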
Related papers
- DRFormer: A Dual-Regularized Bidirectional Transformer for Person Re-identification [37.61695934257133]
Fine-grained discriminative details and global semantic features can both contribute to solving person re-identification challenges. Vision foundation models excel at mining local textures, while vision-language models capture strong global semantic differences. We propose a framework that synergizes their strengths via a Dual-Regularized Bidirectional Transformer.
arXiv Detail & Related papers (2026-02-01T06:59:53Z) - UniMRSeg: Unified Modality-Relax Segmentation via Hierarchical Self-Supervised Compensation [104.59740403500132]
Multi-modal image segmentation faces real-world deployment challenges, as incomplete or corrupted modalities degrade performance. We propose a unified modality-relax segmentation network (UniMRSeg) built on hierarchical self-supervised compensation (HSSC). Our approach hierarchically bridges representation gaps between complete and incomplete modalities across the input, feature, and output levels.
arXiv Detail & Related papers (2025-09-19T17:29:25Z) - Parameter-free entropy-regularized multi-view clustering with hierarchical feature selection [3.8015092217142237]
This work introduces two complementary algorithms, AMVFCM-U and AAMVFCM-U, providing a unified parameter-free framework. AAMVFCM-U achieves up to 97% computational efficiency gains, reduces dimensionality to 0.45% of the original size, and automatically identifies critical view combinations.
arXiv Detail & Related papers (2025-08-07T15:36:59Z) - Self-Enhanced Image Clustering with Cross-Modal Semantic Consistency [57.961869351897384]
We propose a framework based on cross-modal semantic consistency for efficient image clustering. Our framework first builds a strong foundation via Cross-Modal Semantic Consistency. In the first stage, we train lightweight clustering heads to align with the rich semantics of the pre-trained model. In the second stage, we introduce a Self-Enhanced fine-tuning strategy.
arXiv Detail & Related papers (2025-08-02T08:12:57Z) - Learning Robust Heterogeneous Graph Representations via Contrastive-Reconstruction under Sparse Semantics [13.555683316315683]
Masked autoencoders (MAE) and contrastive learning (CL) are two prominent paradigms in graph self-supervised learning. This paper introduces HetCRF, a novel dual-channel self-supervised learning framework for heterogeneous graphs. HetCRF uses a two-stage aggregation strategy to adapt embedding semantics, making it compatible with both MAE and CL.
arXiv Detail & Related papers (2025-06-07T06:35:42Z) - DINOv2-powered Few-Shot Semantic Segmentation: A Unified Framework via Cross-Model Distillation and 4D Correlation Mining [30.564216896513596]
Few-shot semantic segmentation has gained increasing interest due to its generalization capability. Recent approaches have turned to foundation models to enhance representation transferability. We propose FS-DINO, which uses only DINOv2's encoder and a lightweight segmenter.
arXiv Detail & Related papers (2025-04-22T07:47:06Z) - Spatial Semantic Recurrent Mining for Referring Image Segmentation [63.34997546393106]
We propose S^2RM to achieve high-quality cross-modality fusion.
It follows a three-step strategy: distributing language features, spatial semantic recurrent coparsing, and parsed-semantic balancing.
Our proposed method performs favorably against other state-of-the-art algorithms.
arXiv Detail & Related papers (2024-05-15T00:17:48Z) - Enhancing Representations through Heterogeneous Self-Supervised Learning [61.40674648939691]
We propose Heterogeneous Self-Supervised Learning (HSSL), which enforces a base model to learn from an auxiliary head whose architecture is heterogeneous from the base model.
The HSSL endows the base model with new characteristics in a representation learning way without structural changes.
The HSSL is compatible with various self-supervised methods, achieving superior performances on various downstream tasks.
arXiv Detail & Related papers (2023-10-08T10:44:05Z) - Histopathology Whole Slide Image Analysis with Heterogeneous Graph Representation Learning [78.49090351193269]
We propose a novel graph-based framework to leverage the inter-relationships among different types of nuclei for WSI analysis.
Specifically, we formulate the WSI as a heterogeneous graph with a "nucleus-type" attribute on each node and a semantic-similarity attribute on each edge.
Our framework outperforms the state-of-the-art methods with considerable margins on various tasks.
arXiv Detail & Related papers (2023-07-09T14:43:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.