SCHNet: SAM Marries CLIP for Human Parsing
- URL: http://arxiv.org/abs/2503.22237v1
- Date: Fri, 28 Mar 2025 08:40:06 GMT
- Title: SCHNet: SAM Marries CLIP for Human Parsing
- Authors: Kunliang Liu, Jianming Wang, Rize Jin, Wonjun Hwang, Tae-Sun Chung
- Abstract summary: The Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training model (CLIP) have shown promising performance for segmentation and detection tasks. We formulate highly efficient modules that effectively integrate SAM and CLIP features to benefit human parsing.
- Score: 11.299133502596517
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Foundation Models (VFMs) such as the Segment Anything Model (SAM) and the Contrastive Language-Image Pre-training model (CLIP) have shown promising performance for segmentation and detection tasks. However, although SAM excels in fine-grained segmentation, it faces major challenges when applied to semantic-aware segmentation. While CLIP exhibits strong semantic understanding by aligning the global features of language and vision, it falls short on fine-grained segmentation tasks. Human parsing requires segmenting human bodies into constituent parts and involves both accurate fine-grained segmentation and high-level semantic understanding of each part. Based on these traits of SAM and CLIP, we formulate highly efficient modules that effectively integrate their features to benefit human parsing. We propose a Semantic-Refinement Module to integrate the semantic features of CLIP with SAM features to benefit parsing. Moreover, we formulate a highly efficient Fine-tuning Module to adapt the pretrained SAM for human parsing, which demands high-level semantic information and spatial details simultaneously; this significantly reduces training time compared with full training while achieving notable performance. Extensive experiments demonstrate the effectiveness of our method on the LIP, PPP, and CIHP datasets.
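The fusion idea the abstract describes (injecting CLIP's global semantic embedding into SAM's spatial patch features before decoding) can be sketched in a few lines. This is a minimal illustration only: the tensor shapes, the gated-residual fusion, and the function name `semantic_refine` are assumptions for exposition, not the paper's actual Semantic-Refinement Module.

```python
import numpy as np

def semantic_refine(sam_feats, clip_feat, w_proj, w_gate):
    """Illustrative fusion of a CLIP global embedding into SAM patch features.

    sam_feats: (N, d_sam)       SAM patch tokens (spatial detail)
    clip_feat: (d_clip,)        CLIP global embedding (semantics)
    w_proj:    (d_clip, d_sam)  projects CLIP features into SAM's space
    w_gate:    (d_sam,)         per-channel gate weights
    Returns (N, d_sam): SAM features with a gated semantic residual added.
    """
    sem = clip_feat @ w_proj                       # (d_sam,) semantic vector
    gate = 1.0 / (1.0 + np.exp(-(sem * w_gate)))   # sigmoid gate per channel
    return sam_feats + gate * sem                  # broadcast residual fusion

rng = np.random.default_rng(0)
out = semantic_refine(
    rng.standard_normal((196, 256)),        # e.g. a 14x14 SAM patch grid
    rng.standard_normal(512),               # CLIP global embedding
    rng.standard_normal((512, 256)) * 0.02, # small random projection
    rng.standard_normal(256),
)
print(out.shape)  # (196, 256)
```

The residual form leaves SAM's fine-grained features intact while letting the gate decide, per channel, how much CLIP semantics to mix in; in a trained model `w_proj` and `w_gate` would be learned.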
Related papers
- Cross-Domain Semantic Segmentation with Large Language Model-Assisted Descriptor Generation [0.0]
LangSeg is a novel semantic segmentation method that leverages context-sensitive, fine-grained subclass descriptors. We evaluate LangSeg on two challenging datasets, ADE20K and COCO-Stuff, where it outperforms state-of-the-art models.
arXiv Detail & Related papers (2025-01-27T20:02:12Z)
- Effective SAM Combination for Open-Vocabulary Semantic Segmentation [24.126307031048203]
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes.
ESC-Net is a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation.
ESC-Net achieves superior performance on standard benchmarks, including ADE20K, PASCAL-VOC, and PASCAL-Context.
arXiv Detail & Related papers (2024-11-22T04:36:12Z)
- Adapting Segment Anything Model for Unseen Object Instance Segmentation [70.60171342436092]
Unseen Object Instance Segmentation (UOIS) is crucial for autonomous robots operating in unstructured environments.
We propose UOIS-SAM, a data-efficient solution for the UOIS task.
UOIS-SAM integrates two key components: (i) a Heatmap-based Prompt Generator (HPG) to generate class-agnostic point prompts with precise foreground prediction, and (ii) a Hierarchical Discrimination Network (HDNet) that adapts SAM's mask decoder.
arXiv Detail & Related papers (2024-09-23T19:05:50Z)
- PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model [49.80313655590392]
PSALM is a powerful extension of the Large Multi-modal Model (LMM) that addresses the challenges of segmentation tasks.
It incorporates a mask decoder and a well-designed input schema to handle a variety of segmentation tasks.
The flexible design of PSALM supports joint training across multiple datasets and tasks, leading to improved performance and task generalization.
arXiv Detail & Related papers (2024-03-21T17:50:47Z)
- PosSAM: Panoptic Open-vocabulary Segment Anything [58.72494640363136]
PosSAM is an open-vocabulary panoptic segmentation model that unifies the strengths of the Segment Anything Model (SAM) with the vision-native CLIP model in an end-to-end framework.
We introduce a Mask-Aware Selective Ensembling (MASE) algorithm that adaptively enhances the quality of generated masks and boosts the performance of open-vocabulary classification during inference for each image.
arXiv Detail & Related papers (2024-03-14T17:55:03Z)
- Task-Specific Adaptation of Segmentation Foundation Model via Prompt Learning [7.6136466242670435]
We propose a task-specific adaptation of the segmentation foundation model via prompt learning tailored to the Segment Anything Model (SAM).
Our method involves a prompt learning module which adjusts input prompts in the embedding space to better align with the peculiarities of the target task.
Experimental results on various customized segmentation scenarios demonstrate the effectiveness of the proposed method.
arXiv Detail & Related papers (2024-03-14T09:13:51Z)
- ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation [5.376142948115328]
We propose a CLIP and SAM collaboration framework called ClipSAM for ZSAS.
The insight behind ClipSAM is to employ CLIP's semantic understanding capability for anomaly localization and rough segmentation.
In detail, we introduce a crucial Unified Multi-scale Cross-modal Interaction (UMCI) module for interacting with visual features.
arXiv Detail & Related papers (2024-01-23T11:20:03Z)
- Semantic-aware SAM for Point-Prompted Instance Segmentation [29.286913777078116]
In this paper, we introduce a cost-effective category-specific segmenter using the Segment Anything Model (SAM).
To tackle this challenge, we have devised a Semantic-Aware Instance Network (SAPNet) that integrates Multiple Instance Learning (MIL) with matching capability and SAM with point prompts.
SAPNet strategically selects the most representative mask proposals generated by SAM to supervise segmentation, with a specific focus on object category information.
arXiv Detail & Related papers (2023-12-26T05:56:44Z)
- SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding [40.40630116715132]
The landscape of publicly available vision foundation models (VFMs) is expanding rapidly.
We introduce a simple recipe to efficiently merge VFMs into a unified model that absorbs their expertise.
By applying our method to SAM and CLIP, we obtain SAM-CLIP: a unified model that combines the capabilities of SAM and CLIP into a single vision transformer.
arXiv Detail & Related papers (2023-10-23T19:21:57Z)
- Semantic-SAM: Segment and Recognize Anything at Any Granularity [83.64686655044765]
We introduce Semantic-SAM, a universal image segmentation model that enables segmenting and recognizing anything at any desired granularity.
We consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts.
For the multi-granularity capability, we propose a multi-choice learning scheme during training, enabling each click to generate masks at multiple levels.
arXiv Detail & Related papers (2023-07-10T17:59:40Z)
- RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation [53.4319652364256]
This paper presents the RefSAM model, which explores the potential of SAM for referring video object segmentation.
Our proposed approach adapts the original SAM model to enhance cross-modality learning by employing a lightweight Cross-Modal MLP.
We employ a parameter-efficient tuning strategy to align and fuse the language and vision features effectively.
arXiv Detail & Related papers (2023-07-03T13:21:58Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.