SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches
- URL: http://arxiv.org/abs/2501.16022v2
- Date: Mon, 17 Mar 2025 21:44:42 GMT
- Title: SketchYourSeg: Mask-Free Subjective Image Segmentation via Freehand Sketches
- Authors: Subhadeep Koley, Viswanatha Reddy Gajjala, Aneeshan Sain, Pinaki Nath Chowdhury, Tao Xiang, Ayan Kumar Bhunia, Yi-Zhe Song
- Abstract summary: SketchYourSeg establishes freehand sketches as a powerful query modality for subjective image segmentation. Our evaluations demonstrate superior performance over existing approaches across diverse benchmarks.
- Score: 116.1810651297801
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We introduce SketchYourSeg, a novel framework that establishes freehand sketches as a powerful query modality for subjective image segmentation across entire galleries through a single exemplar sketch. Unlike text prompts that struggle with spatial specificity or interactive methods confined to single-image operations, sketches naturally combine semantic intent with structural precision. This unique dual encoding enables precise visual disambiguation for segmentation tasks where text descriptions would be cumbersome or ambiguous -- such as distinguishing between visually similar instances, specifying exact part boundaries, or indicating spatial relationships in composed concepts. Our approach addresses three fundamental challenges: (i) eliminating the need for pixel-perfect annotation masks during training with a mask-free framework; (ii) creating a synergistic relationship between sketch-based image retrieval (SBIR) models and foundation models (CLIP/DINOv2) where the former provides training signals while the latter generates masks; and (iii) enabling multi-granular segmentation capabilities through purpose-made sketch augmentation strategies. Our extensive evaluations demonstrate superior performance over existing approaches across diverse benchmarks, establishing a new paradigm for user-guided image segmentation that balances precision with efficiency.
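Below is a minimal, hypothetical sketch of how such a sketch-queried gallery pipeline could be wired together: one exemplar sketch embedding is matched against region proposals in every gallery image, and the best-scoring mask is returned per image. The encoders and proposal generator are assumed to exist elsewhere (e.g., produced by CLIP/DINOv2-style models); this is an illustration of the idea, not the authors' implementation.

```python
import numpy as np

def segment_gallery(sketch_emb, gallery):
    """Given one sketch embedding, pick the best-matching region mask in
    every gallery image. `gallery` is a list of (region_masks, region_embs)
    pairs, e.g. class-agnostic proposals from a foundation model
    (hypothetical pre-computed inputs)."""
    results = []
    q = sketch_emb / np.linalg.norm(sketch_emb)            # unit-norm query
    for region_masks, region_embs in gallery:
        r = region_embs / np.linalg.norm(region_embs, axis=1, keepdims=True)
        scores = r @ q                                      # cosine similarity per region
        best = int(np.argmax(scores))                       # region that best matches the sketch
        results.append((region_masks[best], float(scores[best])))
    return results

# Toy usage: 2 images, each with 3 candidate region masks of size 4x4.
rng = np.random.default_rng(0)
sketch_emb = rng.normal(size=64)
gallery = [(rng.integers(0, 2, size=(3, 4, 4)), rng.normal(size=(3, 64)))
           for _ in range(2)]
for mask, score in segment_gallery(sketch_emb, gallery):
    print(mask.shape, round(score, 3))
```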
Related papers
- HiMTok: Learning Hierarchical Mask Tokens for Image Segmentation with Large Multimodal Model [6.641903410779405]
We propose the Hierarchical Mask Tokenizer (HiMTok), which represents segmentation masks with up to 32 tokens.
HiMTok allows for compact and coarse-to-fine mask representations, aligning well with the next-token-prediction paradigm.
We develop a 3-stage training recipe for progressive learning of segmentation and visual capabilities, featuring a hierarchical mask loss for effective coarse-to-fine learning.
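As a rough illustration of the coarse-to-fine idea, the toy decoder below rebuilds a mask from a short token sequence in which each token edits one cell of a progressively finer grid; truncating the sequence still yields a valid, coarser mask. The token format is hypothetical and not HiMTok's actual tokenizer.

```python
import numpy as np

def decode_mask(tokens, size=16):
    """Decode a binary mask from a short coarse-to-fine token sequence.
    Each token is (level, row, col, value): it sets one cell of a
    2**level x 2**level grid, upsampled to `size` x `size`. Later (finer)
    tokens overwrite earlier (coarser) ones."""
    mask = np.zeros((size, size), dtype=np.uint8)
    for level, row, col, value in tokens:
        cell = size // (2 ** level)              # side length of one grid cell at this level
        mask[row * cell:(row + 1) * cell, col * cell:(col + 1) * cell] = value
    return mask

# Toy usage: a coarse token roughs in the object, finer tokens refine it.
tokens = [(1, 0, 0, 1),      # level 1: fill the top-left quadrant
          (2, 1, 1, 0),      # level 2: carve a hole at a finer resolution
          (3, 2, 2, 1)]      # level 3: add back a small detail
print(decode_mask(tokens).sum(), "foreground pixels")
```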
arXiv Detail & Related papers (2025-03-17T10:29:08Z)
- BEN: Using Confidence-Guided Matting for Dichotomous Image Segmentation [0.0]
We propose a new architectural approach for image segmentation called Confidence-Guided Matting (CGM).
BEN comprises two components: BEN Base for initial segmentation and BEN Refiner for confidence refinement.
Our approach achieves substantial improvements over current state-of-the-art methods on the DIS5K validation dataset.
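A toy sketch of the two-stage idea: keep the base model's confident pixels and re-estimate only the uncertain band with a refiner. The thresholds and stand-in refiner below are hypothetical, not the BEN architecture.

```python
import numpy as np

def confidence_guided_matte(base_prob, refine_fn, low=0.2, high=0.8):
    """Keep confident base predictions; re-estimate only the uncertain band.
    `base_prob` is a per-pixel foreground probability from a base segmenter;
    `refine_fn` is any callable returning refined probabilities (both are
    hypothetical stand-ins for BEN Base / BEN Refiner)."""
    uncertain = (base_prob > low) & (base_prob < high)   # confidence band to refine
    out = base_prob.copy()
    out[uncertain] = refine_fn(base_prob, uncertain)[uncertain]
    return (out >= 0.5).astype(np.uint8), uncertain

# Toy usage with a fake refiner that sharpens probabilities toward 0 or 1.
rng = np.random.default_rng(1)
base = rng.random((8, 8))
sharpen = lambda p, m: np.where(p >= 0.5, 1.0, 0.0)
mask, band = confidence_guided_matte(base, sharpen)
print(mask.sum(), "foreground pixels;", band.sum(), "pixels refined")
```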
arXiv Detail & Related papers (2025-01-08T01:30:11Z)
- Early Fusion of Features for Semantic Segmentation [10.362589129094975]
This paper introduces a novel segmentation framework that integrates a classifier network with a reverse HRNet architecture for efficient image segmentation.
Our methodology is rigorously tested across several benchmark datasets including Mapillary Vistas, Cityscapes, CamVid, COCO, and PASCAL-VOC2012.
The results demonstrate the effectiveness of our proposed model in achieving high segmentation accuracy, indicating its potential for various applications in image analysis.
arXiv Detail & Related papers (2024-02-08T22:58:06Z)
- Generalizable Entity Grounding via Assistance of Large Language Model [77.07759442298666]
We propose a novel approach to densely ground visual entities from a long caption.
We leverage a large multimodal model to extract semantic nouns, a class-agnostic segmentation model to generate entity-level segmentation, and a multi-modal feature fusion module to associate each semantic noun with its corresponding segmentation mask.
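A rough sketch of the association step: pool image features inside each candidate mask and assign every noun to the mask whose pooled embedding is most similar. The inputs below stand in for the LLM-extracted nouns, the class-agnostic masks, and the fused features.

```python
import numpy as np

def ground_nouns(noun_embs, masks, pixel_feats):
    """Assign each semantic noun to the candidate mask whose mask-pooled
    feature is most similar to the noun embedding.
    noun_embs: (N, D) text embeddings, masks: (M, H, W) binary proposals,
    pixel_feats: (H, W, D) image features (all hypothetical inputs)."""
    pooled = np.stack([pixel_feats[m > 0].mean(axis=0) for m in masks])  # (M, D)
    pooled /= np.linalg.norm(pooled, axis=1, keepdims=True)
    nouns = noun_embs / np.linalg.norm(noun_embs, axis=1, keepdims=True)
    sim = nouns @ pooled.T                         # (N, M) cosine similarities
    return sim.argmax(axis=1)                      # best mask index per noun

# Toy usage: 2 nouns, 3 masks on an 8x8 feature map with 16-dim features.
rng = np.random.default_rng(2)
masks = (rng.random((3, 8, 8)) > 0.5).astype(np.uint8)
print(ground_nouns(rng.normal(size=(2, 16)), masks, rng.normal(size=(8, 8, 16))))
```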
arXiv Detail & Related papers (2024-02-04T16:06:05Z)
- Pixel-Level Clustering Network for Unsupervised Image Segmentation [3.69853388955692]
We present a pixel-level clustering framework for segmenting images into regions without using ground truth annotations.
We also propose a training strategy that utilizes intra-consistency within each superpixel, inter-similarity/dissimilarity between neighboring superpixels, and structural similarity between images.
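A simplified sketch of the intra-superpixel consistency term: pixel embeddings are pulled toward their superpixel's mean embedding. The feature map and superpixel labels are placeholders, and the inter-superpixel and cross-image terms are omitted.

```python
import numpy as np

def intra_superpixel_loss(feats, sp_labels):
    """Mean squared distance between each pixel embedding and the mean
    embedding of its superpixel; minimizing it encourages pixels within a
    superpixel to share a cluster assignment.
    feats: (H, W, D) pixel embeddings, sp_labels: (H, W) superpixel ids."""
    loss, count = 0.0, 0
    for sp in np.unique(sp_labels):
        members = feats[sp_labels == sp]            # (K, D) embeddings in this superpixel
        center = members.mean(axis=0)
        loss += ((members - center) ** 2).sum()
        count += members.shape[0]
    return loss / count

# Toy usage on a random feature map with 4 superpixels.
rng = np.random.default_rng(3)
feats = rng.normal(size=(8, 8, 16))
sp = rng.integers(0, 4, size=(8, 8))
print(round(intra_superpixel_loss(feats, sp), 3))
```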
arXiv Detail & Related papers (2023-10-24T23:06:29Z)
- Exploring Open-Vocabulary Semantic Segmentation without Human Labels [76.15862573035565]
We present ZeroSeg, a novel method that leverages existing pretrained vision-language (VL) models to train semantic segmentation models.
ZeroSeg distills the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image.
Our approach achieves state-of-the-art performance when compared to other zero-shot segmentation methods given the same training data.
arXiv Detail & Related papers (2023-06-01T08:47:06Z)
- CM-MaskSD: Cross-Modality Masked Self-Distillation for Referring Image Segmentation [29.885991324519463]
We propose a novel cross-modality masked self-distillation framework named CM-MaskSD.
Our method inherits the transferred knowledge of image-text semantic alignment from the CLIP model to realize fine-grained patch-word feature alignment.
Our framework can considerably boost model performance in a nearly parameter-free manner.
arXiv Detail & Related papers (2023-05-19T07:17:27Z)
- CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP.
Our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders.
arXiv Detail & Related papers (2023-03-21T12:28:21Z)
- ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency [126.88107868670767]
We propose multi-View Consistent learning (ViewCo) for text-supervised semantic segmentation.
We first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
We also propose cross-view segmentation consistency modeling to address the ambiguity issue of text supervision.
arXiv Detail & Related papers (2023-01-31T01:57:52Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without requiring any dense annotations.
On three benchmark datasets, our method can directly segment objects of arbitrary categories and outperforms zero-shot segmentation methods that require data labeling.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
- A Simple Baseline for Zero-shot Semantic Segmentation with Pre-trained Vision-language Model [61.58071099082296]
It is unclear how to make zero-shot recognition work well on broader vision problems, such as object detection and semantic segmentation.
In this paper, we target zero-shot semantic segmentation by building on an off-the-shelf pre-trained vision-language model, i.e., CLIP.
Our experimental results show that this simple framework surpasses the previous state of the art by a large margin.
arXiv Detail & Related papers (2021-12-29T18:56:18Z)
- One Sketch for All: One-Shot Personalized Sketch Segmentation [84.45203849671003]
We present the first one-shot personalized sketch segmentation method.
We aim to segment all sketches belonging to the same category using a single exemplar sketch with a given part annotation.
Our method preserves the part semantics embedded in the exemplar and is robust to input style and abstraction.
arXiv Detail & Related papers (2021-12-20T20:10:44Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the state of the art without any post-processing.
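A bare-bones sketch of a text-to-pixel objective in this spirit: per-pixel cosine similarity to the sentence embedding is supervised with binary cross-entropy against the referred region. This is a generic stand-in, not the CRIS training code.

```python
import numpy as np

def text_to_pixel_loss(text_emb, pixel_embs, gt_mask, tau=0.07):
    """Binary cross-entropy over per-pixel cosine similarity with the
    sentence embedding: pixels inside the referred region should score high,
    pixels outside should score low (generic stand-in objective).
    text_emb: (D,), pixel_embs: (H, W, D), gt_mask: (H, W) in {0, 1}."""
    t = text_emb / np.linalg.norm(text_emb)
    p = pixel_embs / np.linalg.norm(pixel_embs, axis=-1, keepdims=True)
    logits = (p @ t) / tau                        # (H, W) similarity logits
    prob = 1.0 / (1.0 + np.exp(-logits))          # sigmoid
    eps = 1e-8
    bce = -(gt_mask * np.log(prob + eps) + (1 - gt_mask) * np.log(1 - prob + eps))
    return bce.mean()

# Toy usage on random embeddings and a random target mask.
rng = np.random.default_rng(4)
print(round(text_to_pixel_loss(rng.normal(size=32),
                               rng.normal(size=(8, 8, 32)),
                               (rng.random((8, 8)) > 0.5).astype(float)), 3))
```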
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
- CrossATNet - A Novel Cross-Attention Based Framework for Sketch-Based Image Retrieval [30.249581102239645]
We propose a novel framework for cross-modal zero-shot learning (ZSL) in the context of sketch-based image retrieval (SBIR).
While we define a cross-modal triplet loss to ensure the discriminative nature of the shared space, an innovative cross-modal attention learning strategy is also proposed to guide feature extraction from the image domain.
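For reference, a minimal cross-modal triplet loss of the kind mentioned: a sketch anchor is pulled toward a matching photo embedding and pushed away from a non-matching one in the shared space. This is the generic formulation, not CrossATNet's exact loss or its attention module.

```python
import numpy as np

def cross_modal_triplet(sketch, photo_pos, photo_neg, margin=0.2):
    """Standard margin-based triplet loss with a sketch anchor and photo
    positive/negative embeddings in a shared space (all unit-normalized)."""
    def norm(x):
        return x / np.linalg.norm(x)
    a, p, n = norm(sketch), norm(photo_pos), norm(photo_neg)
    d_pos = np.linalg.norm(a - p)                 # distance to matching photo
    d_neg = np.linalg.norm(a - n)                 # distance to non-matching photo
    return max(0.0, d_pos - d_neg + margin)

# Toy usage: the positive is a slightly perturbed copy of the anchor.
rng = np.random.default_rng(5)
anchor = rng.normal(size=64)
print(round(cross_modal_triplet(anchor, anchor + 0.1 * rng.normal(size=64),
                                rng.normal(size=64)), 3))
```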
arXiv Detail & Related papers (2021-04-20T12:11:12Z)
- SCNet: Enhancing Few-Shot Semantic Segmentation by Self-Contrastive Background Prototypes [56.387647750094466]
Few-shot semantic segmentation aims to segment novel-class objects in a query image with only a few annotated examples.
Most advanced solutions exploit a metric learning framework that performs segmentation by matching each pixel to a learned foreground prototype.
This framework suffers from biased classification because sample pairs are constructed with the foreground prototype only.
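A minimal sketch of the prototype-matching baseline this line of work builds on: average a foreground prototype from annotated support pixels and classify each query pixel by its similarity to that prototype. The self-contrastive background prototypes that SCNet adds are not shown.

```python
import numpy as np

def prototype_segment(support_feats, support_mask, query_feats, thresh=0.5):
    """Few-shot segmentation by matching each query pixel to a foreground
    prototype pooled from the annotated support image.
    support_feats/query_feats: (H, W, D); support_mask: (H, W) in {0, 1}."""
    proto = support_feats[support_mask > 0].mean(axis=0)     # foreground prototype (D,)
    proto /= np.linalg.norm(proto)
    q = query_feats / np.linalg.norm(query_feats, axis=-1, keepdims=True)
    sim = q @ proto                                           # (H, W) cosine similarity
    return (sim > thresh).astype(np.uint8)

# Toy usage with random features and a random support mask.
rng = np.random.default_rng(6)
sup = rng.normal(size=(8, 8, 16))
msk = (rng.random((8, 8)) > 0.5).astype(np.uint8)
print(prototype_segment(sup, msk, rng.normal(size=(8, 8, 16))).sum(),
      "pixels predicted foreground")
```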
arXiv Detail & Related papers (2021-04-19T11:21:47Z)