Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation
- URL: http://arxiv.org/abs/2507.11030v1
- Date: Tue, 15 Jul 2025 06:51:07 GMT
- Title: Personalized OVSS: Understanding Personal Concept in Open-Vocabulary Semantic Segmentation
- Authors: Sunghyun Park, Jungsoo Lee, Shubhankar Borse, Munawar Hayat, Sungha Choi, Kyuwoong Hwang, Fatih Porikli
- Abstract summary: We introduce a novel task termed `personalized open-vocabulary semantic segmentation'. We propose a text prompt tuning-based plug-in method designed to recognize personal visual concepts using a few pairs of images and masks. We further improve the performance by enriching the representation of text prompts by injecting visual embeddings of the personal concept into them.
- Score: 59.047277629795325
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While open-vocabulary semantic segmentation (OVSS) can segment an image into semantic regions based on arbitrarily given text descriptions even for classes unseen during training, it fails to understand personal texts (e.g., `my mug cup') for segmenting regions of specific interest to users. This paper addresses challenges like recognizing `my mug cup' among `multiple mug cups'. To overcome this challenge, we introduce a novel task termed \textit{personalized open-vocabulary semantic segmentation} and propose a text prompt tuning-based plug-in method designed to recognize personal visual concepts using a few pairs of images and masks, while maintaining the performance of the original OVSS. Based on the observation that reducing false predictions is essential when applying text prompt tuning to this task, our proposed method employs `negative mask proposal' that captures visual concepts other than the personalized concept. We further improve the performance by enriching the representation of text prompts by injecting visual embeddings of the personal concept into them. This approach enhances personalized OVSS without compromising the original OVSS performance. We demonstrate the superiority of our method on our newly established benchmarks for this task, including FSS$^\text{per}$, CUB$^\text{per}$, and ADE$^\text{per}$.
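The abstract names three ingredients: tuning a text prompt for the personal concept from a few image-mask pairs, a `negative mask proposal' that suppresses false positives on non-personal regions, and injecting a visual embedding of the concept into the text prompt. The PyTorch sketch below is a minimal, hypothetical illustration of how those pieces could fit together around a frozen CLIP-like OVSS backbone; the encoder stubs, loss, and hyperparameters are assumptions, not the authors' implementation.
```python
# Minimal sketch (not the authors' code) of personalized text prompt tuning
# on top of a frozen CLIP-like OVSS model. The encoders below are tiny
# stand-ins; in practice they would be CLIP's image and text towers.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 512  # shared embedding dimension (assumption)

# Frozen backbone stand-ins.
pixel_encoder = nn.Conv2d(3, D, kernel_size=1)   # image -> per-pixel embeddings
prompt_encoder = nn.Linear(D, D)                 # prompt vector -> text embedding
for p in (*pixel_encoder.parameters(), *prompt_encoder.parameters()):
    p.requires_grad_(False)

# Learnable text prompt for the personal concept (e.g. "my mug cup").
personal_prompt = nn.Parameter(0.02 * torch.randn(D))
optimizer = torch.optim.Adam([personal_prompt], lr=1e-3)

# A few (image, binary mask) pairs; random tensors here only to keep the
# sketch runnable.
few_shot_pairs = [
    (torch.randn(1, 3, 32, 32), (torch.rand(1, 1, 32, 32) > 0.5).float())
    for _ in range(4)
]

for image, mask in few_shot_pairs:
    with torch.no_grad():
        feats = pixel_encoder(image)                                 # (1, D, H, W)
        # Visual embedding of the personal concept via masked average
        # pooling; one plausible way to "inject" appearance into the prompt.
        visual = (feats * mask).sum(dim=(2, 3)) / mask.sum().clamp(min=1.0)
    prompt = personal_prompt + visual.squeeze(0)                     # enriched prompt

    text = F.normalize(prompt_encoder(prompt), dim=0)                # (D,)
    pix = F.normalize(pixel_encoder(image), dim=1)                   # (1, D, H, W)
    logits = torch.einsum("bdhw,d->bhw", pix, text)                  # per-pixel score

    # Emulating the "negative mask proposal": pixels outside the personal
    # mask act as negatives, penalizing false positives on look-alike objects.
    loss = F.binary_cross_entropy_with_logits(logits, mask.squeeze(1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
At inference, the tuned and visually enriched prompt could simply be scored alongside the ordinary class vocabulary, which is consistent with the plug-in framing in the abstract.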
Related papers
- InvSeg: Test-Time Prompt Inversion for Semantic Segmentation [33.60580908728705]
InvSeg is a test-time prompt inversion method that tackles open-vocabulary semantic segmentation. We introduce Contrastive Soft Clustering (CSC) to align derived masks with the image's structure information. InvSeg learns context-rich text prompts in embedding space and achieves accurate semantic alignment across modalities.
arXiv Detail & Related papers (2024-10-15T10:20:31Z)
- pOps: Photo-Inspired Diffusion Operators [55.93078592427929]
pOps is a framework that trains semantic operators directly on CLIP image embeddings.
We show that pOps can be used to learn a variety of photo-inspired operators with distinct semantic meanings.
arXiv Detail & Related papers (2024-06-03T13:09:32Z)
- HARIS: Human-Like Attention for Reference Image Segmentation [5.808325471170541]
We propose a referring image segmentation method called HARIS, which introduces the Human-Like Attention mechanism.
Our method achieves state-of-the-art performance and strong zero-shot generalization.
arXiv Detail & Related papers (2024-05-17T11:29:23Z)
- CPSeg: Finer-grained Image Semantic Segmentation via Chain-of-Thought Language Prompting [8.12405696290333]
CPSeg is a framework designed to augment image segmentation performance by integrating a novel "Chain-of-Thought" process.
We propose a new vision-language dataset, FloodPrompt, which includes images, semantic masks, and corresponding text information.
arXiv Detail & Related papers (2023-10-24T13:32:32Z)
- Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation [100.81837601210597]
We propose Concept Curation (CoCu) to bridge the gap between visual and textual semantics in pre-training data.
CoCu achieves strong zero-shot transfer performance and boosts the language-supervised segmentation baseline by a large margin.
arXiv Detail & Related papers (2023-09-24T00:05:39Z)
- CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation [56.58365347854647]
We introduce a novel cost-based approach to adapt vision-language foundation models, notably CLIP.
Our method effectively adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders (a minimal sketch of the underlying pixel-text cost volume appears after this list).
arXiv Detail & Related papers (2023-03-21T12:28:21Z)
- Visually-augmented pretrained language models for NLP tasks without images [77.74849855049523]
Existing solutions often rely on explicit images for visual knowledge augmentation.
We propose a novel Visually-Augmented fine-tuning approach.
Our approach can consistently improve the performance of BERT, RoBERTa, BART, and T5 at different scales.
arXiv Detail & Related papers (2022-12-15T16:13:25Z)
- Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to solve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
- Weakly Supervised Visual Semantic Parsing [49.69377653925448]
Scene Graph Generation (SGG) aims to extract entities, predicates and their semantic structure from images.
Existing SGG methods require millions of manually annotated bounding boxes for training.
We propose VSPNet, a graph-based weakly supervised learning framework for visual semantic parsing.
arXiv Detail & Related papers (2020-01-08T03:46:13Z)
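Several of the papers above (CAT-Seg, VTP-OVD, and the personalized OVSS method itself) rest on the same primitive: a dense similarity, or cost volume, between per-pixel image embeddings and per-class text embeddings. The sketch below shows only that primitive, with random tensors standing in for CLIP features; it is an illustrative assumption, not code from any of the listed papers.
```python
# Illustrative sketch of the pixel-text cost volume that cost-aggregation
# OVSS methods such as CAT-Seg operate on. Random tensors replace CLIP features.
import torch
import torch.nn.functional as F

B, D, H, W, C = 1, 512, 16, 16, 3                           # batch, dim, spatial, classes
pixel_feats = F.normalize(torch.randn(B, D, H, W), dim=1)   # per-pixel image embeddings
text_embeds = F.normalize(torch.randn(C, D), dim=1)         # one embedding per class name

# Cosine similarity between every pixel and every class prompt.
cost = torch.einsum("bdhw,cd->bchw", pixel_feats, text_embeds)   # (B, C, H, W)

# Real methods refine this volume with spatial/class aggregation; taking the
# per-pixel argmax directly is the naive zero-shot segmentation baseline.
segmentation = cost.argmax(dim=1)                                # (B, H, W)
```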