Explicit Visual Prompting for Low-Level Structure Segmentations
- URL: http://arxiv.org/abs/2303.10883v2
- Date: Tue, 21 Mar 2023 07:25:09 GMT
- Title: Explicit Visual Prompting for Low-Level Structure Segmentations
- Authors: Weihuang Liu, Xi Shen, Chi-Man Pun, Xiaodong Cun
- Abstract summary: We propose a new visual prompting model, named Explicit Visual Prompting (EVP).
EVP significantly outperforms other parameter-efficient tuning protocols with the same number of tunable parameters.
EVP also achieves state-of-the-art performance on diverse low-level structure segmentation tasks.
- Score: 55.51869354956533
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the generic problem of detecting low-level structures in images,
which includes segmenting the manipulated parts, identifying out-of-focus
pixels, separating shadow regions, and detecting concealed objects. Whereas
each such topic has been typically addressed with a domain-specific solution,
we show that a unified approach performs well across all of them. We take
inspiration from the widely-used pre-training and then prompt tuning protocols
in NLP and propose a new visual prompting model, named Explicit Visual
Prompting (EVP). Unlike previous visual prompting, which typically learns a
dataset-level implicit embedding, our key insight is to make the tunable
parameters focus on the explicit visual content of each individual image,
i.e., the features from frozen patch embeddings and the input's
high-frequency components. The proposed EVP significantly outperforms other
parameter-efficient tuning protocols with the same number of tunable
parameters (5.7% extra trainable parameters per task). EVP also achieves
state-of-the-art performance on diverse low-level structure segmentation
tasks compared to task-specific solutions. Our code is available at:
https://github.com/NiFangBaAGe/Explicit-Visual-Prompt.
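The two prompt sources named in the abstract are concrete enough to sketch. Below is a minimal, illustrative PyTorch sketch, not the authors' implementation (see the linked repository for that): high-frequency content is obtained here by masking low frequencies in the Fourier domain, and a small trainable adapter combines it with the frozen patch embeddings while the backbone stays frozen. The module names, bottleneck size, and masking ratio are placeholders.

```python
import torch
import torch.nn as nn


def high_frequency_components(x: torch.Tensor, mask_ratio: float = 0.25) -> torch.Tensor:
    """Keep only the high-frequency content of an image batch (B, C, H, W) by
    zeroing a centered low-frequency square in the Fourier domain. The exact
    extraction used by EVP may differ; this is one simple variant."""
    freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"))
    _, _, H, W = x.shape
    h, w = int(H * mask_ratio), int(W * mask_ratio)
    cy, cx = H // 2, W // 2
    freq[..., cy - h // 2 : cy + h // 2, cx - w // 2 : cx + w // 2] = 0
    return torch.fft.ifft2(torch.fft.ifftshift(freq), norm="ortho").real


class PromptAdapter(nn.Module):
    """Tiny trainable adapter that turns the frozen patch embeddings and the
    high-frequency patch embeddings into an additive prompt for the backbone."""

    def __init__(self, dim: int, bottleneck: int = 32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)  # the only trainable parameters
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, patch_embed: torch.Tensor, hf_embed: torch.Tensor) -> torch.Tensor:
        # The prompt depends on the content of each individual image,
        # not on a dataset-level learned embedding.
        return self.up(self.act(self.down(patch_embed + hf_embed)))
```

In use, hf_embed would be the (frozen) patch embedding of high_frequency_components(x), the backbone would consume patch_embed plus the adapter output at selected stages, and only the adapter weights would be trained, which is what keeps the per-task overhead to a few percent of the model.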
Related papers
- LoR-VP: Low-Rank Visual Prompting for Efficient Vision Model Adaptation [41.77434289193232]
We propose a novel visual prompt design, introducing Low-Rank matrix multiplication for Visual Prompting (LoR-VP)
LoR-VP enables shared and patch-specific information across rows and columns of image pixels.
Experiments demonstrate significant improvements in both performance and efficiency compared to state-of-the-art visual prompting methods.
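The low-rank construction mentioned above can be illustrated in a few lines. The sketch below is only a plausible reading of the idea, a trainable rank-r pattern factored into row and column parameters and added to the input image, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn


class LowRankVisualPrompt(nn.Module):
    """Illustrative low-rank visual prompt: a per-channel H x W additive
    pattern factored as col @ row, so rows and columns of pixels share
    parameters. Rank r keeps the count at r * (H + W) parameters per channel."""

    def __init__(self, channels: int = 3, height: int = 224, width: int = 224, rank: int = 4):
        super().__init__()
        self.col = nn.Parameter(torch.randn(channels, height, rank) * 0.01)
        self.row = nn.Parameter(torch.randn(channels, rank, width) * 0.01)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prompt = torch.matmul(self.col, self.row)  # (C, H, W) low-rank pattern
        return x + prompt.unsqueeze(0)             # broadcast over the batch
```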
arXiv Detail & Related papers (2025-02-02T20:10:48Z)
- Test-Time Optimization for Domain Adaptive Open Vocabulary Segmentation [15.941958367737408]
We present Seg-TTO, a framework for zero-shot, open-vocabulary semantic segmentation (OVSS)
We focus on segmentation-specific test-time optimization to address this gap.
We integrate Seg-TTO with three state-of-the-art OVSS approaches and evaluate across 22 challenging OVSS tasks covering a range of specialized domains.
arXiv Detail & Related papers (2025-01-08T18:58:24Z)
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D is a visual-language task that segments all points of a specified object in a 3D point cloud, given a query sentence.
We propose a novel label-efficient and single-stage Referring 3D pipeline, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- Optimal Transport Aggregation for Visual Place Recognition [9.192660643226372]
We introduce SALAD, which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem.
In SALAD, we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a 'dustbin' cluster, designed to selectively discard features deemed non-informative.
Our single-stage method not only surpasses single-stage baselines on public VPR datasets, but also outperforms two-stage methods that add re-ranking at significantly higher cost.
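To make the optimal-transport reading concrete, the snippet below sketches a generic Sinkhorn-style soft assignment of local features to clusters with an extra 'dustbin' column that can absorb non-informative features. It illustrates the idea described in the summary rather than SALAD's actual implementation; the iteration count and normalization scheme are simplifying assumptions.

```python
import torch


def sinkhorn_assignment(scores: torch.Tensor, iters: int = 5) -> torch.Tensor:
    """scores: (N, K+1) affinities of N local features to K clusters plus a
    final 'dustbin' column that absorbs non-informative features. Returns a
    soft-assignment matrix after alternating row/column normalization."""
    log_p = scores.log_softmax(dim=-1)
    for _ in range(iters):
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)   # normalize each cluster column
        log_p = log_p - torch.logsumexp(log_p, dim=-1, keepdim=True)  # normalize each feature row
    return log_p.exp()


def aggregate(features: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
    """Aggregate N local features (N, D) into K cluster descriptors (K, D),
    dropping the dustbin column so discarded features do not contribute."""
    assign = sinkhorn_assignment(scores)[:, :-1]        # (N, K)
    return torch.einsum("nk,nd->kd", assign, features)  # weighted sums per cluster
```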
arXiv Detail & Related papers (2023-11-27T15:46:19Z)
- Visual In-Context Prompting [100.93587329049848]
In this paper, we introduce a universal visual in-context prompting framework for vision tasks such as open-set segmentation and detection.
We build on top of an encoder-decoder architecture, and develop a versatile prompt encoder to support a variety of prompts like strokes, boxes, and points.
Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities.
arXiv Detail & Related papers (2023-11-22T18:59:48Z)
- A Dynamic Feature Interaction Framework for Multi-task Visual Perception [100.98434079696268]
We devise an efficient unified framework to solve multiple common perception tasks.
These tasks include instance segmentation, semantic segmentation, monocular 3D detection, and depth estimation.
Our proposed framework, termed D2BNet, demonstrates a unique approach to parameter-efficient predictions for multi-task perception.
arXiv Detail & Related papers (2023-06-08T09:24:46Z)
- Explicit Visual Prompting for Universal Foreground Segmentations [55.51869354956533]
We present a unified framework for a number of foreground segmentation tasks without any task-specific designs.
We take inspiration from the widely-used pre-training and then prompt tuning protocols in NLP.
Our method freezes a pre-trained model and then learns task-specific knowledge using a few extra parameters.
arXiv Detail & Related papers (2023-05-29T11:05:01Z)
- Exploring Sparse Visual Prompt for Domain Adaptive Dense Prediction [22.868597464136787]
We propose a novel Sparse Visual Domain Prompts (SVDP) approach, which keeps the trainable parameters of the image-level prompt minimal while preserving more of the input's spatial information.
Our proposed method achieves state-of-the-art performance in both semantic segmentation and depth estimation tasks.
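One plausible reading of a sparse image-level prompt is sketched below: trainable offsets placed at a small set of pixel locations, leaving the rest of the input untouched. The placement strategy (random here) and the parameterization are placeholders rather than the paper's method.

```python
import torch
import torch.nn as nn


class SparseVisualPrompt(nn.Module):
    """Illustrative sparse prompt: learnable offsets at a fixed, randomly chosen
    set of pixel positions; every other pixel passes through unchanged."""

    def __init__(self, channels: int = 3, height: int = 512, width: int = 512,
                 num_points: int = 256):
        super().__init__()
        idx = torch.randperm(height * width)[:num_points]
        self.register_buffer("ys", idx.div(width, rounding_mode="floor"))
        self.register_buffer("xs", idx.remainder(width))
        # num_points * channels trainable values, far fewer than a dense H x W prompt.
        self.values = nn.Parameter(torch.zeros(num_points, channels))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); add the learned offsets only at the sparse locations.
        x = x.clone()
        x[:, :, self.ys, self.xs] = x[:, :, self.ys, self.xs] + self.values.t()
        return x
```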
arXiv Detail & Related papers (2023-03-17T06:26:55Z)
- SceneEncoder: Scene-Aware Semantic Segmentation of Point Clouds with A Learnable Scene Descriptor [51.298760338410624]
We propose a SceneEncoder module that imposes scene-aware guidance to enhance the effect of global information.
The module predicts a scene descriptor, which learns to represent the categories of objects present in the scene.
We also design a region similarity loss to propagate distinguishing features to neighboring points with the same label.
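One way to read the scene descriptor is as a global multi-label prediction of which object categories appear in the scene, used to re-weight per-point features. The sketch below follows that reading; the pooling, gating, and layer sizes are assumptions, not the paper's design.

```python
import torch
import torch.nn as nn


class SceneDescriptorHead(nn.Module):
    """Illustrative scene descriptor: pool per-point features (B, N, D) into a
    global vector, predict which categories are present in the scene, and use
    that prediction to re-weight the per-point features."""

    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        self.classifier = nn.Linear(dim, num_classes)  # multi-label presence logits
        self.project = nn.Linear(num_classes, dim)     # map presence back to feature space

    def forward(self, point_feats: torch.Tensor):
        global_feat = point_feats.max(dim=1).values            # (B, D) global pooling
        scene_logits = self.classifier(global_feat)            # (B, num_classes)
        gate = torch.sigmoid(self.project(torch.sigmoid(scene_logits)))  # (B, D)
        # Scene-aware guidance: per-point features are modulated by the
        # predicted scene content; scene_logits can be supervised with BCE.
        return point_feats * gate.unsqueeze(1), scene_logits
```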
arXiv Detail & Related papers (2020-01-24T16:53:30Z)
This list is automatically generated from the titles and abstracts of the papers on this site.