OpenSD: Unified Open-Vocabulary Segmentation and Detection
- URL: http://arxiv.org/abs/2312.06703v1
- Date: Sun, 10 Dec 2023 08:51:34 GMT
- Title: OpenSD: Unified Open-Vocabulary Segmentation and Detection
- Authors: Shuai Li, Minghan Li, Pengfei Wang, Lei Zhang
- Abstract summary: We present a universal transformer-based framework, abbreviated as OpenSD, to handle open-vocabulary segmentation and detection tasks.
To better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain.
The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings.
- Score: 24.08879095731279
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, a few open-vocabulary methods have been proposed by employing a
unified architecture to tackle generic segmentation and detection tasks.
However, their performance still lags behind the task-specific models due to
the conflict between different tasks, and their open-vocabulary capability is
limited due to the inadequate use of CLIP. To address these challenges, we
present a universal transformer-based framework, abbreviated as OpenSD, which
utilizes the same architecture and network parameters to handle open-vocabulary
segmentation and detection tasks. First, we introduce a decoder decoupled
learning strategy to alleviate the semantic conflict between thing and stuff
categories so that each individual task can be learned more effectively under
the same framework. Second, to better leverage CLIP for end-to-end segmentation
and detection, we propose dual classifiers to handle the in-vocabulary domain
and out-of-vocabulary domain, respectively. The text encoder is further trained
to be region-aware for both thing and stuff categories through decoupled prompt
learning, enabling them to filter out duplicated and low-quality predictions,
which is important to end-to-end segmentation and detection. Extensive
experiments are conducted on multiple datasets under various circumstances. The
results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary
segmentation and detection methods in both closed- and open-vocabulary
settings. Code is available at https://github.com/strongwolf/OpenSD
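
The dual-classifier idea in the abstract can be illustrated with a small sketch. The abstract does not say how the two classifiers' outputs are combined, so everything below is an assumption made for illustration: the function name `dual_classify`, the `seen_mask` routing rule, the shared text embeddings, and the temperature value are all hypothetical and are not taken from the OpenSD code. Here an in-vocabulary classifier scores a learned query embedding, an out-of-vocabulary classifier scores a CLIP-style region embedding, and each category takes its probability from the classifier responsible for its domain.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def dual_classify(query_emb, clip_emb, text_embs, seen_mask, temperature=0.07):
    """Hypothetical fusion rule (an assumption, not the paper's confirmed method).

    query_emb : embedding scored by the in-vocabulary classifier
    clip_emb  : CLIP-style region embedding scored by the out-of-vocabulary classifier
    text_embs : one text embedding per category name
    seen_mask : True where the category was seen during training
    """
    in_probs = softmax([cosine(query_emb, t) / temperature for t in text_embs])
    out_probs = softmax([cosine(clip_emb, t) / temperature for t in text_embs])
    # Seen categories take the in-vocabulary probability,
    # unseen categories take the out-of-vocabulary probability.
    fused = [p_in if seen else p_out
             for p_in, p_out, seen in zip(in_probs, out_probs, seen_mask)]
    total = sum(fused)
    return [f / total for f in fused]  # renormalize to a distribution


if __name__ == "__main__":
    text_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    probs = dual_classify([0.2, 1.0], [1.0, 1.0], text_embs, [True, True, False])
    print(probs)
```

A per-category switch is only one possible design. Other open-vocabulary systems blend the two scores with a weighted geometric mean instead, and which rule OpenSD uses should be checked against the paper and the released code.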
Related papers
- GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse Function (GSSF) to capture powerful relationships across modalities for pair-wise similarity learning.
The distance metric delicately encapsulates two formats of diagonal and block-diagonal terms.
Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z)
- MROVSeg: Breaking the Resolution Curse of Vision-Language Models in Open-Vocabulary Semantic Segmentation [33.67313662538398]
We propose a multi-resolution training framework for open-vocabulary semantic segmentation with a single pretrained CLIP backbone.
MROVSeg uses sliding windows to slice the high-resolution input into uniform patches, each matching the input size of the well-trained image encoder.
We demonstrate the superiority of MROVSeg on well-established open-vocabulary semantic segmentation benchmarks.
arXiv Detail & Related papers (2024-08-27T04:45:53Z)
- SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection [31.464227593768324]
We introduce Semantic Hierarchy Nexus (SHiNe), a novel classifier that uses semantic knowledge from class hierarchies.
SHiNe enhances robustness across diverse vocabulary granularities, achieving up to +31.9% mAP50 with ground truth hierarchies.
SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector.
arXiv Detail & Related papers (2024-05-16T12:42:06Z)
- LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z)
- Open-vocabulary Panoptic Segmentation with Embedding Modulation [71.15502078615587]
Open-vocabulary image segmentation is attracting increasing attention due to its critical applications in the real world.
Traditional closed-vocabulary segmentation methods are not able to characterize novel objects, whereas several recent open-vocabulary attempts obtain unsatisfactory results.
We propose OPSNet, an omnipotent and data-efficient framework for open-vocabulary panoptic segmentation.
arXiv Detail & Related papers (2023-03-20T17:58:48Z)
- Global Knowledge Calibration for Fast Open-Vocabulary Segmentation [124.74256749281625]
We introduce a text diversification strategy that generates a set of synonyms for each training category.
We also employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP.
Our proposed model achieves robust generalization performance across various datasets.
arXiv Detail & Related papers (2023-03-16T09:51:41Z)
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- Joint Inductive and Transductive Learning for Video Object Segmentation [107.32760625159301]
Semi-supervised object segmentation is a task of segmenting the target object in a video sequence given only a mask in the first frame.
Most previous best-performing methods adopt matching-based transductive reasoning or online inductive learning.
We propose to integrate transductive and inductive learning into a unified framework to exploit the complementarity between them for accurate and robust video object segmentation.
arXiv Detail & Related papers (2021-08-08T16:25:48Z)
- Segmental Contrastive Predictive Coding for Unsupervised Word Segmentation [33.35220574193796]
We propose a segmental contrastive predictive coding (SCPC) framework that can model the signal structure at a higher level, e.g., at the phoneme level.
A differentiable boundary detector finds variable-length segments, which are then used to optimize a segment encoder via NCE.
We show that our single model outperforms existing phoneme and word segmentation methods on TIMIT and Buckeye datasets.
arXiv Detail & Related papers (2021-06-03T23:12:05Z)
- Semi-supervised Medical Image Segmentation through Dual-task Consistency [18.18484640332254]
We propose a novel dual-task deep network that jointly predicts a pixel-wise segmentation map and a geometry-aware level set representation of the target.
Our method can largely improve the performance by incorporating the unlabeled data.
Our framework outperforms the state-of-the-art semi-supervised medical image segmentation methods.
arXiv Detail & Related papers (2020-09-09T17:49:21Z)
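
The MROVSeg entry above mentions slicing a high-resolution input into uniform patches that match the image encoder's input size. The sketch below shows one way such sliding-window coordinates could be computed. The function name `sliding_windows`, the 224-pixel window, and the rule of shifting the last window so it stays inside the image are illustrative assumptions, not details taken from the MROVSeg paper.

```python
def sliding_windows(height, width, window, stride):
    """Return (top, left, bottom, right) boxes of size window x window.

    The boxes cover the whole image. If the stride does not divide the
    image evenly, the last window along an axis is shifted back so that
    it ends flush with the border (an assumed policy for this sketch).
    """
    if height < window or width < window:
        raise ValueError("image is smaller than the window")

    def starts(size):
        pos = list(range(0, size - window + 1, stride))
        if pos[-1] != size - window:
            pos.append(size - window)  # final window flush with the border
        return pos

    return [(t, l, t + window, l + window)
            for t in starts(height)
            for l in starts(width)]


if __name__ == "__main__":
    # Hypothetical 448 x 672 input cut into 224 x 224 patches.
    for box in sliding_windows(448, 672, window=224, stride=224):
        print(box)
```

Each box can then be used to crop the image, and the crops can be passed to the encoder one at a time or as a batch.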
This list is automatically generated from the titles and abstracts of the papers in this site.