A Simple Framework for Open-Vocabulary Segmentation and Detection
- URL: http://arxiv.org/abs/2303.08131v3
- Date: Mon, 20 Mar 2023 10:52:40 GMT
- Title: A Simple Framework for Open-Vocabulary Segmentation and Detection
- Authors: Hao Zhang, Feng Li, Xueyan Zou, Shilong Liu, Chunyuan Li, Jianfeng
Gao, Jianwei Yang, Lei Zhang
- Abstract summary: We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
- Score: 85.21641508535679
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection
framework that jointly learns from different segmentation and detection
datasets. To bridge the gap of vocabulary and annotation granularity, we first
introduce a pre-trained text encoder to encode all the visual concepts in the
two tasks and learn a common semantic space for them. This gives us reasonably
good results compared with counterparts trained on the segmentation task only.
To further reconcile them, we identify two discrepancies: $i$) task discrepancy --
segmentation requires extracting masks for both foreground objects and
background stuff, while detection cares only about the former; $ii$) data
discrepancy -- box and mask annotations are of different spatial granularity
and thus not directly interchangeable. To address these issues, we propose
decoupled decoding to reduce the interference between foreground and background,
and conditioned mask decoding to assist in generating masks for given boxes. To
this end, we develop a simple encoder-decoder model encompassing all three
techniques and train it jointly on COCO and Objects365. After pre-training, our
model exhibits competitive or stronger zero-shot transferability for both
segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art
method for open-vocabulary instance and panoptic segmentation across 5
datasets, and outperforms previous work for open-vocabulary detection on LVIS
and ODinW under similar settings. When transferred to specific tasks, our model
achieves new SoTA for panoptic segmentation on COCO and ADE20K, and instance
segmentation on ADE20K and Cityscapes.
Finally, we note that OpenSeeD is the first to explore the potential of joint
training on segmentation and detection, and we hope it can serve as a strong
baseline for developing a single model for both tasks in the open world.
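As a rough illustration of the shared semantic space described in the abstract (a minimal sketch, not the authors' released code; the CLIP-like text-encoder interface, tensor shapes, and class names below are assumptions), category names from both the detection and segmentation vocabularies can be embedded with a frozen pre-trained text encoder, and each decoder query is then classified by cosine similarity against those text embeddings:

```python
# Minimal sketch of open-vocabulary classification via a shared text-embedding
# space (illustrative only; a CLIP-like `text_encoder` callable that maps a
# list of class names to embeddings is assumed, not taken from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F


class OpenVocabClassifier(nn.Module):
    def __init__(self, text_encoder):
        super().__init__()
        self.text_encoder = text_encoder  # frozen, pre-trained (assumed)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / 0.07).log())

    @torch.no_grad()
    def encode_vocabulary(self, class_names):
        # Encode every category name (detection "things" plus segmentation
        # "things" and "stuff") into one common semantic space.
        text_emb = self.text_encoder(class_names)  # (num_classes, dim), assumed API
        return F.normalize(text_emb, dim=-1)

    def forward(self, query_emb, text_emb):
        # query_emb: (batch, num_queries, dim) from the mask/box decoder.
        query_emb = F.normalize(query_emb, dim=-1)
        # Cosine similarity against all class-name embeddings yields
        # open-vocabulary classification logits over the joint vocabulary.
        return self.logit_scale.exp() * query_emb @ text_emb.t()
```

Replacing a fixed classification head with this similarity over text embeddings is what allows the same model to score categories for which it never saw boxes or masks during training.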
Related papers
- DQFormer: Towards Unified LiDAR Panoptic Segmentation with Decoupled Queries [14.435906383301555]
We propose a novel framework dubbed DQFormer to implement semantic and instance segmentation in a unified workflow.
Specifically, we design a decoupled query generator to propose informative queries with semantics by localizing things/stuff positions.
We also introduce a query-oriented mask decoder to decode corresponding segmentation masks.
arXiv Detail & Related papers (2024-08-28T14:14:33Z)
- SegVG: Transferring Object Bounding Box to Segmentation for Visual Grounding [56.079013202051094]
We present SegVG, a novel method that transfers box-level annotations into signals that provide additional pixel-level supervision for Visual Grounding.
This approach allows us to iteratively exploit the annotation as signals for both box-level regression and pixel-level segmentation.
arXiv Detail & Related papers (2024-07-03T15:30:45Z)
- OpenSD: Unified Open-Vocabulary Segmentation and Detection [24.08879095731279]
We present a universal transformer-based framework, abbreviated as OpenSD, to handle open-vocabulary segmentation and detection tasks.
To better leverage CLIP for end-to-end segmentation and detection, we propose dual classifiers to handle the in-vocabulary domain and out-of-vocabulary domain.
The results demonstrate that OpenSD outperforms state-of-the-art open-vocabulary segmentation and detection methods in both closed- and open-vocabulary settings.
arXiv Detail & Related papers (2023-12-10T08:51:34Z)
- Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS).
We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes.
By integrating the guidance of class semantic knowledge with supplementary visual structure cues from edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
- OG: Equip vision occupancy with instance segmentation and visual grounding [1.0260983653504128]
Occupancy prediction tasks focus on the inference of both geometry and semantic labels for each voxel.
This paper proposes Occupancy Grounding (OG), a novel method that equips vanilla occupancy prediction with instance segmentation ability.
Keys to our approach are (1) affinity field prediction for instance clustering and (2) association strategy for aligning 2D instance masks and 3D occupancy instances.
arXiv Detail & Related papers (2023-07-12T01:59:26Z)
- Segment Everything Everywhere All at Once [124.90835636901096]
We present SEEM, a promptable and interactive model for segmenting everything everywhere all at once in an image.
We propose a novel decoding mechanism that enables diverse prompting for all types of segmentation tasks.
We conduct a comprehensive empirical study to validate the effectiveness of SEEM across diverse segmentation tasks.
arXiv Detail & Related papers (2023-04-13T17:59:40Z)
- Towards Universal Vision-language Omni-supervised Segmentation [72.31277932442988]
We present Vision-Language Omni-Supervised (VLOSS), which treats open-world segmentation tasks as proposal classification.
We incorporate omni-supervised data (i.e., panoptic segmentation data, object detection data, and image-text pairs) into training, thus enriching the open-world segmentation ability.
With fewer parameters, our VLOSS with Swin-Tiny surpasses MaskCLIP by 2% in terms of mask AP on the LVIS v1 dataset.
arXiv Detail & Related papers (2023-03-12T02:57:53Z)
- Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any dense-annotation effort.
Our method can directly segment objects of arbitrary categories, and it outperforms zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.