Hierarchical Open-vocabulary Universal Image Segmentation
- URL: http://arxiv.org/abs/2307.00764v2
- Date: Thu, 21 Dec 2023 18:28:31 GMT
- Title: Hierarchical Open-vocabulary Universal Image Segmentation
- Authors: Xudong Wang and Shufan Li and Konstantinos Kallidromitis and Yusuke
Kato and Kazuki Kozuka and Trevor Darrell
- Abstract summary: Open-vocabulary image segmentation aims to partition an image into semantic regions according to arbitrary text descriptions.
We propose a decoupled text-image fusion mechanism and representation learning modules for both "things" and "stuff"
Our resulting model, named HIPIE tackles, HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a unified framework.
- Score: 48.008887320870244
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary image segmentation aims to partition an image into semantic
regions according to arbitrary text descriptions. However, complex visual
scenes can be naturally decomposed into simpler parts and abstracted at
multiple levels of granularity, introducing inherent segmentation ambiguity.
Unlike existing methods that typically sidestep this ambiguity and treat it as
an external factor, our approach actively incorporates a hierarchical
representation encompassing different semantic-levels into the learning
process. We propose a decoupled text-image fusion mechanism and representation
learning modules for both "things" and "stuff". Additionally, we systematically
examine the differences that exist in the textual and visual features between
these types of categories. Our resulting model, named HIPIE, tackles
HIerarchical, oPen-vocabulary, and unIvErsal segmentation tasks within a
unified framework. Benchmarked on over 40 datasets, e.g., ADE20K, COCO,
Pascal-VOC Part, RefCOCO/RefCOCOg, ODinW and SeginW, HIPIE achieves the
state-of-the-art results at various levels of image comprehension, including
semantic-level (e.g., semantic segmentation), instance-level (e.g.,
panoptic/referring segmentation and object detection), as well as part-level
(e.g., part/subpart segmentation) tasks. Our code is released at
https://github.com/berkeley-hipie/HIPIE.
Related papers
- SPIN: Hierarchical Segmentation with Subpart Granularity in Natural Images [17.98848062686217]
We introduce the first hierarchical semantic segmentation dataset with subpart annotations for natural images.
We also introduce two novel evaluation metrics to evaluate how well algorithms capture spatial and semantic relationships across hierarchical levels.
arXiv Detail & Related papers (2024-07-12T21:08:00Z) - USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation [33.11010205890195]
The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories.
We introduce the Universal Segment Embedding (USE) framework to address this challenge.
This framework is comprised of two key components: 1) a data pipeline designed to efficiently curate a large amount of segment-text pairs at various granularities, and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories.
arXiv Detail & Related papers (2024-06-07T21:41:18Z) - SegGPT: Segmenting Everything In Context [98.98487097934067]
We present SegGPT, a model for segmenting everything in context.
We unify various segmentation tasks into a generalist in-context learning framework.
SegGPT can perform arbitrary segmentation tasks in images or videos via in-context inference.
arXiv Detail & Related papers (2023-04-06T17:59:57Z) - Open-world Semantic Segmentation via Contrasting and Clustering
Vision-Language Embedding [95.78002228538841]
We propose a new open-world semantic segmentation pipeline that makes the first attempt to learn to segment semantic objects of various open-world categories without any efforts on dense annotations.
Our method can directly segment objects of arbitrary categories, outperforming zero-shot segmentation methods that require data labeling on three benchmark datasets.
arXiv Detail & Related papers (2022-07-18T09:20:04Z) - Deep Hierarchical Semantic Segmentation [76.40565872257709]
hierarchical semantic segmentation (HSS) aims at structured, pixel-wise description of visual observation in terms of a class hierarchy.
HSSN casts HSS as a pixel-wise multi-label classification task, only bringing minimal architecture change to current segmentation models.
With hierarchy-induced margin constraints, HSSN reshapes the pixel embedding space, so as to generate well-structured pixel representations.
arXiv Detail & Related papers (2022-03-27T15:47:44Z) - TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic
Segmentation [44.75300205362518]
Unsupervised semantic segmentation aims to obtain high-level semantic representation on low-level visual features without manual annotations.
We propose the first top-down unsupervised semantic segmentation framework for fine-grained segmentation in extremely complicated scenarios.
Our results show that our top-down unsupervised segmentation is robust to both object-centric and scene-centric datasets.
arXiv Detail & Related papers (2021-12-02T18:59:03Z) - Exploring Cross-Image Pixel Contrast for Semantic Segmentation [130.22216825377618]
We propose a pixel-wise contrastive framework for semantic segmentation in the fully supervised setting.
The core idea is to enforce pixel embeddings belonging to a same semantic class to be more similar than embeddings from different classes.
Our method can be effortlessly incorporated into existing segmentation frameworks without extra overhead during testing.
arXiv Detail & Related papers (2021-01-28T11:35:32Z) - Improving Semantic Segmentation via Decoupled Body and Edge Supervision [89.57847958016981]
Existing semantic segmentation approaches either aim to improve the object's inner consistency by modeling the global context, or refine objects detail along their boundaries by multi-scale feature fusion.
In this paper, a new paradigm for semantic segmentation is proposed.
Our insight is that appealing performance of semantic segmentation requires textitexplicitly modeling the object textitbody and textitedge, which correspond to the high and low frequency of the image.
We show that the proposed framework with various baselines or backbone networks leads to better object inner consistency and object boundaries.
arXiv Detail & Related papers (2020-07-20T12:11:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.