PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
- URL: http://arxiv.org/abs/2407.16696v1
- Date: Tue, 23 Jul 2024 17:58:26 GMT
- Title: PartGLEE: A Foundation Model for Recognizing and Parsing Any Objects
- Authors: Junyi Li, Junfeng Wu, Weizhi Zhao, Song Bai, Xiang Bai
- Abstract summary: We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images.
PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in the open world scenario.
- Score: 104.34288029037141
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present PartGLEE, a part-level foundation model for locating and identifying both objects and parts in images. Through a unified framework, PartGLEE accomplishes detection, segmentation, and grounding of instances at any granularity in the open-world scenario. Specifically, we propose a Q-Former to construct the hierarchical relationship between objects and parts, parsing every object into its corresponding semantic parts. By incorporating a large amount of object-level data, the hierarchical relationships can be extended, enabling PartGLEE to recognize a rich variety of parts. We conduct comprehensive studies to validate the effectiveness of our method: PartGLEE achieves state-of-the-art performance across various part-level tasks and obtains competitive results on object-level tasks. The proposed PartGLEE significantly enhances hierarchical modeling capabilities and part-level perception over our previous GLEE model. Further analysis indicates that the hierarchical cognitive ability of PartGLEE facilitates detailed image comprehension for mLLMs. The model and code will be released at https://provencestar.github.io/PartGLEE-Vision/ .
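The abstract's central mechanism, a Q-Former that parses each object into its semantic parts via learnable part queries, can be illustrated with a toy cross-attention sketch. This is not the authors' implementation: all function names, the conditioning scheme (adding the parent object's feature to each part query), and the single-head attention are hypothetical simplifications of the general Q-Former idea.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def parse_object_into_parts(object_feat, part_queries, image_tokens):
    """Toy Q-Former step: each learnable part query, conditioned on its
    parent object's feature, cross-attends over image tokens and returns
    one part embedding (the attention-weighted sum of the tokens)."""
    part_embeds = []
    for query in part_queries:
        # Hypothetical conditioning: add the parent object feature to the query.
        cond = [q + o for q, o in zip(query, object_feat)]
        weights = softmax([dot(cond, tok) for tok in image_tokens])
        embed = [sum(w * tok[d] for w, tok in zip(weights, image_tokens))
                 for d in range(len(object_feat))]
        part_embeds.append(embed)
    return part_embeds

# One object query and two part queries attending over three image tokens.
obj = [1.0, 0.0]
queries = [[0.0, 2.0], [2.0, 0.0]]
tokens = [[1.0, 1.0], [1.0, -1.0], [-1.0, 0.5]]
parts = parse_object_into_parts(obj, queries, tokens)
print(len(parts), len(parts[0]))  # one embedding per part query
```

In the sketch, each part embedding is a convex combination of the image tokens; in the real model these embeddings would go on to produce part-level boxes and masks.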
Related papers
- FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension [10.482908189805872]
Referring Expression Comprehension (REC) is a crucial cross-modal task that objectively evaluates the capabilities of language understanding, image comprehension, and language-to-image grounding.
We have established a new REC dataset characterized by two key features.
It includes negative text and images created through fine-grained editing and generation based on existing data.
arXiv Detail & Related papers (2024-09-23T06:56:51Z) - OMG-Seg: Is One Model Good Enough For All Segmentation? [83.17068644513144]
OMG-Seg is a transformer-based encoder-decoder architecture with task-specific queries and outputs.
We show that OMG-Seg can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead.
arXiv Detail & Related papers (2024-01-18T18:59:34Z) - General Object Foundation Model for Images and Videos at Scale [99.2806103051613]
We present GLEE, an object-level foundation model for locating and identifying objects in images and videos.
GLEE accomplishes detection, segmentation, tracking, grounding, and identification of arbitrary objects in the open world scenario.
We employ an image encoder, text encoder, and visual prompter to handle multi-modal inputs, enabling the model to simultaneously solve various object-centric downstream tasks.
arXiv Detail & Related papers (2023-12-14T17:26:00Z) - OV-PARTS: Towards Open-Vocabulary Part Segmentation [31.136262413989858]
Segmenting and recognizing diverse object parts is a crucial ability in applications spanning various computer vision and robotic tasks.
We propose an Open-Vocabulary Part (OV-PARTS) benchmark to investigate and tackle these challenges.
OV-PARTS includes refined versions of two publicly available datasets, Pascal-Part-116 and ADE20K-Part-234, and covers three specific tasks: Generalized Zero-Shot Part Segmentation, Cross-Dataset Part Segmentation, and Few-Shot Part Segmentation.
arXiv Detail & Related papers (2023-10-08T10:28:42Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to discover objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
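The PCA-based localization idea above can be sketched in a few lines: project centered per-pixel features onto the first principal component and threshold the projections. This is a hedged illustration of the general technique, not the paper's code; power iteration stands in for a full eigendecomposition, and the zero threshold and function name are assumptions (the principal-component sign is arbitrary, so which side is "object" must be resolved by a prior).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pca_foreground_mask(features, iters=50):
    """Toy PCA localization: center per-pixel feature vectors, find the
    leading principal component by power iteration, and threshold the
    projections at zero. The PC sign is arbitrary, so the returned mask
    may be the complement of the object region."""
    n, d = len(features), len(features[0])
    mean = [sum(f[i] for f in features) / n for i in range(d)]
    centered = [[f[i] - mean[i] for i in range(d)] for f in features]
    # Power iteration on the covariance matrix, applied implicitly as X^T(Xv).
    v = [1.0] * d
    for _ in range(iters):
        proj = [dot(row, v) for row in centered]
        w = [sum(p * row[i] for p, row in zip(proj, centered)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return [dot(row, v) > 0 for row in centered]

# Two well-separated feature clusters: the mask splits them apart.
feats = [[5.0, 0.1], [6.0, -0.1], [-5.0, 0.0], [-6.0, 0.2]]
print(pca_foreground_mask(feats))
```

With clearly separated clusters, the mask partitions the pixels into the two groups (up to the sign ambiguity noted above).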
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts [28.922958261132475]
We learn cross-category skills via Generalizable and Actionable Parts (GAParts).
Based on GAPartNet, we investigate three cross-category tasks: part segmentation, part pose estimation, and part-based object manipulation.
Our method outperforms all existing methods by a large margin on both seen and unseen categories.
arXiv Detail & Related papers (2022-11-10T00:30:22Z) - Segmenting Moving Objects via an Object-Centric Layered Representation [100.26138772664811]
We introduce an object-centric segmentation model with a depth-ordered layer representation.
We introduce a scalable pipeline for generating synthetic training data with multiple objects.
We evaluate the model on standard video segmentation benchmarks.
arXiv Detail & Related papers (2022-07-05T17:59:43Z) - Look-into-Object: Self-supervised Structure Modeling for Object Recognition [71.68524003173219]
We propose to "look into object" (explicitly yet intrinsically model the object structure) through incorporating self-supervisions.
We show the recognition backbone can be substantially enhanced for more robust representation learning.
Our approach achieves large performance gains on a number of benchmarks, including generic object recognition (ImageNet) and fine-grained object recognition tasks (CUB, Cars, Aircraft).
arXiv Detail & Related papers (2020-03-31T12:22:51Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.