OG: Equip vision occupancy with instance segmentation and visual grounding
- URL: http://arxiv.org/abs/2307.05873v1
- Date: Wed, 12 Jul 2023 01:59:26 GMT
- Title: OG: Equip vision occupancy with instance segmentation and visual grounding
- Authors: Zichao Dong, Hang Ji, Weikun Zhang, Xufeng Huang, Junbo Chen
- Abstract summary: Occupancy prediction tasks focus on the inference of both geometry and semantic labels for each voxel.
This paper proposes Occupancy Grounding (OG), a novel method that equips vanilla occupancy prediction with instance segmentation ability.
Keys to the approach are (1) affinity field prediction for instance clustering and (2) an association strategy for aligning 2D instance masks with 3D occupancy instances.
- Score: 1.0260983653504128
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Occupancy prediction tasks focus on inferring both geometry and
semantic labels for each voxel, which is an important perception mission.
However, occupancy prediction remains a semantic segmentation task that does
not distinguish individual instances. Further, although some existing works,
such as Open-Vocabulary Occupancy (OVO), have already addressed open-vocabulary
detection, visual grounding in occupancy has not been solved to the best of
our knowledge. To tackle these two limitations, this paper proposes Occupancy
Grounding (OG), a novel method that equips vanilla occupancy prediction with
instance segmentation ability and can perform visual grounding at the voxel
level with the help of Grounded-SAM. Keys to our approach are (1) affinity
field prediction for instance clustering and (2) an association strategy for
aligning 2D instance masks with 3D occupancy instances. Extensive experiments
have been conducted, with visualization results and analysis presented. Our
code will be released publicly soon.
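The two keys named in the abstract (affinity field prediction for instance clustering, and an association strategy between 2D masks and 3D occupancy instances) suggest a simple two-stage pipeline. The sketch below is only a minimal illustration of that idea, not the authors' implementation: the function names, tensor shapes, the union-find clustering, the 0.5 m voxel size, and the pinhole camera model with known intrinsics/extrinsics are all assumptions of mine, and the 2D masks are assumed to come from something like Grounded-SAM.

```python
# Hypothetical sketch of (1) affinity-based voxel instance clustering and
# (2) 2D-3D instance association. Shapes, names, and camera conventions are
# assumed; this is NOT the paper's released code.
import numpy as np


def cluster_instances(occupancy, affinity, threshold=0.5):
    """Group occupied voxels into instances with union-find.

    occupancy: (X, Y, Z) bool array, True where a voxel is occupied.
    affinity:  (3, X, Y, Z) float array; affinity[d, x, y, z] is the predicted
               probability that voxel (x, y, z) and its +axis-d neighbour
               belong to the same instance.
    Returns an (X, Y, Z) int array of instance ids (-1 for free space).
    """
    parent = {tuple(v): tuple(v) for v in np.argwhere(occupancy)}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    offsets = np.eye(3, dtype=int)
    for v in list(parent):
        for d in range(3):
            n = tuple(np.asarray(v) + offsets[d])
            if n in parent and affinity[d][v] > threshold:
                ra, rb = find(v), find(n)
                if ra != rb:
                    parent[ra] = rb  # merge the two voxel groups

    ids = -np.ones(occupancy.shape, dtype=int)
    roots = {}
    for v in parent:
        ids[v] = roots.setdefault(find(v), len(roots))
    return ids


def associate_2d_3d(instance_ids, masks_2d, intrinsics, cam_from_world,
                    voxel_size=0.5):
    """Match each 3D occupancy instance to the 2D mask it overlaps the most.

    instance_ids:   (X, Y, Z) int array from `cluster_instances`.
    masks_2d:       (M, H, W) bool array of 2D instance masks
                    (e.g. produced by Grounded-SAM for one camera image).
    intrinsics:     (3, 3) pinhole camera matrix.
    cam_from_world: (4, 4) world-to-camera rigid transform.
    Assumes the voxel grid origin coincides with the world origin.
    Returns {3d_instance_id: index of the best 2D mask, or None}.
    """
    _, H, W = masks_2d.shape
    matches = {}
    for inst in range(int(instance_ids.max()) + 1):
        # Voxel centres of this instance, in world coordinates (metres).
        centres = (np.argwhere(instance_ids == inst) + 0.5) * voxel_size
        homo = np.c_[centres, np.ones(len(centres))]       # (N, 4)
        cam = (cam_from_world @ homo.T)[:3]                 # (3, N)
        cam = cam[:, cam[2] > 1e-3]                         # keep points in front
        pix = intrinsics @ cam
        u = np.round(pix[0] / pix[2]).astype(int)
        v = np.round(pix[1] / pix[2]).astype(int)
        ok = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        if not ok.any():
            matches[inst] = None
            continue
        hits = masks_2d[:, v[ok], u[ok]].sum(axis=1)        # overlap per mask
        matches[inst] = int(hits.argmax()) if hits.max() > 0 else None
    return matches
```

The paper's actual association criterion may differ (for example, IoU after rasterising the projected voxels, or per-class matching); the sketch only shows the overall data flow from affinity fields to clustered voxel instances to 2D mask assignments.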
Related papers
- ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting [24.56720920528011]
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting them to embodied decision-making in open-world environments presents challenges.
A key issue is the difficulty in smoothly connecting individual entities in low-level observations with abstract concepts required for planning.
We propose visual-temporal context prompting, a novel communication protocol between VLMs and policy models.
arXiv Detail & Related papers (2024-10-23T13:26:59Z)
- In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation [50.79940712523551]
We present lazy visual grounding, a two-stage approach of unsupervised object mask discovery followed by object grounding.
Our model requires no additional training yet shows great performance on five public datasets.
arXiv Detail & Related papers (2024-08-09T09:28:35Z)
- Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks.
APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection.
Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
- A Simple Framework for Open-Vocabulary Segmentation and Detection [85.21641508535679]
We present OpenSeeD, a simple Open-vocabulary and Detection framework that jointly learns from different segmentation and detection datasets.
We first introduce a pre-trained text encoder to encode all the visual concepts in two tasks and learn a common semantic space for them.
After pre-training, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection.
arXiv Detail & Related papers (2023-03-14T17:58:34Z)
- UniVIP: A Unified Framework for Self-Supervised Visual Pre-training [50.87603616476038]
We propose a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic datasets.
Massive experiments show that UniVIP pre-trained on non-iconic COCO achieves state-of-the-art transfer performance.
Our method can also exploit single-centric-object datasets such as ImageNet and outperforms BYOL by 2.5% with the same pre-training epochs in linear probing.
arXiv Detail & Related papers (2022-03-14T10:04:04Z)
- Semantic Tracklets: An Object-Centric Representation for Visual Multi-Agent Reinforcement Learning [126.57680291438128]
We study whether scalability can be achieved via a disentangled representation.
We evaluate semantic tracklets on the visual multi-agent particle environment (VMPE) and on the challenging visual multi-agent GFootball environment.
Notably, this method is the first to successfully learn a strategy for five players in the GFootball environment using only visual data.
arXiv Detail & Related papers (2021-08-06T22:19:09Z)
- Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals [78.12377360145078]
We introduce a novel two-step framework that adopts a predetermined prior in a contrastive optimization objective to learn pixel embeddings.
This marks a large deviation from existing works that relied on proxy tasks or end-to-end clustering.
In particular, when fine-tuning the learned representations using just 1% of labeled examples on PASCAL, we outperform supervised ImageNet pre-training by 7.1% mIoU.
arXiv Detail & Related papers (2021-02-11T18:54:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.