InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
- URL: http://arxiv.org/abs/2103.01128v1
- Date: Mon, 1 Mar 2021 16:59:27 GMT
- Title: InstanceRefer: Cooperative Holistic Understanding for Visual Grounding on Point Clouds through Instance Multi-level Contextual Referring
- Authors: Zhihao Yuan, Xu Yan, Yinghong Liao, Ruimao Zhang, Zhen Li, Shuguang Cui
- Abstract summary: We propose a new model, named InstanceRefer, to achieve superior 3D visual grounding on point clouds.
Our model first filters instances from panoptic segmentation on point clouds to obtain a small number of candidates.
Experiments confirm that our InstanceRefer outperforms previous state-of-the-art methods by a large margin.
- Score: 38.13420293700949
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Compared with visual grounding in 2D images, natural-language-guided 3D object localization on point clouds is more challenging due to the sparse and disordered nature of the data. In this paper, we propose a new model, named InstanceRefer, that achieves superior 3D visual grounding by unifying instance attribute, relation, and localization perceptions. In practice, based on the target category predicted from the natural-language query, our model first filters instances from panoptic segmentation of the point cloud to obtain a small number of candidates. Such instance-level candidates are more effective and rational than redundant 3D object-proposal candidates. Then, for each candidate, we conduct cooperative holistic scene-language understanding, i.e., multi-level contextual referring from instance attribute perception, instance-to-instance relation perception, and instance-to-background global localization perception. Finally, the most relevant candidate is localized through adaptive confidence fusion. Experiments confirm that InstanceRefer outperforms previous state-of-the-art methods by a large margin, i.e., a 9.5% improvement on the ScanRefer benchmark (ranked 1st place) and a 7.2% improvement on Sr3D.
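The pipeline described in the abstract (category-based candidate filtering, three perception branches, and adaptive confidence fusion) can be illustrated with a minimal PyTorch-style sketch. Everything below is an assumption for illustration: the module name `CandidateScorer`, the feature sizes, and the softmax-weighted fusion rule are placeholders, not the authors' released implementation.

```python
# Hedged sketch of an InstanceRefer-style pipeline as described in the abstract.
# Module names, dimensions, and the fusion rule are illustrative assumptions.
import torch
import torch.nn as nn


class CandidateScorer(nn.Module):
    """Scores instance candidates with three perception branches and fuses them."""

    def __init__(self, feat_dim=128, lang_dim=256):
        super().__init__()
        # One small MLP per perception level (attribute, relation, global localization).
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim + lang_dim, 128), nn.ReLU(), nn.Linear(128, 1))
            for _ in range(3)
        )
        # Adaptive confidence fusion (assumed form): predict a weight per branch
        # from the language feature and softmax-normalize it.
        self.fusion = nn.Sequential(nn.Linear(lang_dim, 3), nn.Softmax(dim=-1))

    def forward(self, cand_feats, lang_feat):
        # cand_feats: (K, 3, feat_dim) -- per-candidate features for the three branches.
        # lang_feat:  (lang_dim,)      -- sentence embedding of the referring expression.
        K = cand_feats.shape[0]
        lang = lang_feat.expand(K, -1)
        scores = torch.cat(
            [branch(torch.cat([cand_feats[:, i], lang], dim=-1))
             for i, branch in enumerate(self.branches)],
            dim=-1,
        )                                   # (K, 3) branch-wise matching scores
        weights = self.fusion(lang_feat)    # (3,) adaptive confidence per branch
        return (scores * weights).sum(-1)   # (K,) fused score per candidate


def filter_candidates(instance_labels, target_category):
    """Keep only instances whose panoptic class matches the category predicted from language."""
    return [i for i, lab in enumerate(instance_labels) if lab == target_category]


# Toy usage: 5 segmented instances, 2 of which match the predicted category "chair".
labels = ["chair", "table", "chair", "sofa", "door"]
keep = filter_candidates(labels, "chair")
scorer = CandidateScorer()
cand_feats = torch.randn(len(keep), 3, 128)
lang_feat = torch.randn(256)
best = keep[scorer(cand_feats, lang_feat).argmax().item()]
print("grounded instance index:", best)
```

In the method itself, each branch consumes a different context (per-instance attributes, neighboring instances, and the instance-to-background scene layout); here all three inputs are stand-ins of the same shape to keep the sketch compact.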
Related papers
- Instance-free Text to Point Cloud Localization with Relative Position Awareness [37.22900045434484]
Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration.
We address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances.
Our proposed model follows a two-stage pipeline, including a coarse stage for text-cell retrieval and a fine stage for position estimation.
arXiv Detail & Related papers (2024-04-27T09:46:49Z)
- Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding [58.924180772480504]
3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
arXiv Detail & Related papers (2023-07-18T13:49:49Z)
- A Unified BEV Model for Joint Learning of 3D Local Features and Overlap Estimation [12.499361832561634]
We present a unified bird's-eye view (BEV) model for jointly learning 3D local features and overlap estimation.
Our method significantly outperforms existing methods on overlap prediction, especially in scenes with small overlaps.
arXiv Detail & Related papers (2023-02-28T12:01:16Z)
- Not All Instances Contribute Equally: Instance-adaptive Class Representation Learning for Few-Shot Visual Recognition [94.04041301504567]
Few-shot visual recognition refers to recognizing novel visual concepts from a few labeled instances.
We propose a novel metric-based meta-learning framework termed instance-adaptive class representation learning network (ICRL-Net) for few-shot visual recognition.
arXiv Detail & Related papers (2022-09-07T10:00:18Z)
- ProposalContrast: Unsupervised Pre-training for LiDAR-based 3D Object Detection [114.54835359657707]
ProposalContrast is an unsupervised point cloud pre-training framework.
It learns robust 3D representations by contrasting region proposals (a generic contrastive-loss sketch in this spirit follows the list below).
ProposalContrast is verified on various 3D detectors.
arXiv Detail & Related papers (2022-07-26T04:45:49Z)
- UniVIP: A Unified Framework for Self-Supervised Visual Pre-training [50.87603616476038]
We propose a novel self-supervised framework to learn versatile visual representations on either single-centric-object or non-iconic datasets.
Massive experiments show that UniVIP pre-trained on the non-iconic COCO dataset achieves state-of-the-art transfer performance.
Our method can also exploit single-centric-object datasets such as ImageNet, outperforming BYOL by 2.5% in linear probing with the same number of pre-training epochs.
arXiv Detail & Related papers (2022-03-14T10:04:04Z)
- Relation-aware Instance Refinement for Weakly Supervised Visual Grounding [44.33411132188231]
Visual grounding aims to build a correspondence between visual objects and their language entities.
We propose a novel weakly-supervised learning method that incorporates coarse-to-fine object refinement and entity relation modeling.
Experiments on two public benchmarks demonstrate the efficacy of our framework.
arXiv Detail & Related papers (2021-03-24T05:03:54Z)
- Global-Local Bidirectional Reasoning for Unsupervised Representation Learning of 3D Point Clouds [109.0016923028653]
We learn point cloud representations by bidirectional reasoning between local structures and the global shape, without human supervision.
We show that our unsupervised model surpasses the state-of-the-art supervised methods on both synthetic and real-world 3D object classification datasets.
arXiv Detail & Related papers (2020-03-29T08:26:08Z)
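For the ProposalContrast entry above, the core idea of learning representations by contrasting region proposals can be illustrated with a generic InfoNCE objective. This is a hedged sketch: the two-view setup, the function name `proposal_info_nce`, and the temperature value are assumptions for illustration, not the paper's exact formulation.

```python
# Generic InfoNCE loss over region-proposal embeddings (illustrative, not the
# ProposalContrast paper's exact objective). Assumes a point-cloud backbone has
# already embedded the same N proposals under two augmented views.
import torch
import torch.nn.functional as F


def proposal_info_nce(z1, z2, temperature=0.1):
    """z1, z2: (N, D) embeddings; row i of each tensor comes from the same proposal."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature      # (N, N) similarity matrix
    targets = torch.arange(z1.size(0))      # the matching proposal is the positive
    # Symmetric cross-entropy: each view must retrieve its counterpart in the other.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Toy usage with random stand-ins for backbone features of 8 proposals.
z1 = torch.randn(8, 64, requires_grad=True)
z2 = torch.randn(8, 64, requires_grad=True)
loss = proposal_info_nce(z1, z2)
loss.backward()
print("contrastive loss:", round(loss.item(), 4))
```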
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.