Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly
Supervised 3D Visual Grounding
- URL: http://arxiv.org/abs/2307.09267v1
- Date: Tue, 18 Jul 2023 13:49:49 GMT
- Title: Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly
Supervised 3D Visual Grounding
- Authors: Zehan Wang, Haifeng Huang, Yang Zhao, Linjun Li, Xize Cheng, Yichen
Zhu, Aoxiong Yin, Zhou Zhao
- Abstract summary: 3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query.
We propose to leverage weakly supervised annotations to learn the 3D visual grounding model.
We design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner.
- Score: 58.924180772480504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: 3D visual grounding involves finding a target object in a 3D scene that
corresponds to a given sentence query. Although many approaches have been
proposed and achieved impressive performance, they all require dense
object-sentence pair annotations in 3D point clouds, which are both
time-consuming and expensive. To address the problem that fine-grained
annotated data is difficult to obtain, we propose to leverage weakly supervised
annotations to learn the 3D visual grounding model, i.e., only coarse
scene-sentence correspondences are used to learn object-sentence links. To
accomplish this, we design a novel semantic matching model that analyzes the
semantic similarity between object proposals and sentences in a coarse-to-fine
manner. Specifically, we first extract object proposals and coarsely select the
top-K candidates based on feature and class similarity matrices. Next, we
reconstruct the masked keywords of the sentence using each candidate one by
one, and the reconstruction accuracy finely reflects the semantic similarity of
each candidate to the query. Additionally, we distill the coarse-to-fine
semantic matching knowledge into a typical two-stage 3D visual grounding model,
which reduces inference costs and improves performance by taking full advantage
of the well-studied structure of the existing architectures. We conduct
extensive experiments on ScanRefer, Nr3D, and Sr3D, which demonstrate the
effectiveness of our proposed method.
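
To make the coarse-to-fine pipeline concrete, below is a minimal sketch written from the abstract alone; it is not the authors' released code. The fusion of the two similarities (a plain sum), the decoder interface, and all tensor and function names are assumptions for illustration.

```python
# Minimal sketch of coarse-to-fine semantic matching, reconstructed from the
# abstract. Similarity fusion, the decoder interface, and all names are
# assumptions, not the authors' implementation.
import torch
import torch.nn.functional as F

def coarse_select(obj_feats, obj_cls_logits, sent_feat, sent_cls_logits, k=8):
    """Coarse stage: rank proposals by feature and class similarity, keep top-K.

    obj_feats:       [N, d] proposal features
    obj_cls_logits:  [N, C] per-proposal class logits
    sent_feat:       [d]    global sentence feature
    sent_cls_logits: [C]    class logits implied by the query
    """
    feat_sim = F.cosine_similarity(obj_feats, sent_feat.unsqueeze(0), dim=-1)  # [N]
    cls_sim = F.softmax(obj_cls_logits, dim=-1) @ F.softmax(sent_cls_logits, dim=-1)  # [N]
    score = feat_sim + cls_sim  # assumed fusion: a simple sum
    return torch.topk(score, k=min(k, obj_feats.size(0))).indices

def fine_score(cand_feats, masked_tokens, keyword_pos, keyword_ids, decoder):
    """Fine stage: condition a decoder on each candidate and score how well it
    reconstructs the masked keywords; higher log-likelihood = better match.

    cand_feats:    [K, d] features of the top-K candidates
    masked_tokens: [L]    query token ids with keywords masked out
    keyword_pos:   [M]    positions of the masked keywords
    keyword_ids:   [M]    ground-truth ids of the masked keywords
    decoder:       any module mapping (cand_feat, masked_tokens) -> [L, V] logits
    """
    scores = []
    for feat in cand_feats:  # one reconstruction pass per candidate
        logits = decoder(feat, masked_tokens)              # [L, V]
        logp = F.log_softmax(logits[keyword_pos], dim=-1)  # [M, V]
        scores.append(logp.gather(-1, keyword_ids.unsqueeze(-1)).mean())
    return torch.stack(scores)  # [K]; the argmax is the matched proposal
```

Per the abstract, the per-candidate scores produced this way act as soft object-sentence alignment signals that are distilled into a standard two-stage grounding model, so the per-candidate reconstruction loop is not needed at inference time.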
Related papers
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments demonstrates our method's consistent superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding [4.447173454116189]
3D visual grounding aims to find the object within point clouds mentioned by free-form natural language descriptions with rich semantic cues.
We present EDA, which Explicitly Decouples the textual attributes in a sentence.
We further introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity.
arXiv Detail & Related papers (2022-09-29T17:00:22Z)
- Point2Seq: Detecting 3D Objects as Sequences [58.63662049729309]
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds.
We view each 3D object as a sequence of words and reformulate the 3D object detection task as decoding words from 3D scenes in an auto-regressive manner.
arXiv Detail & Related papers (2022-03-25T00:20:31Z)
- RandomRooms: Unsupervised Pre-training from Synthetic Shapes and Randomized Layouts for 3D Object Detection [138.2892824662943]
A promising solution is to make better use of synthetic datasets, which consist of CAD object models, to boost learning on real datasets.
Recent work on 3D pre-training fails when transferring features learned from synthetic objects to other real-world applications.
In this work, we put forward a new method called RandomRooms to accomplish this objective.
arXiv Detail & Related papers (2021-08-17T17:56:12Z)
- SESS: Self-Ensembling Semi-Supervised 3D Object Detection [138.80825169240302]
We propose SESS, a self-ensembling semi-supervised 3D object detection framework.
Specifically, we design a thorough perturbation scheme to enhance generalization of the network on unlabeled and new unseen data.
Our SESS achieves competitive performance compared to the state-of-the-art fully-supervised method by using only 50% labeled data.
arXiv Detail & Related papers (2019-12-26T08:48:04Z)