Free-form Description Guided 3D Visual Graph Network for Object
Grounding in Point Cloud
- URL: http://arxiv.org/abs/2103.16381v1
- Date: Tue, 30 Mar 2021 14:22:36 GMT
- Title: Free-form Description Guided 3D Visual Graph Network for Object
Grounding in Point Cloud
- Authors: Mingtao Feng, Zhen Li, Qi Li, Liang Zhang, XiangDong Zhang, Guangming
Zhu, Hui Zhang, Yaonan Wang and Ajmal Mian
- Abstract summary: 3D object grounding aims to locate the most relevant target object in a raw point cloud scene based on a free-form language description.
Firstly, we propose a language scene graph module to capture the rich structure and long-distance phrase correlations.
Secondly, we introduce a multi-level 3D proposal relation graph module to extract the object-object and object-scene co-occurrence relationships.
- Score: 39.055928838826226
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: 3D object grounding aims to locate the most relevant target object in a raw
point cloud scene based on a free-form language description. Understanding
complex and diverse descriptions, and lifting them directly to a point cloud is
a new and challenging topic due to the irregular and sparse nature of point
clouds. There are three main challenges in 3D object grounding: to find the
main focus in the complex and diverse description; to understand the point
cloud scene; and to locate the target object. In this paper, we address all
three challenges. Firstly, we propose a language scene graph module to capture
the rich structure and long-distance phrase correlations. Secondly, we
introduce a multi-level 3D proposal relation graph module to extract the
object-object and object-scene co-occurrence relationships, and strengthen the
visual features of the initial proposals. Lastly, we develop a description
guided 3D visual graph module to encode global contexts of phrases and
proposals via a node matching strategy. Extensive experiments on challenging
benchmark datasets (ScanRefer and Nr3D) show that our algorithm outperforms the
existing state-of-the-art methods. Our code is available at
https://github.com/PNXD/FFL-3DOG.
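Although the official implementation is linked above, the overall pipeline is easy to picture. The following PyTorch sketch shows the general pattern the abstract describes: message passing over a graph of phrase nodes and a graph of 3D proposal nodes, followed by a cross-graph node matching step that scores each proposal against the description. Every module name, dimension, and the attention-based matcher here is an illustrative assumption of ours, not the authors' released code.

```python
# Hedged sketch of the description guided grounding idea, assuming PyTorch.
# Module names, sizes, and the attention-based matching are illustrative
# assumptions; the official implementation lives at
# https://github.com/PNXD/FFL-3DOG.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphMessagePassing(nn.Module):
    """One round of message passing over a fully connected node graph."""

    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Linear(2 * dim, dim)
        self.update_mlp = nn.Linear(2 * dim, dim)

    def forward(self, nodes):                          # nodes: (N, dim)
        n = nodes.size(0)
        # Pairwise edge features between every ordered pair of nodes.
        src = nodes.unsqueeze(1).expand(n, n, -1)
        dst = nodes.unsqueeze(0).expand(n, n, -1)
        edges = F.relu(self.edge_mlp(torch.cat([src, dst], dim=-1)))
        agg = edges.mean(dim=1)                        # aggregate incoming messages
        return F.relu(self.update_mlp(torch.cat([nodes, agg], dim=-1)))


class PhraseProposalMatcher(nn.Module):
    """Scores 3D proposals against phrase nodes via cross-graph attention."""

    def __init__(self, dim):
        super().__init__()
        self.lang_graph = GraphMessagePassing(dim)     # language scene graph
        self.prop_graph = GraphMessagePassing(dim)     # 3D proposal relation graph
        self.score_head = nn.Linear(dim, 1)

    def forward(self, phrase_feats, proposal_feats):
        # phrase_feats: (P, dim) phrase nodes; proposal_feats: (M, dim) proposals
        phrases = self.lang_graph(phrase_feats)
        props = self.prop_graph(proposal_feats)
        # Node matching: each proposal attends to all phrase nodes.
        attn = torch.softmax(props @ phrases.t() / phrases.size(-1) ** 0.5, dim=-1)
        fused = props + attn @ phrases                 # inject language context
        return self.score_head(fused).squeeze(-1)      # grounding score per proposal


matcher = PhraseProposalMatcher(dim=128)
scores = matcher(torch.randn(6, 128), torch.randn(32, 128))
target = scores.argmax()                               # index of the grounded proposal
```

In the paper itself, the phrase nodes would come from parsing the free-form description and the proposal nodes from a 3D detector over the point cloud; this sketch fakes both with random features to keep the matching step self-contained.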
Related papers
- Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships [15.513180297629546]
We present Open3DSG, an alternative approach to learn 3D scene graph prediction in an open world without requiring labeled scene graph data.
We co-embed the features from a 3D scene graph prediction backbone with the feature space of powerful open world 2D vision language foundation models.
arXiv Detail & Related papers (2024-02-19T16:15:03Z)
- TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes [67.5351491691866]
We present a novel framework, dubbed TeMO, to parse multi-object 3D scenes and edit their styles.
Our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes.
arXiv Detail & Related papers (2023-12-07T12:10:05Z)
- 3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding [58.924180772480504]
3D visual grounding aims to localize the target object in a 3D point cloud by a free-form language description.
We propose a relation-aware one-stage framework, named 3D Relative Position-aware Network (3DRP-Net).
arXiv Detail & Related papers (2023-07-25T09:33:25Z)
- Joint Representation Learning for Text and 3D Point Cloud [35.67281936143821]
We propose a novel Text4Point framework to construct language-guided 3D point cloud models.
The proposed Text4Point follows the pre-training and fine-tuning paradigm.
Our model shows consistent improvement on various downstream tasks, such as point cloud semantic segmentation, instance segmentation, and object detection.
arXiv Detail & Related papers (2023-01-18T15:02:07Z)
- Contextual Modeling for 3D Dense Captioning on Point Clouds [85.68339840274857]
3D dense captioning, as an emerging vision-language task, aims to identify and locate each object from a set of point clouds.
We propose two separate modules, namely the Global Context Modeling (GCM) and Local Context Modeling (LCM), in a coarse-to-fine manner.
Our proposed model can effectively characterize the object representations and contextual information.
arXiv Detail & Related papers (2022-10-08T05:33:00Z)
- EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding [4.447173454116189]
3D visual grounding aims to find the object within point clouds mentioned by free-form natural language descriptions with rich semantic cues.
We present EDA that Explicitly Decouples the textual attributes in a sentence.
We further introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity.
arXiv Detail & Related papers (2022-09-29T17:00:22Z)
- MORE: Multi-Order RElation Mining for Dense Captioning in 3D Scenes [89.75025195440287]
Existing methods treat object relations only as by-products of object feature learning in graphs, without specifically encoding them.
We propose MORE, a Multi-Order RElation mining model, to support generating more descriptive and comprehensive captions.
Our MORE encodes object relations in a progressive manner since complex relations can be deduced from a limited number of basic ones.
arXiv Detail & Related papers (2022-03-10T07:26:15Z)
- Single Image 3D Object Estimation with Primitive Graph Networks [30.315124364682994]
Reconstructing 3D object from a single image is a fundamental problem in visual scene understanding.
We propose a two-stage graph network for primitive-based 3D object estimation.
We train the entire graph neural network in a stage-wise strategy and evaluate it on three benchmarks: Pix3D, ModelNet and NYU Depth V2.
arXiv Detail & Related papers (2021-09-09T10:28:37Z)
- Learning 3D Semantic Scene Graphs from 3D Indoor Reconstructions [94.17683799712397]
We focus on scene graphs, a data structure that organizes the entities of a scene in a graph.
We propose a learned method that regresses a scene graph from the point cloud of a scene.
We show the application of our method in a domain-agnostic retrieval task, where graphs serve as an intermediate representation for 3D-3D and 2D-3D matching.
arXiv Detail & Related papers (2020-04-08T12:25:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.