TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D
Visual Grounding
- URL: http://arxiv.org/abs/2108.02388v1
- Date: Thu, 5 Aug 2021 05:47:12 GMT
- Title: TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D
Visual Grounding
- Authors: Dailan He, Yusheng Zhao, Junyu Luo, Tianrui Hui, Shaofei Huang, Aixi
Zhang, Si Liu
- Abstract summary: We exploit the Transformer for its natural suitability to permutation-invariant 3D point cloud data.
We propose a TransRefer3D network to extract entity-and-relation aware multimodal context.
Our proposed model significantly outperforms existing approaches by up to 10.6%.
- Score: 15.617150859765024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently proposed fine-grained 3D visual grounding is an essential and
challenging task whose goal is to identify the 3D object referred to by a natural
language sentence among other distracting objects of the same category. Existing
works usually adopt dynamic graph networks to indirectly model the
intra/inter-modal interactions, making it difficult for the model to distinguish the
referred object from distractors due to the monolithic representations of
visual and linguistic contents. In this work, we exploit the Transformer for its
natural suitability to permutation-invariant 3D point cloud data and propose a
TransRefer3D network to extract entity-and-relation aware multimodal context
among objects for more discriminative feature learning. Concretely, we devise
an Entity-aware Attention (EA) module and a Relation-aware Attention (RA)
module to conduct fine-grained cross-modal feature matching. Facilitated by
the co-attention operation, our EA module matches visual entity features with
linguistic entity features, while our RA module matches pair-wise visual relation
features with linguistic relation features. We further integrate the
EA and RA modules into an Entity-and-Relation aware Contextual Block (ERCB) and
stack several ERCBs to form our TransRefer3D for hierarchical multimodal
context modeling. Extensive experiments on both the Nr3D and Sr3D datasets
demonstrate that our proposed model significantly outperforms existing
approaches by up to 10.6% and sets a new state of the art. To the best of
our knowledge, this is the first work investigating a Transformer architecture
for the fine-grained 3D visual grounding task.
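The abstract describes the core mechanism as cross-modal matching via co-attention: per-object visual features are matched against per-token linguistic features in the EA module (and pair-wise relation features analogously in the RA module). The sketch below illustrates only that entity-level cross-attention step; the class name, feature dimensions, and the residual/normalization choices are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of entity-aware cross-modal attention as described in the abstract.
# All names and dimensions are illustrative assumptions, not the paper's actual code.
import torch
import torch.nn as nn


class EntityAwareAttention(nn.Module):
    """Match visual entity features against linguistic entity features with
    standard multi-head cross-attention (queries from vision, keys/values
    from language)."""

    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, visual_feats: torch.Tensor, lang_feats: torch.Tensor) -> torch.Tensor:
        # visual_feats: (B, N_objects, d_model) per-object point-cloud features
        # lang_feats:   (B, N_tokens,  d_model) per-token sentence features
        attended, _ = self.cross_attn(query=visual_feats, key=lang_feats, value=lang_feats)
        return self.norm(visual_feats + attended)  # residual connection + layer norm


if __name__ == "__main__":
    ea = EntityAwareAttention()
    vis = torch.randn(2, 52, 256)   # e.g. 52 candidate objects per scene
    txt = torch.randn(2, 24, 256)   # e.g. 24 sentence tokens
    print(ea(vis, txt).shape)       # torch.Size([2, 52, 256])
```

In the paper, such entity-level and relation-level matching is composed into an ERCB and several ERCBs are stacked for hierarchical multimodal context modeling; the sketch above covers only the single entity-level matching operation.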
Related papers
- Boosting Cross-Domain Point Classification via Distilling Relational Priors from 2D Transformers [59.0181939916084]
Traditional 3D networks mainly focus on local geometric details and ignore the topological structure between local geometries.
We propose a novel Relational Priors Distillation (RPD) method to extract relational priors from transformers that are well-trained on massive images.
Experiments on the PointDA-10 and Sim-to-Real datasets verify that the proposed method consistently achieves state-of-the-art UDA performance for point cloud classification.
arXiv Detail & Related papers (2024-07-26T06:29:09Z)
- RefMask3D: Language-Guided Transformer for 3D Referring Segmentation [32.11635464720755]
RefMask3D aims to explore comprehensive multi-modal feature interaction and understanding.
RefMask3D outperforms the previous state-of-the-art method by a large margin of 3.16% mIoU on the challenging ScanRefer dataset.
arXiv Detail & Related papers (2024-07-25T17:58:03Z)
- Enhancing Generalizability of Representation Learning for Data-Efficient 3D Scene Understanding [50.448520056844885]
We propose a generative Bayesian network to produce diverse synthetic scenes with real-world patterns.
A series of experiments consistently demonstrates our method's superiority over existing state-of-the-art pre-training approaches.
arXiv Detail & Related papers (2024-06-17T07:43:53Z)
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces the original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- SM$^3$: Self-Supervised Multi-task Modeling with Multi-view 2D Images for Articulated Objects [24.737865259695006]
We propose a self-supervised interaction perception method, referred to as SM$^3$, to model articulated objects.
By constructing 3D geometries and textures from the captured 2D images, SM$^3$ achieves integrated optimization of movable part and joint parameters.
Evaluations demonstrate that SM$^3$ surpasses existing benchmarks across various categories and objects, while its adaptability in real-world scenarios has been thoroughly validated.
arXiv Detail & Related papers (2024-01-17T11:15:09Z)
- Four Ways to Improve Verbo-visual Fusion for Dense 3D Visual Grounding [56.00186960144545]
3D visual grounding is the task of localizing the object in a 3D scene that is referred to by a natural language description.
We propose a dense 3D grounding network, featuring four novel stand-alone modules that aim to improve grounding performance.
arXiv Detail & Related papers (2023-09-08T19:27:01Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
- Exploring Optical-Flow-Guided Motion and Detection-Based Appearance for Temporal Sentence Grounding [61.57847727651068]
Temporal sentence grounding aims to semantically localize a target segment in an untrimmed video according to a given sentence query.
Most previous works focus on learning frame-level features of each whole frame in the entire video and directly match them with the textual information.
We propose a novel Motion- and Appearance-guided 3D Semantic Reasoning Network (MA3SRN), which incorporates optical-flow-guided motion-aware, detection-based appearance-aware, and 3D-aware object-level features.
arXiv Detail & Related papers (2022-03-06T13:57:09Z)
- The Devil is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection [62.1185839286255]
Low-cost monocular 3D object detection plays a fundamental role in autonomous driving.
We introduce a Dynamic Feature Reflecting Network, named DFR-Net.
We rank 1st among all monocular 3D object detectors on the KITTI test set.
arXiv Detail & Related papers (2021-12-28T07:31:18Z)