3D-STMN: Dependency-Driven Superpoint-Text Matching Network for
End-to-End 3D Referring Expression Segmentation
- URL: http://arxiv.org/abs/2308.16632v1
- Date: Thu, 31 Aug 2023 11:00:03 GMT
- Title: 3D-STMN: Dependency-Driven Superpoint-Text Matching Network for
End-to-End 3D Referring Expression Segmentation
- Authors: Changli Wu, Yiwei Ma, Qi Chen, Haowei Wang, Gen Luo, Jiayi Ji,
Xiaoshuai Sun
- Abstract summary: In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts a two-stage paradigm, extracting segmentation proposals and then matching them with referring expressions.
We introduce an innovative end-to-end Superpoint-Text Matching Network (3D-STMN) that is enriched by dependency-driven insights.
Our model not only sets new performance standards, registering an mIoU gain of 11.7 points, but also achieves a substantial improvement in inference speed, surpassing traditional methods by a factor of 95.7.
- Score: 33.20461146674787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In 3D Referring Expression Segmentation (3D-RES), the earlier approach adopts
a two-stage paradigm, extracting segmentation proposals and then matching them
with referring expressions. However, this conventional paradigm encounters
significant challenges, most notably in terms of the generation of lackluster
initial proposals and a pronounced deceleration in inference speed. Recognizing
these limitations, we introduce an innovative end-to-end Superpoint-Text
Matching Network (3D-STMN) that is enriched by dependency-driven insights. One
of the keystones of our model is the Superpoint-Text Matching (STM) mechanism.
Unlike traditional methods that navigate through instance proposals, STM
directly correlates linguistic indications with their respective superpoints,
clusters of semantically related points. This architectural decision empowers
our model to efficiently harness cross-modal semantic relationships, primarily
leveraging densely annotated superpoint-text pairs, as opposed to the more
sparse instance-text pairs. In pursuit of enhancing the role of text in guiding
the segmentation process, we further incorporate the Dependency-Driven
Interaction (DDI) module to deepen the network's semantic comprehension of
referring expressions. Using the dependency trees as a beacon, this module
discerns the intricate relationships between primary terms and their associated
descriptors in expressions, thereby elevating both the localization and
segmentation capacities of our model. Comprehensive experiments on the
ScanRefer benchmark reveal that our model not only sets new performance
standards, registering an mIoU gain of 11.7 points, but also achieves a
substantial improvement in inference speed, surpassing traditional methods by
a factor of 95.7. The code and models are available at
https://github.com/sosppxo/3D-STMN.
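The superpoint-text matching idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the mean pooling, the dot-product score, and the fixed threshold are all assumptions chosen for illustration, and the actual model uses learned features and the DDI module. The sketch only shows the general flow of pooling point features into superpoints, scoring each superpoint against a text embedding, and projecting the selected superpoints back to a per-point mask.

```python
from collections import defaultdict


def pool_superpoints(point_feats, superpoint_ids):
    """Average the point features inside each superpoint (assumed pooling)."""
    sums = {}
    counts = defaultdict(int)
    for feat, sp in zip(point_feats, superpoint_ids):
        if sp not in sums:
            sums[sp] = [0.0] * len(feat)
        sums[sp] = [s + f for s, f in zip(sums[sp], feat)]
        counts[sp] += 1
    return {sp: [s / counts[sp] for s in vec] for sp, vec in sums.items()}


def match_superpoints(sp_feats, text_query, threshold=0.0):
    """Select superpoints whose dot product with the text query exceeds threshold."""
    selected = {}
    for sp, feat in sp_feats.items():
        score = sum(a * b for a, b in zip(feat, text_query))
        selected[sp] = score > threshold
    return selected


def point_mask(superpoint_ids, selected):
    """Project the superpoint-level decision back to every point."""
    return [selected[sp] for sp in superpoint_ids]


def iou(pred, gt):
    """Intersection over union of two boolean point masks."""
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    return inter / union if union else 1.0


# Toy scene: six points with 2-D features, grouped into three superpoints.
point_feats = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8], [0.9, 0.1], [1.0, 0.0]]
superpoint_ids = [0, 0, 1, 1, 2, 2]
text_query = [1.0, -1.0]  # hypothetical embedding of the referring expression

sp_feats = pool_superpoints(point_feats, superpoint_ids)
selected = match_superpoints(sp_feats, text_query)
mask = point_mask(superpoint_ids, selected)

gt = [True, True, False, False, True, False]  # made-up ground-truth mask
print(mask, iou(mask, gt))
```

Because the decision is made once per superpoint rather than once per instance proposal, no separate proposal stage is needed in this sketch, which mirrors the end-to-end motivation given in the abstract.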
Related papers
- LESS: Label-Efficient and Single-Stage Referring 3D Segmentation [55.06002976797879]
Referring 3D segmentation is a vision-language task that segments all points of the object described by a query sentence from a 3D point cloud.
We propose a novel Referring 3D pipeline, Label-Efficient and Single-Stage, dubbed LESS, which is supervised only by efficient binary masks.
We achieve state-of-the-art performance on the ScanRefer dataset, surpassing previous methods by about 3.7% mIoU using only binary labels.
arXiv Detail & Related papers (2024-10-17T07:47:41Z)
- SegPoint: Segment Any Point Cloud via Large Language Model [62.69797122055389]
We propose a model, called SegPoint, to produce point-wise segmentation masks across a diverse range of tasks.
SegPoint is the first model to address varied segmentation tasks within a single framework.
arXiv Detail & Related papers (2024-07-18T17:58:03Z)
- SeCG: Semantic-Enhanced 3D Visual Grounding via Cross-modal Graph Attention [19.23636231942245]
We propose a semantic-enhanced relational learning model based on a graph network with our designed memory graph attention layer.
Our method replaces original language-independent encoding with cross-modal encoding in visual analysis.
Experimental results on ReferIt3D and ScanRefer benchmarks show that the proposed method outperforms the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-13T02:11:04Z)
- SAI3D: Segment Any Instance in 3D Scenes [68.57002591841034]
We introduce SAI3D, a novel zero-shot 3D instance segmentation approach.
Our method partitions a 3D scene into geometric primitives, which are then progressively merged into 3D instance segmentations.
Empirical evaluations on ScanNet, Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach.
arXiv Detail & Related papers (2023-12-17T09:05:47Z)
- Coherent Entity Disambiguation via Modeling Topic and Categorical Dependency [87.16283281290053]
Previous entity disambiguation (ED) methods adopt a discriminative paradigm, where prediction is made based on matching scores between mention context and candidate entities.
We propose CoherentED, an ED system equipped with novel designs aimed at enhancing the coherence of entity predictions.
We achieve new state-of-the-art results on popular ED benchmarks, with an average improvement of 1.3 F1 points.
arXiv Detail & Related papers (2023-11-06T16:40:13Z)
- IDRNet: Intervention-Driven Relation Network for Semantic Segmentation [34.09179171102469]
Co-occurrent visual patterns suggest that pixel relation modeling facilitates dense prediction tasks.
Despite the impressive results, existing paradigms often suffer from inadequate or ineffective contextual information aggregation.
We propose a novel Intervention-Driven Relation Network (IDRNet).
arXiv Detail & Related papers (2023-10-16T18:37:33Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- Superpoint Transformer for 3D Scene Instance Segmentation [7.07321040534471]
This paper proposes a novel end-to-end 3D instance segmentation method based on Superpoint Transformer, named SPFormer.
It groups potential features from point clouds into superpoints, and directly predicts instances through query vectors.
It exceeds compared state-of-the-art methods by 4.3% mAP on the ScanNetv2 hidden test set while maintaining fast inference (247 ms per frame).
arXiv Detail & Related papers (2022-11-28T20:52:53Z)
- 3D-QueryIS: A Query-based Framework for 3D Instance Segmentation [74.6998931386331]
Previous methods for 3D instance segmentation often rely on inter-task dependencies and tend to lack robustness.
We propose a novel query-based method, termed 3D-QueryIS, which is detector-free, semantic segmentation-free, and cluster-free.
Our 3D-QueryIS is free from the accumulated errors caused by the inter-task dependencies.
arXiv Detail & Related papers (2022-11-17T07:04:53Z)
- Detecting Human-Object Interactions with Object-Guided Cross-Modal Calibrated Semantics [6.678312249123534]
We aim to boost end-to-end models with object-guided statistical priors.
We propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy.
Combined, the above modules compose the Object-guided Cross-modal Network (OCN).
arXiv Detail & Related papers (2022-02-01T07:39:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.