Instance-free Text to Point Cloud Localization with Relative Position Awareness
- URL: http://arxiv.org/abs/2404.17845v1
- Date: Sat, 27 Apr 2024 09:46:49 GMT
- Title: Instance-free Text to Point Cloud Localization with Relative Position Awareness
- Authors: Lichao Wang, Zhihao Yuan, Jinke Ren, Shuguang Cui, Zhen Li
- Abstract summary: Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration.
We address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances.
Our proposed model follows a two-stage pipeline, including a coarse stage for text-cell retrieval and a fine stage for position estimation.
- Score: 37.22900045434484
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration. It seeks to localize a position from a city-scale point cloud scene based on a few natural language instructions. In this paper, we address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances. Our proposed model follows a two-stage pipeline, including a coarse stage for text-cell retrieval and a fine stage for position estimation. In both stages, we introduce an instance query extractor, in which the cells are encoded by a 3D sparse convolution U-Net to generate the multi-scale point cloud features, and a set of queries iteratively attend to these features to represent instances. In the coarse stage, a row-column relative position-aware self-attention (RowColRPA) module is designed to capture the spatial relations among the instance queries. In the fine stage, a multi-modal relative position-aware cross-attention (RPCA) module is developed to fuse the text and point cloud features along with spatial relations for improving fine position estimation. Experiment results on the KITTI360Pose dataset demonstrate that our model achieves competitive performance with the state-of-the-art models without taking ground-truth instances as input.
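The abstract describes a coarse text-cell retrieval stage and attention over instance queries that accounts for their relative positions. The following is a minimal sketch of those two ideas, not the authors' implementation. It assumes cosine similarity for retrieval and a simple distance penalty on attention logits as a stand-in for the learned RowColRPA and RPCA modules, whose exact form is not given here. The names `rank_cells`, `relative_position_attention`, and `distance_weight` are hypothetical.

```python
import math


def rank_cells(text_emb, cell_embs, k):
    """Coarse stage (assumed form): rank cells by cosine similarity to the text."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm > 0 else 0.0

    scores = [cosine(text_emb, cell) for cell in cell_embs]
    order = sorted(range(len(cell_embs)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def relative_position_attention(features, centers, distance_weight=1.0):
    """Self-attention over instance queries with a relative-position bias.

    Assumption: the bias is -distance_weight * Euclidean distance between
    query centers. The paper's modules learn this relation instead.
    """
    dim = len(features[0])
    outputs = []
    for i, query in enumerate(features):
        logits = []
        for j, key in enumerate(features):
            content = sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            bias = -distance_weight * math.dist(centers[i], centers[j])
            logits.append(content + bias)
        weights = softmax(logits)
        outputs.append([
            sum(weights[j] * features[j][c] for j in range(len(features)))
            for c in range(dim)
        ])
    return outputs


# Hypothetical toy data: three cell embeddings and one text embedding.
cells = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
top_cells = rank_cells([1.0, 0.1], cells, k=2)

# Two instance queries whose centers are far apart barely attend to each other.
fused = relative_position_attention(
    [[1.0, 0.0], [0.0, 1.0]], [(0.0, 0.0), (100.0, 0.0)]
)
```

With a large distance between centers, each query's output stays close to its own feature, which is the intended effect of a position-dependent bias: nearby instances influence each other more than distant ones.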
Related papers
- Coupled Laplacian Eigenmaps for Locally-Aware 3D Rigid Point Cloud Matching [0.0]
We propose a new technique, based on graph Laplacian eigenmaps, to match point clouds by taking into account fine local structures.
To deal with the order and sign ambiguity of Laplacian eigenmaps, we introduce a new operator, called Coupled Laplacian.
We show that the similarity between those aligned high-dimensional spaces provides a locally meaningful score to match shapes.
arXiv Detail & Related papers (2024-02-27T10:10:12Z)
- EipFormer: Emphasizing Instance Positions in 3D Instance Segmentation [51.996943482875366]
We present a novel Transformer-based architecture, EipFormer, which comprises progressive aggregation and dual position embedding.
EipFormer achieves superior or comparable performance compared to state-of-the-art approaches.
arXiv Detail & Related papers (2023-12-09T16:08:47Z)
- Collect-and-Distribute Transformer for 3D Point Cloud Analysis [82.03517861433849]
We propose a new transformer network equipped with a collect-and-distribute mechanism to communicate short- and long-range contexts of point clouds.
Results show the effectiveness of the proposed CDFormer, delivering several new state-of-the-art performances on point cloud classification and segmentation tasks.
arXiv Detail & Related papers (2023-06-02T03:48:45Z)
- Position-Guided Point Cloud Panoptic Segmentation Transformer [118.17651196656178]
This work begins by applying this appealing paradigm to LiDAR-based point cloud segmentation and obtains a simple yet effective baseline.
We observe that instances in sparse point clouds are small relative to the whole scene and often have similar geometry but lack distinctive appearance for segmentation, challenges that are rare in the image domain.
The method, named Position-guided Point cloud Panoptic segmentation transFormer (P3Former), outperforms previous state-of-the-art methods by 3.4% and 1.2% on the SemanticKITTI and nuScenes benchmarks, respectively.
arXiv Detail & Related papers (2023-03-23T17:59:02Z)
- A Unified BEV Model for Joint Learning of 3D Local Features and Overlap Estimation [12.499361832561634]
We present a unified bird's-eye view (BEV) model for joint learning of 3D local features and overlap estimation.
Our method significantly outperforms existing methods on overlap prediction, especially in scenes with small overlaps.
arXiv Detail & Related papers (2023-02-28T12:01:16Z)
- Text to Point Cloud Localization with Relation-Enhanced Transformer [14.635206837740231]
We focus on the text-to-point-cloud cross-modal localization problem.
It aims to identify the described location from city-scale point clouds.
We propose a unified Relation-Enhanced Transformer (RET) to improve representation discriminability.
arXiv Detail & Related papers (2023-01-13T02:58:49Z)
- Adaptive Edge-to-Edge Interaction Learning for Point Cloud Analysis [118.30840667784206]
A key issue in point cloud data processing is extracting useful information from local regions.
Previous works ignore the relation between edges in local regions, which encodes the local shape information.
This paper proposes a novel Adaptive Edge-to-Edge Interaction Learning module.
arXiv Detail & Related papers (2022-11-20T07:10:14Z)
- SE(3)-Equivariant Attention Networks for Shape Reconstruction in Function Space [50.14426188851305]
We propose the first SE(3)-equivariant coordinate-based network for learning occupancy fields from point clouds.
In contrast to previous shape reconstruction methods that align the input to a regular grid, we operate directly on the irregular, unoriented point cloud.
We show that our method outperforms previous SO(3)-equivariant methods, as well as non-equivariant methods trained on SO(3)-augmented datasets.
arXiv Detail & Related papers (2022-04-05T17:59:15Z)
- Multi-Scale Representation Learning for Spatial Feature Distributions using Grid Cells [11.071527762096053]
We propose a representation learning model called Space2Vec to encode the absolute positions and spatial relationships of places.
Results show that because of its multi-scale representations, Space2Vec outperforms well-established ML approaches.
arXiv Detail & Related papers (2020-02-16T04:22:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.