Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language
- URL: http://arxiv.org/abs/2511.15308v1
- Date: Wed, 19 Nov 2025 10:19:45 GMT
- Title: Text2Loc++: Generalizing 3D Point Cloud Localization from Natural Language
- Authors: Yan Xia, Letian Shi, Yilin Di, Joao F. Henriques, Daniel Cremers
- Abstract summary: We present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds. To support benchmarking, we introduce a new city-scale dataset covering color and non-color point clouds from diverse urban scenes. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework.
- Score: 44.7011717447999
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the problem of localizing 3D point cloud submaps using complex and diverse natural language descriptions, and present Text2Loc++, a novel neural network designed for effective cross-modal alignment between language and point clouds in a coarse-to-fine localization pipeline. To support benchmarking, we introduce a new city-scale dataset covering both color and non-color point clouds from diverse urban scenes, and organize location descriptions into three levels of linguistic complexity. In the global place recognition stage, Text2Loc++ combines a pretrained language model with a Hierarchical Transformer with Max pooling (HTM) for sentence-level semantics, and employs an attention-based point cloud encoder for spatial understanding. We further propose Masked Instance Training (MIT) to filter out non-aligned objects and improve multimodal robustness. To enhance the embedding space, we introduce Modality-aware Hierarchical Contrastive Learning (MHCL), incorporating cross-modal, submap-, text-, and instance-level losses. In the fine localization stage, we completely remove explicit text-instance matching and design a lightweight yet powerful framework based on Prototype-based Map Cloning (PMC) and a Cascaded Cross-Attention Transformer (CCAT). Extensive experiments on the KITTI360Pose dataset show that Text2Loc++ outperforms existing methods by up to 15%. In addition, the proposed model exhibits robust generalization when evaluated on the new dataset, effectively handling complex linguistic expressions and a wide variety of urban environments. The code and dataset will be made publicly available.
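The abstract does not give the loss formulas for Modality-aware Hierarchical Contrastive Learning (MHCL). As a rough illustration only, the sketch below shows what a multi-level contrastive objective of this kind might look like: a symmetric InfoNCE term aligning text and submap embeddings, plus a finer instance-level term, combined with fixed weights. All function names, the averaging weights, and the two-level structure are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def log_softmax(x):
    """Numerically stable row-wise log-softmax."""
    x = x - x.max(axis=1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=1, keepdims=True))

def info_nce(a, b, temperature=0.1):
    """Symmetric InfoNCE over row-aligned embedding matrices (row i of a
    matches row i of b); matching pairs sit on the diagonal of the
    similarity matrix and are treated as the positive targets."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                      # (N, N) cosine similarities
    loss_a2b = -np.mean(np.diag(log_softmax(logits)))   # text -> point cloud
    loss_b2a = -np.mean(np.diag(log_softmax(logits.T))) # point cloud -> text
    return 0.5 * (loss_a2b + loss_b2a)

def hierarchical_contrastive_loss(text_emb, submap_emb,
                                  inst_text_emb, inst_point_emb,
                                  weights=(1.0, 0.5)):
    """Toy two-level objective: a coarse submap-level cross-modal term
    plus a finer instance-level term, weighted and summed."""
    w_submap, w_inst = weights
    return (w_submap * info_nce(text_emb, submap_emb)
            + w_inst * info_nce(inst_text_emb, inst_point_emb))
```

In this toy form, perfectly aligned pairs drive the loss toward zero while mismatched embeddings keep it near log N per level; the actual MHCL also includes text- and submap-internal terms that are omitted here.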
Related papers
- Lemon: A Unified and Scalable 3D Multimodal Model for Universal Spatial Understanding [80.66591664266744]
Lemon is a unified transformer architecture that processes 3D point cloud patches and language tokens as a single sequence. To handle the complexity of 3D data, we develop a structured patchification and tokenization scheme that preserves spatial context. Lemon establishes new state-of-the-art performance across comprehensive 3D understanding and reasoning tasks.
arXiv Detail & Related papers (2025-12-14T20:02:43Z) - Gen-LangSplat: Generalized Language Gaussian Splatting with Pre-Trained Feature Compression [0.0]
We introduce Gen-LangSplat, which replaces the scene-wise autoencoder with a generalized autoencoder, pre-trained extensively on the large-scale ScanNet dataset. This architectural shift enables the use of a fixed, compact latent space for language features across any new scene without any scene-specific training. Our results demonstrate that generalized embeddings can efficiently and accurately support open-vocabulary querying in novel 3D scenes.
arXiv Detail & Related papers (2025-10-27T02:13:38Z) - Affogato: Learning Open-Vocabulary Affordance Grounding with Automated Data Generation at Scale [41.693908591580175]
We develop vision-language models that leverage pretrained part-aware vision backbones and a text-conditional heatmap decoder. Our models achieve promising performance on the existing 2D and 3D benchmarks, and notably, exhibit effectiveness in open-vocabulary cross-domain generalization.
arXiv Detail & Related papers (2025-06-13T17:57:18Z) - Instance-free Text to Point Cloud Localization with Relative Position Awareness [37.22900045434484]
Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration.
We address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances.
Our proposed model follows a two-stage pipeline, including a coarse stage for text-cell retrieval and a fine stage for position estimation.
arXiv Detail & Related papers (2024-04-27T09:46:49Z) - Text2Loc: 3D Point Cloud Localization from Natural Language [49.01851743372889]
We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions.
We introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text.
Text2Loc improves the localization accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose dataset.
arXiv Detail & Related papers (2023-11-27T16:23:01Z) - PointLLM: Empowering Large Language Models to Understand Point Clouds [63.39876878899682]
PointLLM understands colored object point clouds with human instructions.
It generates contextually appropriate responses, illustrating its grasp of point clouds and common sense.
arXiv Detail & Related papers (2023-08-31T17:59:46Z) - LCPFormer: Towards Effective 3D Point Cloud Analysis via Local Context Propagation in Transformers [60.51925353387151]
We propose a novel module named Local Context Propagation (LCP) to exploit the message passing between neighboring local regions.
We use the overlap points of adjacent local regions as intermediaries, then re-weight the features of these shared points from different local regions before passing them to the next layers.
The proposed method is applicable to different tasks and outperforms various transformer-based methods in benchmarks including 3D shape classification and dense prediction tasks.
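The LCP mechanism described above (overlap points of adjacent local regions act as intermediaries and are re-weighted before the next layer) can be caricatured in a few lines of numpy. The simple averaging below stands in for LCPFormer's learned re-weighting, and all names and index conventions are illustrative assumptions, not the paper's code.

```python
import numpy as np

def propagate_via_overlap(feat_a, feat_b, overlap_a, overlap_b):
    """Toy overlap-based message passing between two local regions.

    feat_a, feat_b       : (Na, C) and (Nb, C) per-point feature arrays
    overlap_a, overlap_b : integer index arrays selecting the same shared
                           points within region a and region b respectively
    """
    # Gather the two regions' views of the shared points and fuse them
    # (plain averaging here; the paper learns the re-weighting).
    shared = 0.5 * (feat_a[overlap_a] + feat_b[overlap_b])
    # Write the fused features back so each region carries context from
    # its neighbor into the next transformer layer.
    out_a, out_b = feat_a.copy(), feat_b.copy()
    out_a[overlap_a] = shared
    out_b[overlap_b] = shared
    return out_a, out_b
```

Non-overlapping points pass through unchanged; only the shared points carry information across region boundaries, which is what lets local transformer blocks exchange context without global attention.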
arXiv Detail & Related papers (2022-10-23T15:43:01Z) - Campus3D: A Photogrammetry Point Cloud Benchmark for Hierarchical Understanding of Outdoor Scene [76.4183572058063]
We present a richly-annotated 3D point cloud dataset for multiple outdoor scene understanding tasks.
The dataset has been point-wisely annotated with both hierarchical and instance-based labels.
We formulate a hierarchical learning problem for 3D point cloud segmentation and propose a measurement evaluating consistency across various hierarchies.
arXiv Detail & Related papers (2020-08-11T19:10:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.