Text2Loc: 3D Point Cloud Localization from Natural Language
- URL: http://arxiv.org/abs/2311.15977v2
- Date: Thu, 28 Mar 2024 09:31:05 GMT
- Title: Text2Loc: 3D Point Cloud Localization from Natural Language
- Authors: Yan Xia, Letian Shi, Zifeng Ding, João F. Henriques, Daniel Cremers
- Abstract summary: We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions.
We introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text.
Text2Loc improves the localization accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose dataset.
- Score: 49.01851743372889
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition, followed by fine localization. In global place recognition, relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM), whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover, we propose a novel matching-free fine localization method to further refine the location predictions, which completely removes the need for complicated text-instance matching and is lighter, faster, and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose dataset. Our project page is publicly available at \url{https://yan-xia.github.io/projects/text2loc/}.
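As a rough illustration of the coarse-to-fine pipeline described in the abstract, the following PyTorch sketch pairs a text encoder and a submap encoder with a symmetric contrastive loss for global place recognition and a small regression head for matching-free fine localization. The module names, feature dimension, and temperature are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two stages described in the abstract.
# The encoder modules, the 256-d feature size, and the 0.07 temperature are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineLocalizer(nn.Module):
    def __init__(self, text_encoder: nn.Module, submap_encoder: nn.Module, dim: int = 256):
        super().__init__()
        self.text_encoder = text_encoder      # e.g. a hierarchical transformer over the textual hints
        self.submap_encoder = submap_encoder  # e.g. a point-cloud encoder over the instances of a submap
        # Matching-free fine localization: regress a 2D position directly from the fused features.
        self.refine_head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def contrastive_loss(self, text_feat: torch.Tensor, submap_feat: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
        # Symmetric InfoNCE over a batch of matched (description, submap) pairs:
        # each description should retrieve its own submap and vice versa.
        t = F.normalize(text_feat, dim=-1)
        s = F.normalize(submap_feat, dim=-1)
        logits = t @ s.t() / temperature
        labels = torch.arange(t.size(0), device=t.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

    def refine(self, text_feat: torch.Tensor, submap_feat: torch.Tensor) -> torch.Tensor:
        # Predict the target position relative to the retrieved submap,
        # with no explicit text-instance matching step.
        return self.refine_head(torch.cat([text_feat, submap_feat], dim=-1))
```

At inference, the embedding of a query description would be compared against a database of submap embeddings to retrieve the top candidates, and refine() would then predict the final position within the best submap.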
Related papers
- Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt [10.17947324152468]
The region prompt tuning method decomposes the region text prompt into individual characters and splits the visual feature map into region visual tokens.
This allows each character to match the local features of a token, avoiding the omission of detailed features and fine-grained text.
Our proposed method combines a general score map from the image-text process with a region score map derived from character-token matching.
arXiv Detail & Related papers (2024-09-20T15:24:26Z)
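To make the character-token matching in the Region Prompt Tuning summary above concrete, here is a hypothetical sketch of scoring region visual tokens against individual prompt characters and fusing the result with a general image-text score map; the cosine similarity and the weighted-average fusion are assumptions, not the paper's formulation.

```python
# Hypothetical sketch of fusing a general image-text score map with a region score
# map derived from character-token matching. The matching rule and the averaging
# fusion below are illustrative assumptions only.
import torch
import torch.nn.functional as F

def region_score_map(char_embeds: torch.Tensor, region_tokens: torch.Tensor) -> torch.Tensor:
    """char_embeds: (C, D) per-character prompt embeddings.
    region_tokens: (H, W, D) region visual tokens from the feature map.
    Returns an (H, W) score map: each token is scored by its best-matching character."""
    tokens = F.normalize(region_tokens.flatten(0, 1), dim=-1)     # (H*W, D)
    chars = F.normalize(char_embeds, dim=-1)                      # (C, D)
    sim = tokens @ chars.t()                                      # (H*W, C) cosine similarities
    return sim.max(dim=-1).values.view(region_tokens.shape[:2])   # (H, W)

def fuse_scores(general_map: torch.Tensor, region_map: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    # Weighted combination of the coarse image-text map and the fine region map.
    return alpha * general_map + (1.0 - alpha) * region_map
```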
- Instance-free Text to Point Cloud Localization with Relative Position Awareness [37.22900045434484]
Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration.
We address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances.
Our proposed model follows a two-stage pipeline, including a coarse stage for text-cell retrieval and a fine stage for position estimation.
arXiv Detail & Related papers (2024-04-27T09:46:49Z)
- Efficiently Leveraging Linguistic Priors for Scene Text Spotting [63.22351047545888]
This paper proposes a method that leverages linguistic knowledge from a large text corpus to replace the traditional one-hot encoding used in auto-regressive scene text spotting and recognition models.
We generate text distributions that align well with scene text datasets, removing the need for in-domain fine-tuning.
Experimental results show that our method not only improves recognition accuracy but also enables more accurate localization of words.
arXiv Detail & Related papers (2024-02-27T01:57:09Z)
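One way the corpus-derived text distributions mentioned in the summary above could replace one-hot targets is sketched below; the character-level unigram prior and the mixing weight are illustrative assumptions rather than the paper's method.

```python
# Hypothetical sketch: build soft character targets from corpus statistics instead of
# one-hot labels. The unigram prior and the fixed mixing weight are assumptions.
from collections import Counter
import torch

def corpus_char_prior(corpus, vocab: str) -> torch.Tensor:
    # Character frequencies over a large text corpus, normalized to a distribution.
    counts = Counter(c for line in corpus for c in line if c in vocab)
    prior = torch.tensor([counts[c] for c in vocab], dtype=torch.float)
    return prior / prior.sum().clamp(min=1.0)

def soft_targets(label: str, vocab: str, prior: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    """Mix the one-hot encoding of each label character with the corpus-derived prior.
    Assumes every character of `label` appears in `vocab`."""
    idx = torch.tensor([vocab.index(c) for c in label])
    one_hot = torch.nn.functional.one_hot(idx, num_classes=len(vocab)).float()
    return (1.0 - eps) * one_hot + eps * prior  # (len(label), |vocab|) distributions
```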
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with a Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Text to Point Cloud Localization with Relation-Enhanced Transformer [14.635206837740231]
We focus on the text-to-point-cloud cross-modal localization problem.
It aims to identify the described location from city-scale point clouds.
We propose a unified Relation-Enhanced Transformer (RET) to improve representation discriminability.
arXiv Detail & Related papers (2023-01-13T02:58:49Z)
- Text2Pos: Text-to-Point-Cloud Cross-Modal Localization [12.984256838490795]
Cross-modal text-to-point-cloud localization can allow us to specify a vehicle pick-up or goods delivery location.
We propose Text2Pos, a cross-modal localization module that learns to align textual descriptions with localization cues in a coarse-to-fine manner.
Our experiments show that we can localize 65% of textual queries to within 15 m of the query location among the top-10 retrieved locations.
arXiv Detail & Related papers (2022-03-28T22:06:00Z)
- SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition [73.61592015908353]
We propose a new end-to-end scene text spotting framework termed SwinTextSpotter.
Using a Transformer with a dynamic head as the detector, we unify the two tasks with a novel Recognition Conversion mechanism.
The design results in a concise framework that requires neither an additional rectification module nor character-level annotation.
arXiv Detail & Related papers (2022-03-19T01:14:42Z)
- SSC: Semantic Scan Context for Large-Scale Place Recognition [13.228580954956342]
We explore the use of high-level features, namely semantics, to improve the representation ability of descriptors.
We propose a novel global descriptor, Semantic Scan Context, which explores semantic information to represent scenes more effectively.
Our approach outperforms the state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2021-07-01T11:51:19Z)
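The Semantic Scan Context entry above describes a global descriptor built from semantic information; the following NumPy sketch shows one plausible construction with a polar ring-sector grid storing the dominant semantic label per cell. The grid resolution and the majority-vote rule are assumptions, not the paper's exact descriptor.

```python
# Hypothetical sketch of a semantic scan-context-style descriptor: bin the points of a
# scan into a polar (ring x sector) grid and store the dominant semantic label per cell.
import numpy as np

def semantic_scan_context(points: np.ndarray, labels: np.ndarray,
                          num_rings: int = 20, num_sectors: int = 60,
                          max_range: float = 80.0) -> np.ndarray:
    """points: (N, 3) scan in the sensor frame; labels: (N,) non-negative integer classes.
    Returns a (num_rings, num_sectors) descriptor of per-cell dominant labels (0 = empty)."""
    desc = np.zeros((num_rings, num_sectors), dtype=np.int64)
    r = np.linalg.norm(points[:, :2], axis=1)
    theta = np.arctan2(points[:, 1], points[:, 0])  # angle in [-pi, pi)
    ring = np.clip((r / max_range * num_rings).astype(int), 0, num_rings - 1)
    sector = np.clip(((theta + np.pi) / (2 * np.pi) * num_sectors).astype(int), 0, num_sectors - 1)
    for i in range(num_rings):
        for j in range(num_sectors):
            cell = labels[(ring == i) & (sector == j)]
            if cell.size:
                desc[i, j] = np.bincount(cell).argmax()  # dominant semantic class in this cell
    return desc
```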
- T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval [59.990432265734384]
Text-video retrieval is a challenging task that aims to search for relevant video content based on natural language descriptions.
Most existing methods only consider the global cross-modal similarity and overlook the local details.
In this paper, we design an efficient global-local alignment method.
We achieve consistent improvements on three standard text-video retrieval benchmarks and outperform the state-of-the-art by a clear margin.
arXiv Detail & Related papers (2021-04-20T15:26:24Z)
- ContourNet: Taking a Further Step toward Accurate Arbitrary-shaped Scene Text Detection [147.10751375922035]
We propose ContourNet, which effectively handles false positives and the large scale variance of scene text.
Our method suppresses these false positives by outputting only predictions with high response values in both directions.
arXiv Detail & Related papers (2020-04-10T08:15:23Z)
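The ContourNet summary above mentions keeping only predictions with high responses in both directions; a toy sketch of that gating rule is given below, with the fixed threshold as an illustrative assumption rather than the paper's procedure.

```python
# Hypothetical sketch: keep text predictions only where the two orthogonal response
# maps are both high, suppressing one-direction false positives.
import torch

def suppress_false_positives(resp_h: torch.Tensor, resp_v: torch.Tensor,
                             threshold: float = 0.5) -> torch.Tensor:
    """resp_h, resp_v: (H, W) response maps in [0, 1] for the two orthogonal directions.
    Returns a boolean mask that is True only where both responses exceed the threshold."""
    return (resp_h > threshold) & (resp_v > threshold)
```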
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.