Text2Pos: Text-to-Point-Cloud Cross-Modal Localization
- URL: http://arxiv.org/abs/2203.15125v1
- Date: Mon, 28 Mar 2022 22:06:00 GMT
- Title: Text2Pos: Text-to-Point-Cloud Cross-Modal Localization
- Authors: Manuel Kolmet, Qunjie Zhou, Aljosa Osep, Laura Leal-Taixe
- Abstract summary: Cross-modal text-to-point-cloud localization can allow us to specify a vehicle pick-up or goods delivery location.
We propose Text2Pos, a cross-modal localization module that learns to align textual descriptions with localization cues in a coarse-to-fine manner.
Our experiments show that we can localize 65% of textual queries to within 15 m of the true query location among the top-10 retrieved candidates.
- Score: 12.984256838490795
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Natural language-based communication with mobile devices and home appliances
is becoming increasingly popular and has the potential to become natural for
communicating with mobile robots in the future. Towards this goal, we
investigate cross-modal text-to-point-cloud localization that will allow us to
specify, for example, a vehicle pick-up or goods delivery location. In
particular, we propose Text2Pos, a cross-modal localization module that learns
to align textual descriptions with localization cues in a coarse-to-fine
manner. Given a point cloud of the environment, Text2Pos locates a position
that is specified via a natural language-based description of the immediate
surroundings. To train Text2Pos and study its performance, we construct
KITTI360Pose, the first dataset for this task based on the recently introduced
KITTI360 dataset. Our experiments show that we can localize 65% of textual
queries to within 15 m of the true query location among the top-10 retrieved
candidates.
This is a starting point that we hope will spark future developments towards
language-based navigation.
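To make the coarse-to-fine pipeline and the evaluation protocol concrete, below is a minimal NumPy sketch. It is an illustration under assumptions, not the paper's implementation: the function names, embedding shapes, and the way the fine stage receives its offset are invented for exposition. The coarse stage ranks pre-computed map-cell embeddings against a text embedding, the fine stage refines a metric position inside a candidate cell, and the metric mirrors the reported top-10 recall within 15 m.

```python
import numpy as np

def retrieve_cells(text_embedding, cell_embeddings, k=10):
    """Coarse stage: rank map cells by cosine similarity to the query text.

    text_embedding: (d,) embedding of the natural-language description.
    cell_embeddings: (n_cells, d) embeddings of city-scale point-cloud cells.
    Returns the indices of the top-k candidate cells.
    """
    text = text_embedding / np.linalg.norm(text_embedding)
    cells = cell_embeddings / np.linalg.norm(cell_embeddings, axis=1, keepdims=True)
    scores = cells @ text
    return np.argsort(-scores)[:k]

def refine_position(cell_center, offset):
    """Fine stage: regress a metric offset from a candidate cell's center.

    In the full model the offset would come from matching described hints
    (e.g. "a pole on the right") against objects inside the cell; here it
    is simply passed in as a placeholder.
    """
    return cell_center + offset

def recall_at_k(predictions, ground_truth, threshold=15.0):
    """Fraction of queries whose best top-k prediction lies within
    `threshold` meters of the true location (the paper reports 65%
    at 15 m for top-10 retrieval).

    predictions: (n_queries, k, 2) candidate positions per query.
    ground_truth: (n_queries, 2) true query locations.
    """
    dists = np.linalg.norm(predictions - ground_truth[:, None, :], axis=-1)
    return float(np.mean(dists.min(axis=1) <= threshold))
```

The sketch assumes text and cell embeddings that already live in a shared space, presumably trained so that matching text-cell pairs score highly; how those encoders are trained is described in the paper itself.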
Related papers
- Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering [50.52792174648067]
This initiative seeks to bridge the gap between textual and visual comprehension.
We propose a new multi-task Urdu scene text dataset comprising over 1000 natural scene images.
We provide fine-grained annotations for text instances, addressing the limitations of previous datasets.
arXiv Detail & Related papers (2024-05-21T06:48:26Z)
- Instance-free Text to Point Cloud Localization with Relative Position Awareness [37.22900045434484]
Text-to-point-cloud cross-modal localization is an emerging vision-language task critical for future robot-human collaboration.
We address two key limitations of existing approaches: 1) their reliance on ground-truth instances as input; and 2) their neglect of the relative positions among potential instances.
Our proposed model follows a two-stage pipeline, including a coarse stage for text-cell retrieval and a fine stage for position estimation.
arXiv Detail & Related papers (2024-04-27T09:46:49Z)
- Text2Loc: 3D Point Cloud Localization from Natural Language [49.01851743372889]
We tackle the problem of 3D point cloud localization based on a few natural language descriptions.
We introduce a novel neural network, Text2Loc, that fully interprets the semantic relationship between points and text.
Text2Loc improves the localization accuracy by up to $2\times$ over the state-of-the-art on the KITTI360Pose dataset.
arXiv Detail & Related papers (2023-11-27T16:23:01Z)
- Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching [60.645802236700035]
Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
arXiv Detail & Related papers (2023-11-21T17:52:30Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Text to Point Cloud Localization with Relation-Enhanced Transformer [14.635206837740231]
We focus on the text-to-point-cloud cross-modal localization problem.
It aims to identify the described location from city-scale point clouds.
We propose a unified Relation-Enhanced Transformer (RET) to improve representation discriminability.
arXiv Detail & Related papers (2023-01-13T02:58:49Z)
- DORi: Discovering Object Relationship for Moment Localization of a Natural-Language Query in Video [98.54696229182335]
We study the task of temporal moment localization in a long untrimmed video using a natural language query.
Our key innovation is to learn a video feature embedding through a language-conditioned message-passing algorithm.
A temporal sub-graph captures the activities within the video through time.
arXiv Detail & Related papers (2020-10-13T09:50:29Z)
- Language Guided Networks for Cross-modal Moment Retrieval [66.49445903955777]
Cross-modal moment retrieval aims to localize a temporal segment from an untrimmed video described by a natural language query.
Existing methods independently extract the features of videos and sentences.
We present Language Guided Networks (LGN), a new framework that leverages the sentence embedding to guide the whole process of moment retrieval.
arXiv Detail & Related papers (2020-06-18T12:08:40Z)