Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
- URL: http://arxiv.org/abs/2311.12751v4
- Date: Wed, 31 Jul 2024 08:24:16 GMT
- Title: Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
- Authors: Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, Tat-Seng Chua
- Abstract summary: Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets.
We introduce GeoText-1652, a new natural language-guided geo-localization benchmark.
This dataset is systematically constructed through an interactive human-computer process.
- Score: 60.645802236700035
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geo-localization benchmark. This dataset is systematically constructed through an interactive human-computer process leveraging Large Language Model (LLM) driven annotation techniques in conjunction with pre-trained vision models. GeoText-1652 extends the established University-1652 image dataset with spatial-aware text annotations, thereby establishing one-to-one correspondences between image, text, and bounding box elements. We further introduce a new optimization objective to leverage fine-grained spatial associations, called blending spatial matching, for region-level spatial relation matching. Extensive experiments reveal that our approach maintains a competitive recall rate compared with other prevailing cross-modality methods. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.
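The abstract does not spell out the blending spatial matching objective, so the sketch below is only a rough illustration of what a region-level matching loss of this kind can look like: a standard symmetric image-text contrastive term blended with a region-to-phrase alignment term. All tensors, function names, and the blending weight are placeholders, not values from the paper.

```python
# A rough illustration of a region-level matching loss; NOT the paper's
# actual "blending spatial matching" objective. Random tensors stand in
# for encoder outputs.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over matched image/text embedding pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) similarities
    targets = torch.arange(img_emb.size(0))        # diagonal entries match
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def region_loss(region_emb, phrase_emb, temperature=0.07):
    """Align each region embedding with its describing phrase embedding."""
    region_emb = F.normalize(region_emb, dim=-1)
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    logits = region_emb @ phrase_emb.t() / temperature
    return F.cross_entropy(logits, torch.arange(region_emb.size(0)))

B, R, D = 8, 8, 256          # batch size, number of regions, embedding dim
alpha = 0.5                  # placeholder blending weight, not from the paper
loss = (contrastive_loss(torch.randn(B, D), torch.randn(B, D))
        + alpha * region_loss(torch.randn(R, D), torch.randn(R, D)))
```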
Related papers
- Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework.
Through inter-agent communication, smileGeo integrates the inherent knowledge of these agents with additional retrieved information.
Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
- Into the Unknown: Generating Geospatial Descriptions for New Environments [18.736071151303726]
The Rendezvous task requires reasoning over allocentric spatial relationships.
Using open-source descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text.
We propose a large-scale augmentation method for generating high-quality synthetic data for new environments.
arXiv Detail & Related papers (2024-06-28T14:56:21Z)
- RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection [20.630629383286262]
Open-vocabulary object detection requires solid modeling of the region-semantic relationship.
We propose RTGen to generate scalable open-vocabulary region-text pairs.
arXiv Detail & Related papers (2024-05-30T09:03:23Z)
- Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding.
We build a topological map on-the-fly to enable efficient exploration in global action space.
The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
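As a loose illustration of the on-the-fly topological map idea above (not DUET's actual implementation), the sketch below keeps a growing graph of visited and observed-but-unvisited nodes; the unvisited frontier is what a global action space can select long-term goals from. All names and the node vocabulary are invented.

```python
# A minimal, hypothetical sketch of building a topological map on the fly;
# node and edge details are invented for illustration.
class TopologicalMap:
    def __init__(self):
        self.edges = {}        # node id -> set of neighbouring node ids
        self.visited = set()   # nodes the agent has physically reached

    def observe(self, current, neighbours):
        """Register the current node and the nodes visible from it."""
        self.visited.add(current)
        self.edges.setdefault(current, set()).update(neighbours)
        for n in neighbours:
            self.edges.setdefault(n, set()).add(current)

    def frontier(self):
        """Unvisited nodes: candidate goals in the global action space."""
        return [n for n in self.edges if n not in self.visited]

m = TopologicalMap()
m.observe("start", ["hall", "stairs"])
m.observe("hall", ["kitchen"])
print(m.frontier())  # ['stairs', 'kitchen'] — candidates beyond the local view
```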
- Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning.
Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information.
Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
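The entry above mentions selectively aggregating neighbors' information with a Gated GCN; the sketch below shows one generic form of gated neighbor aggregation (a sigmoid gate modulating messages before a degree-normalized sum over the adjacency). The exact formulation in the paper may differ; all weights here are random placeholders.

```python
# A rough sketch of gated neighbour aggregation over a semantic graph;
# the actual Gated GCN formulation in the paper may differ.
import torch

def gated_aggregate(node_feats, adj, w_gate, w_msg):
    """node_feats: (N, D); adj: (N, N) 0/1 adjacency; returns updated (N, D)."""
    msgs = node_feats @ w_msg                    # transform neighbour messages
    gates = torch.sigmoid(node_feats @ w_gate)   # per-node gate in (0, 1)
    gated = gates * msgs                         # suppress uninformative nodes
    deg = adj.sum(dim=1, keepdim=True).clamp(min=1)
    return node_feats + (adj @ gated) / deg      # mean of gated neighbours

N, D = 5, 16
x = torch.randn(N, D)
adj = (torch.rand(N, N) > 0.5).float()
out = gated_aggregate(x, adj, torch.randn(D, D), torch.randn(D, D))
print(out.shape)  # torch.Size([5, 16])
```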
- Towards Natural Language Question Answering over Earth Observation Linked Data using Attention-based Neural Machine Translation [0.0]
This paper seeks to study and analyze the use of RNN-based neural machine translation with attention for transforming natural language questions into GeoSPARQL queries.
A dataset consisting of mappings from natural language questions to GeoSPARQL queries over the Corine Land Cover (CLC) Linked Data has been created to train and validate the deep neural network.
arXiv Detail & Related papers (2021-01-23T06:12:20Z)
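For intuition, here is one illustrative question-to-GeoSPARQL pair of the kind such a dataset contains. Only the geo:/geof: prefixes and the geof:distance function are standard GeoSPARQL; the clc: vocabulary, class name, and query shape are invented and do not come from the CLC dataset used in the paper.

```python
# One illustrative (question, query) training pair; the clc: namespace and
# class names are hypothetical placeholders.
example = {
    "question": "Which forest areas are within 10 km of Athens?",
    "geosparql": """
        PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
        PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
        PREFIX clc:  <http://example.org/corine#>
        SELECT ?area WHERE {
          ?area a clc:Forest ;
                geo:hasGeometry/geo:asWKT ?wkt .
          FILTER (geof:distance(?wkt,
                  "POINT(23.72 37.98)"^^geo:wktLiteral,
                  <http://www.opengis.net/def/uom/OGC/1.0/metre>) < 10000)
        }
    """,
}
```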
- Geography-Aware Self-Supervised Learning [79.4009241781968]
We show that due to their different characteristics, a non-trivial gap persists between contrastive and supervised learning on standard benchmarks.
We propose novel training methods that exploit the spatially aligned structure of remote sensing data.
Our experiments show that our proposed method closes the gap between contrastive and supervised learning on image classification, object detection and semantic segmentation for remote sensing.
arXiv Detail & Related papers (2020-11-19T17:29:13Z)
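A simplified sketch of the spatial-alignment idea above: remote sensing images that share a location but differ in acquisition time can be grouped into positive pairs for contrastive learning. The metadata fields and grouping below are illustrative, not the paper's pipeline.

```python
# Forming geography-aware positive pairs: same location, different times.
# Field names and values are invented for illustration.
from collections import defaultdict
from itertools import combinations

images = [
    {"id": "a", "loc": (37.77, -122.42), "time": "2018-06"},
    {"id": "b", "loc": (37.77, -122.42), "time": "2019-06"},
    {"id": "c", "loc": (40.71, -74.01),  "time": "2018-06"},
    {"id": "d", "loc": (40.71, -74.01),  "time": "2020-01"},
]

by_loc = defaultdict(list)
for img in images:
    by_loc[img["loc"]].append(img["id"])

# Every pair of images at the same location is a temporal positive pair.
positives = [p for ids in by_loc.values() for p in combinations(ids, 2)]
print(positives)  # [('a', 'b'), ('c', 'd')]
```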
- SIRI: Spatial Relation Induced Network For Spatial Description Resolution [64.38872296406211]
We propose a novel Spatial Relation Induced (SIRI) network for language-guided localization.
We show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius.
Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
arXiv Detail & Related papers (2020-10-27T14:04:05Z)
- Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions.
We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision.
Experiments show that by taking advantage of these relationships we are able to improve over the state of the art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)
- Spatial Language Representation with Multi-Level Geocoding [15.376256625525391]
We present a multi-level geocoding model (MLG) that learns to associate texts to geographic locations.
We show that MLG obtains state-of-the-art results for toponym resolution on three English datasets.
arXiv Detail & Related papers (2020-08-21T00:05:08Z)
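As a toy stand-in for the multi-level geocoding idea above (a real system might use hierarchical cells such as S2; this plain lat/lon grid is only for illustration), the sketch below represents a location by its grid cell at several resolutions, so a model can be supervised at coarse and fine levels jointly.

```python
# A toy multi-level geocoding: one grid-cell index pair per resolution.
# Resolutions in degrees are arbitrary placeholders, not from the paper.
def multilevel_cells(lat, lon, levels=(1.0, 0.1, 0.01)):
    """Return one (cell_lat, cell_lon) index pair per resolution."""
    return [(int(lat // res), int(lon // res)) for res in levels]

print(multilevel_cells(48.8566, 2.3522))
# [(48, 2), (488, 23), (4885, 235)] — coarse to fine cells for Paris
```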