Related papers: Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

URL: http://arxiv.org/abs/2311.12751v4
Date: Wed, 31 Jul 2024 08:24:16 GMT
Title: Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching
Authors: Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, Tat-Seng Chua
Abstract summary: Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets. We introduce GeoText-1652, a new natural language-guided geo-localization benchmark. This dataset is systematically constructed through an interactive human-computer process.
Score: 60.645802236700035
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Navigating drones through natural language commands remains challenging due to the dearth of accessible multi-modal datasets and the stringent precision requirements for aligning visual and textual data. To address this pressing need, we introduce GeoText-1652, a new natural language-guided geo-localization benchmark. This dataset is systematically constructed through an interactive human-computer process leveraging Large Language Model (LLM) driven annotation techniques in conjunction with pre-trained vision models. GeoText-1652 extends the established University-1652 image dataset with spatial-aware text annotations, thereby establishing one-to-one correspondences between image, text, and bounding box elements. We further introduce a new optimization objective to leverage fine-grained spatial associations, called blending spatial matching, for region-level spatial relation matching. Extensive experiments reveal that our approach maintains a competitive recall rate compared with other prevailing cross-modality methods. This underscores the promising potential of our approach in elevating drone control and navigation through the seamless integration of natural language commands in real-world scenarios.

Related papers

OmniGeo: Towards a Multimodal Large Language Models for Geospatial Artificial Intelligence [51.0456395687016]
Multimodal large language models (MLLMs) have opened new frontiers in artificial intelligence. We propose an MLLM (OmniGeo) tailored to geospatial applications. By combining the strengths of natural language understanding and spatial reasoning, our model improves instruction following and the accuracy of GeoAI systems.
arXiv Detail & Related papers (2025-03-20T16:45:48Z)
SJTU:Spatial judgments in multimodal models towards unified segmentation through coordinate detection [4.930667479611019]
This paper introduces SJTU: Spatial Judgments in Multimodal Models - Towards Unified Segmentation through Coordinate Detection. It presents an approach for integrating segmentation techniques with vision-language models through spatial inference in multimodal space. We demonstrate superior performance across benchmark datasets, achieving IoU scores of 0.5958 on COCO 2017 and 0.6758 on Pascal VOC.
arXiv Detail & Related papers (2024-12-03T16:53:58Z)
Swarm Intelligence in Geo-Localization: A Multi-Agent Large Vision-Language Model Collaborative Framework [51.26566634946208]
We introduce smileGeo, a novel visual geo-localization framework. Through inter-agent communication, smileGeo integrates the inherent knowledge of its agents with additional retrieved information. Results show that our approach significantly outperforms current state-of-the-art methods.
arXiv Detail & Related papers (2024-08-21T03:31:30Z)
Into the Unknown: Generating Geospatial Descriptions for New Environments [18.736071151303726]
The Rendezvous task requires reasoning over allocentric spatial relationships. Using open-source descriptions paired with coordinates (e.g., Wikipedia) provides training data but suffers from limited spatially-oriented text. We propose a large-scale augmentation method for generating high-quality synthetic data for new environments.
arXiv Detail & Related papers (2024-06-28T14:56:21Z)
RTGen: Generating Region-Text Pairs for Open-Vocabulary Object Detection [20.630629383286262]
Open-vocabulary object detection requires solid modeling of the region-semantic relationship. We propose RTGen to generate scalable open-vocabulary region-text pairs.
arXiv Detail & Related papers (2024-05-30T09:03:23Z)
Think Global, Act Local: Dual-scale Graph Transformer for Vision-and-Language Navigation [87.03299519917019]
We propose a dual-scale graph transformer (DUET) for joint long-term action planning and fine-grained cross-modal understanding. We build a topological map on-the-fly to enable efficient exploration in global action space. The proposed approach, DUET, significantly outperforms state-of-the-art methods on goal-oriented vision-and-language navigation benchmarks.
arXiv Detail & Related papers (2022-02-23T19:06:53Z)
Exploring Explicit and Implicit Visual Relationships for Image Captioning [11.82805641934772]
In this paper, we explore explicit and implicit visual relationships to enrich region-level representations for image captioning. Explicitly, we build a semantic graph over object pairs and exploit gated graph convolutional networks (Gated GCN) to selectively aggregate local neighbors' information. Implicitly, we draw global interactions among the detected objects through region-based bidirectional encoder representations from transformers.
arXiv Detail & Related papers (2021-05-06T01:47:51Z)
Towards Natural Language Question Answering over Earth Observation Linked Data using Attention-based Neural Machine Translation [0.0]
This paper seeks to study and analyze the use of RNN-based neural machine translation with attention for transforming natural language questions into GeoSPARQL queries. A dataset consisting of mappings from natural language questions to GeoSPARQL queries over the Corine Land Cover (CLC) Linked Data has been created to train and validate the deep neural network.
arXiv Detail & Related papers (2021-01-23T06:12:20Z)
Geography-Aware Self-Supervised Learning [79.4009241781968]
We show that due to their different characteristics, a non-trivial gap persists between contrastive and supervised learning on standard benchmarks. We propose novel training methods that exploit the spatially aligned structure of remote sensing data. Our experiments show that our proposed method closes the gap between contrastive and supervised learning on image classification, object detection and semantic segmentation for remote sensing.
arXiv Detail & Related papers (2020-11-19T17:29:13Z)
SIRI: Spatial Relation Induced Network For Spatial Description Resolution [64.38872296406211]
We propose a novel spatial relationship induced (SIRI) network for language-guided localization. We show that our method is around 24% better than the state-of-the-art method in terms of accuracy, measured by an 80-pixel radius. Our method also generalizes well on our proposed extended dataset collected using the same settings as Touchdown.
arXiv Detail & Related papers (2020-10-27T14:04:05Z)
Language and Visual Entity Relationship Graph for Agent Navigation [54.059606864535304]
Vision-and-Language Navigation (VLN) requires an agent to navigate in a real-world environment following natural language instructions. We propose a novel Language and Visual Entity Relationship Graph for modelling the inter-modal relationships between text and vision. Experiments show that by taking advantage of the relationships we are able to improve over the state of the art.
arXiv Detail & Related papers (2020-10-19T08:25:55Z)
Spatial Language Representation with Multi-Level Geocoding [15.376256625525391]
We present a multi-level geocoding model (MLG) that learns to associate texts with geographic locations. We show that MLG obtains state-of-the-art results for toponym resolution on three English datasets.
arXiv Detail & Related papers (2020-08-21T00:05:08Z)

This list is automatically generated from the titles and abstracts of the papers on this site.