Semantically Aware UAV Landing Site Assessment from Remote Sensing Imagery via Multimodal Large Language Models
- URL: http://arxiv.org/abs/2602.01163v1
- Date: Sun, 01 Feb 2026 11:30:03 GMT
- Title: Semantically Aware UAV Landing Site Assessment from Remote Sensing Imagery via Multimodal Large Language Models
- Authors: Chunliang Hua, Zeyuan Yang, Lei Zhang, Jiayang Sun, Fengwen Chen, Chunlan Zeng, Xiao Hu,
- Abstract summary: Safe UAV emergency landing requires understanding complex semantic risks invisible to traditional geometric sensors. We propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for context-aware landing site assessment.
- Score: 5.987458168544856
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Safe UAV emergency landing requires more than just identifying flat terrain; it demands understanding complex semantic risks (e.g., crowds, temporary structures) invisible to traditional geometric sensors. In this paper, we propose a novel framework leveraging Remote Sensing (RS) imagery and Multimodal Large Language Models (MLLMs) for global context-aware landing site assessment. Unlike local geometric methods, our approach employs a coarse-to-fine pipeline: first, a lightweight semantic segmentation module efficiently pre-screens candidate areas; second, a vision-language reasoning agent fuses visual features with Point-of-Interest (POI) data to detect subtle hazards. To validate this approach, we construct and release the Emergency Landing Site Selection (ELSS) benchmark. Experiments demonstrate that our framework significantly outperforms geometric baselines in risk identification accuracy. Furthermore, qualitative results confirm its ability to generate human-like, interpretable justifications, enhancing trust in automated decision-making. The benchmark dataset is publicly accessible at https://anonymous.4open.science/r/ELSS-dataset-43D7.
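The abstract describes the coarse-to-fine pipeline only at a high level. The Python sketch below is a minimal, hypothetical illustration of how such a flow could be wired together: a segmentation-based pre-screen followed by an MLLM query that fuses the candidate crop with POI records. The class ids, patch size, prompt wording, and the `mllm` callable are all assumptions, not the authors' released code.

```python
import numpy as np

SAFE_IDS = {1, 3, 7}  # hypothetical segmentation class ids: grass, bare soil, paved lot

def prescreen(seg_map: np.ndarray, patch: int = 128, min_safe_ratio: float = 0.9):
    """Stage 1: slide a window over the semantic segmentation map and keep
    patches whose pixels are overwhelmingly 'landable' classes."""
    h, w = seg_map.shape
    candidates = []
    for y in range(0, h - patch + 1, patch):
        for x in range(0, w - patch + 1, patch):
            ratio = np.isin(seg_map[y:y + patch, x:x + patch], list(SAFE_IDS)).mean()
            if ratio >= min_safe_ratio:
                candidates.append({"xywh": (x, y, patch, patch), "safe_ratio": float(ratio)})
    return candidates

def assess_with_mllm(mllm, image_crop, pois):
    """Stage 2: ask a vision-language model to reason over the candidate crop
    together with nearby Point-of-Interest records (e.g., 'school', 'market')."""
    prompt = (
        "Assess this aerial image crop as a UAV emergency landing site.\n"
        f"Nearby POIs: {', '.join(pois) if pois else 'none'}.\n"
        "List semantic hazards (crowds, temporary structures, traffic), give a "
        "risk score in [0, 1], and justify the score in one sentence."
    )
    return mllm(image_crop, prompt)  # `mllm` is an assumed callable wrapping an MLLM API
```

Under these assumptions, only the patches that survive `prescreen` are sent to the (more expensive) `assess_with_mllm` step, which is the efficiency argument behind the coarse-to-fine design.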
Related papers
- Vision-Language Feature Alignment for Road Anomaly Segmentation [38.2615882515309]
We propose a vision-language anomaly segmentation framework that incorporates semantic priors from pre-trained Vision-Language Models (VLMs). Specifically, we design a prompt learning-driven alignment module that adapts Mask2Former's visual features to CLIP text embeddings of known categories. At inference time, we introduce a multi-source inference strategy that integrates text-guided similarity, CLIP-based image-text similarity, and detector confidence.
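The multi-source inference strategy is described only at a high level. One plausible reading is a weighted fusion of per-pixel score maps; the sketch below assumes an equal-weight average, which is an illustrative choice rather than the paper's exact rule.

```python
import numpy as np

def fuse_anomaly_scores(text_sim, clip_sim, det_conf, weights=(1.0, 1.0, 1.0)):
    """Hypothetical fusion of three per-pixel score maps into one anomaly map.
    High similarity to known classes and high detector confidence both argue
    against an anomaly, so the fused inlier score is inverted at the end."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    inlier = (w[0] * np.asarray(text_sim)
              + w[1] * np.asarray(clip_sim)
              + w[2] * np.asarray(det_conf))
    return 1.0 - inlier  # higher value = more likely road anomaly
```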
arXiv Detail & Related papers (2026-03-01T10:17:00Z) - OSDA: A Framework for Open-Set Discovery and Automatic Interpretation of Land-cover in Remote Sensing Imagery [10.196580289786414]
Open-set land-cover analysis in remote sensing requires the ability to achieve fine-grained spatial localization and semantically open categorization. We introduce OSDA, an integrated three-stage framework for annotation-free open-set land-cover discovery, segmentation, and description. Our work provides a scalable and interpretable solution for dynamic land-cover monitoring, showing strong potential for automated cartographic updating and large-scale earth observation analysis.
arXiv Detail & Related papers (2025-09-23T06:23:56Z) - RIS-LAD: A Benchmark and Model for Referring Low-Altitude Drone Image Segmentation [26.836547579041067]
Referring Image Segmentation (RIS) aims to segment specific objects based on natural language descriptions. Existing datasets and methods are typically designed for high-altitude and static-view imagery. We present RIS-LAD, the first fine-grained RIS benchmark tailored for Low-Altitude Drone (LAD) scenarios.
arXiv Detail & Related papers (2025-07-28T15:21:03Z) - Towards a Multi-Agent Vision-Language System for Zero-Shot Novel Hazardous Object Detection for Autonomous Driving Safety [0.0]
We propose a multimodal approach that integrates vision-language reasoning with zero-shot object detection. We refine object detection by incorporating OpenAI's CLIP model to match predicted hazards with bounding box annotations. Our findings highlight the strengths and limitations of current vision-language-based approaches.
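The blurb mentions matching predicted hazards to bounding boxes with CLIP. One plausible reading is cosine similarity between hazard-text embeddings and per-box image embeddings, as in this sketch; the embeddings are assumed to be precomputed, and the 0.25 threshold is purely illustrative.

```python
import numpy as np

def match_hazards_to_boxes(text_embs, box_embs, threshold=0.25):
    """Assign each described hazard to the detection box whose image embedding
    is most similar (CLIP-style cosine similarity). Returns (hazard_idx,
    box_idx, similarity) triples above the illustrative threshold."""
    t = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    b = box_embs / np.linalg.norm(box_embs, axis=1, keepdims=True)
    sims = t @ b.T  # shape: [num_hazards, num_boxes]
    best = sims.argmax(axis=1)
    return [(i, int(j), float(sims[i, j]))
            for i, j in enumerate(best) if sims[i, j] >= threshold]
```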
arXiv Detail & Related papers (2025-04-18T01:25:02Z) - SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model [61.97017867656831]
We introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. We construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods.
arXiv Detail & Related papers (2025-04-13T16:36:47Z) - Open-Set Deepfake Detection: A Parameter-Efficient Adaptation Method with Forgery Style Mixture [81.93945602120453]
We introduce an approach that is both general and parameter-efficient for face forgery detection. We design a forgery-style mixture formulation that augments the diversity of forgery source domains. We show that the designed model achieves state-of-the-art generalizability with significantly reduced trainable parameters.
arXiv Detail & Related papers (2024-08-23T01:53:36Z) - Federated Adversarial Learning for Robust Autonomous Landing Runway Detection [6.029462194041386]
In this paper, we propose a federated adversarial learning-based framework to detect landing runways.
To the best of our knowledge, this marks the first instance of federated learning work that addresses the adversarial sample problem in landing runway detection.
arXiv Detail & Related papers (2024-06-22T19:31:52Z) - Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning [52.06176253457522]
We propose a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning.
CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A.
arXiv Detail & Related papers (2023-08-18T13:13:09Z) - Towards General Visual-Linguistic Face Forgery Detection [95.73987327101143]
Deepfakes are realistic face manipulations that can pose serious threats to security, privacy, and trust.
Existing methods mostly treat this task as binary classification, which uses digital labels or mask signals to train the detection model.
We propose a novel paradigm named Visual-Linguistic Face Forgery Detection (VLFFD), which uses fine-grained sentence-level prompts as the annotation.
arXiv Detail & Related papers (2023-07-31T10:22:33Z) - Language-Guided 3D Object Detection in Point Cloud for Autonomous Driving [91.91552963872596]
We propose a new multi-modal visual grounding task, termed LiDAR Grounding.
It jointly learns the LiDAR-based object detector with the language features and predicts the targeted region directly from the detector.
Our work offers a deeper insight into the LiDAR-based grounding task and we expect it presents a promising direction for the autonomous driving community.
arXiv Detail & Related papers (2023-05-25T06:22:10Z) - Bayesian Deep Learning for Segmentation for Autonomous Safe Planetary Landing [7.1581738936972]
This paper proposes an application of the Bayesian deep-learning segmentation method for hazard detection. It simultaneously generates a safety prediction map and its uncertainty map via Bayesian deep learning and semantic segmentation. Experiments are presented with simulated data based on a Mars HiRISE digital terrain model.
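A safety map paired with an uncertainty map is the natural output of Monte Carlo sampling over a Bayesian segmentation network. The generic MC-dropout sketch below illustrates that idea; the `stochastic_forward` network is assumed, not reproduced from the paper.

```python
import numpy as np

def mc_dropout_maps(stochastic_forward, image, n_samples=20):
    """Run an assumed segmentation network `n_samples` times with dropout kept
    active; the per-pixel mean is the safety prediction map and the per-pixel
    variance serves as its uncertainty map."""
    samples = np.stack([stochastic_forward(image) for _ in range(n_samples)])
    return samples.mean(axis=0), samples.var(axis=0)
```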
arXiv Detail & Related papers (2021-02-21T08:13:49Z) - MRDet: A Multi-Head Network for Accurate Oriented Object Detection in Aerial Images [51.227489316673484]
We propose an arbitrary-oriented region proposal network (AO-RPN) to generate oriented proposals transformed from horizontal anchors.
To obtain accurate bounding boxes, we decouple the detection task into multiple subtasks and propose a multi-head network.
Each head is specially designed to learn the features optimal for the corresponding task, which allows our network to detect objects accurately.
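The decoupled multi-head idea can be pictured as one shared RoI feature feeding task-specific branches. The PyTorch module below is an illustrative sketch of that layout, not the authors' MRDet implementation; the dimensions and the particular set of heads are assumptions.

```python
import torch.nn as nn

class DecoupledHeads(nn.Module):
    """Illustrative multi-head design: separate branches for classification,
    horizontal box regression, and orientation, all fed by one RoI feature."""
    def __init__(self, in_dim=256, num_classes=15):
        super().__init__()
        def branch(out_dim):
            return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(inplace=True),
                                 nn.Linear(256, out_dim))
        self.cls_head = branch(num_classes)  # category logits
        self.box_head = branch(4)            # horizontal box deltas
        self.angle_head = branch(1)          # orientation offset

    def forward(self, roi_feat):  # roi_feat: [num_rois, in_dim]
        return self.cls_head(roi_feat), self.box_head(roi_feat), self.angle_head(roi_feat)
```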
arXiv Detail & Related papers (2020-12-24T06:36:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.