AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognition and Retrieval
- URL: http://arxiv.org/abs/2601.01416v1
- Date: Sun, 04 Jan 2026 07:38:51 GMT
- Title: AirSpatialBot: A Spatially-Aware Aerial Agent for Fine-Grained Vehicle Attribute Recognition and Retrieval
- Authors: Yue Zhou, Ran Ding, Xue Yang, Xue Jiang, Xingzhao Liu
- Abstract summary: We introduce a spatially-aware dataset, AirSpatial, which comprises over 206K instructions. It is the first remote sensing grounding dataset to provide 3D bounding boxes (3DBBs). We develop an aerial agent, AirSpatialBot, capable of fine-grained vehicle attribute recognition and retrieval.
- Score: 25.233263762328836
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite notable advancements in remote sensing vision-language models (VLMs), existing models often struggle with spatial understanding, limiting their effectiveness in real-world applications. To push the boundaries of VLMs in remote sensing, we specifically address vehicle imagery captured by drones and introduce a spatially-aware dataset, AirSpatial, which comprises over 206K instructions and introduces two novel tasks: Spatial Grounding and Spatial Question Answering. It is also the first remote sensing grounding dataset to provide 3D bounding boxes (3DBBs). To effectively extend the image-understanding capabilities of existing VLMs to the spatial domain, we adopt a two-stage training strategy comprising Image Understanding Pre-training and Spatial Understanding Fine-tuning. Using this trained spatially-aware VLM, we develop an aerial agent, AirSpatialBot, which is capable of fine-grained vehicle attribute recognition and retrieval. By dynamically integrating task planning, image understanding, spatial understanding, and task execution capabilities, AirSpatialBot adapts to diverse query requirements. Experimental results validate the effectiveness of our approach, revealing the spatial limitations of existing VLMs while providing valuable insights. The model, code, and datasets will be released at https://github.com/VisionXLab/AirSpatialBot
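To make the agentic design concrete, below is a minimal sketch of the dispatch loop the abstract describes: a planner chains image understanding, spatial understanding, and task execution to answer a retrieval query. Every name here (the tool functions, the hard-coded chain) is an illustrative assumption, not the released AirSpatialBot API.

```python
# Illustrative sketch only -- not the released AirSpatialBot implementation.

def image_understanding(image, query):
    # Placeholder: the spatially-aware VLM detects and describes vehicles.
    return [{"id": 0, "category": "truck", "color": "red"}]

def spatial_understanding(image, detections):
    # Placeholder: predict a 3D bounding box (3DBB) per detection,
    # e.g. (x, y, z, w, l, h, yaw).
    return [{**d, "bbox3d": (0.0, 0.0, 0.0, 2.0, 5.0, 2.5, 0.0)} for d in detections]

def task_execution(query, facts):
    # Placeholder: retrieve the vehicles that satisfy the query.
    return [f for f in facts if f["category"] in query]

def airspatialbot(image, query):
    # Task planning: a real agent would let the LLM choose the tool chain
    # dynamically; the chain is fixed here for readability.
    detections = image_understanding(image, query)
    facts = spatial_understanding(image, detections)
    return task_execution(query, facts)

print(airspatialbot(image=None, query="find the red truck nearest the gate"))
```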
Related papers
- AerialMind: Towards Referring Multi-Object Tracking in UAV Scenarios [64.51320327698231]
We introduce AerialMind, the first large-scale referring multi-object tracking (RMOT) benchmark in UAV scenarios. We develop an innovative semi-automated, collaborative agent-based labeling assistant framework. We also propose HawkEyeTrack, a novel method that collaboratively enhances vision-language representation learning.
arXiv Detail & Related papers (2025-11-26T04:44:27Z)
- Space-Aware Instruction Tuning: Dataset and Benchmark for Guide Dog Robots Assisting the Visually Impaired [0.2410625015892047]
We introduce the Space-Aware Instruction Tuning (SAIT) dataset and the Space-Aware Benchmark (SA-Bench). Our data generation pipeline focuses on the virtual path to the destination in 3D space and the surroundings. We propose an evaluation protocol to assess VLM effectiveness in delivering walking guidance.
arXiv Detail & Related papers (2025-02-11T02:14:49Z)
- Open-World Drone Active Tracking with Goal-Centered Rewards [62.21394499788672]
Drone Visual Active Tracking aims to autonomously follow a target object by controlling the motion system based on visual observations. We propose DAT, the first open-world drone active air-to-ground tracking benchmark. We also propose GC-VAT, which improves drones' target-tracking performance in complex scenarios.
arXiv Detail & Related papers (2024-12-01T09:37:46Z)
- TWIST & SCOUT: Grounding Multimodal LLM-Experts by Forget-Free Tuning [54.033346088090674]
We introduce TWIST & SCOUT, a framework that equips pre-trained MLLMs with visual grounding ability. To fine-tune the model effectively, we generate a high-quality synthetic dataset we call SCOUT. This dataset provides rich supervision signals, describing a step-by-step multimodal reasoning process.
arXiv Detail & Related papers (2024-10-14T13:35:47Z)
- Exploring Spatial Representation to Enhance LLM Reasoning in Aerial Vision-Language Navigation [11.267956604072845]
Aerial Vision-and-Language Navigation (VLN) is a novel task enabling Unmanned Aerial Vehicles (UAVs) to navigate outdoor environments through natural language instructions and visual cues. We propose a training-free, zero-shot framework for aerial VLN tasks in which a large language model (LLM) serves as the agent for action prediction.
arXiv Detail & Related papers (2024-10-11T03:54:48Z)
- LLaRA: Supercharging Robot Learning Data for Vision-Language Policy [56.505551117094534]
We introduce LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as visuo-textual conversations. First, we present an automated pipeline that generates conversation-style instruction-tuning data for robots from existing behavior cloning datasets. We show that a VLM fine-tuned on a limited amount of such data can produce meaningful action decisions for robotic control.
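As a rough illustration of the conversation-style conversion this abstract describes, the sketch below turns a behavior-cloning (instruction, action) pair into an instruction/response exchange; the field names and the 2D action encoding are assumptions, not LLaRA's actual data format.

```python
# Assumed conversation schema for illustration -- not LLaRA's real format.

def bc_to_conversation(instruction: str, action_xy: tuple[float, float]) -> list[dict]:
    return [
        {"role": "user", "content": f"<image> {instruction} Where should the gripper move?"},
        {"role": "assistant", "content": f"Move to ({action_xy[0]:.2f}, {action_xy[1]:.2f})."},
    ]

print(bc_to_conversation("Pick up the red block.", (0.31, -0.08)))
```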
arXiv Detail & Related papers (2024-06-28T17:59:12Z)
- SpatialBot: Precise Spatial Understanding with Vision Language Models [12.67089704185187]
Vision Language Models (VLMs) have achieved impressive performance in 2D image understanding. However, they still struggle with spatial understanding, which is the foundation of Embodied AI. In this paper, we propose SpatialBot, which improves spatial understanding by feeding the model both RGB and depth images.
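A minimal sketch of the RGB-plus-depth input idea, assuming the metric depth map is normalized and stacked as a fourth channel before the vision encoder; the shapes and min-max normalization are illustrative assumptions, not SpatialBot's actual pipeline.

```python
import numpy as np

rgb = np.random.rand(224, 224, 3).astype(np.float32)                   # H x W x 3 in [0, 1]
depth_m = np.random.uniform(1.0, 50.0, (224, 224)).astype(np.float32)  # metric depth (m)

# Normalize depth to [0, 1] and append it as a fourth channel.
depth_norm = (depth_m - depth_m.min()) / (depth_m.max() - depth_m.min() + 1e-6)
rgbd = np.concatenate([rgb, depth_norm[..., None]], axis=-1)

print(rgbd.shape)  # (224, 224, 4) -- for a vision encoder with a 4-channel stem
```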
arXiv Detail & Related papers (2024-06-19T15:41:30Z)
- Ensuring UAV Safety: A Vision-only and Real-time Framework for Collision Avoidance Through Object Detection, Tracking, and Distance Estimation [16.671696289301625]
This paper presents a deep-learning framework that utilizes optical sensors for the detection, tracking, and distance estimation of non-cooperative aerial vehicles.
In this work, we propose a method for estimating the distance of a detected aerial object in real time using only monocular camera input.
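The abstract does not spell out the estimator, but the classical monocular baseline such systems build on is the pinhole relation Z = f * H_real / h_pixels; a toy example with made-up numbers, not the paper's learned model:

```python
def pinhole_distance(focal_px: float, real_height_m: float, bbox_height_px: float) -> float:
    """Distance (metres) to an object of known physical height, from its pixel height."""
    return focal_px * real_height_m / bbox_height_px

# A ~0.4 m tall drone spanning 25 px through a lens with a 1200 px focal length:
print(f"{pinhole_distance(1200.0, 0.4, 25.0):.1f} m")  # -> 19.2 m
```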
arXiv Detail & Related papers (2024-05-10T18:06:41Z)
- SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities [59.39858959066982]
Understanding and reasoning about spatial relationships is a fundamental capability for Visual Question Answering (VQA) and robotics.
We develop an automatic 3D spatial VQA data generation framework that scales up to 2 billion VQA examples on 10 million real-world images.
By training a VLM on such data, we significantly enhance its ability on both qualitative and quantitative spatial VQA.
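To illustrate the kind of example such a pipeline can emit, the sketch below templates a metric question/answer from two objects' 3D centres; the object names and the template are assumptions, not SpatialVLM's actual prompts.

```python
import math

def make_distance_qa(name_a, pos_a, name_b, pos_b):
    d = math.dist(pos_a, pos_b)  # Euclidean distance in metres
    return {
        "question": f"How far is the {name_a} from the {name_b}?",
        "answer": f"About {d:.1f} meters.",
    }

qa = make_distance_qa("red car", (2.0, 0.0, 5.0), "mailbox", (4.0, 0.0, 9.5))
print(qa["question"], qa["answer"])  # ... About 4.9 meters.
```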
arXiv Detail & Related papers (2024-01-22T18:01:01Z)
- Task-Oriented Image Transmission for Scene Classification in Unmanned Aerial Systems [46.64800170644672]
We propose a new aerial image transmission paradigm for the scene classification task.
A lightweight model is deployed on the front-end UAV to transmit semantic blocks with awareness of image content and channel conditions. Deep reinforcement learning is used to trade off transmission latency against classification accuracy.
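One common way to encode such a latency/accuracy tradeoff as a scalar RL reward is r = accuracy - lam * latency; the weight below is an assumed value, not the paper's.

```python
def reward(accuracy: float, latency_s: float, lam: float = 0.5) -> float:
    # Higher accuracy is rewarded; each second of latency is penalized.
    return accuracy - lam * latency_s

print(reward(accuracy=0.92, latency_s=0.30))  # 0.92 - 0.5 * 0.30 = 0.77
```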
arXiv Detail & Related papers (2021-12-21T02:44:49Z)
- Perceiving Traffic from Aerial Images [86.994032967469]
We propose an object detection method called Butterfly Detector that is tailored to detect objects in aerial images.
We evaluate our Butterfly Detector on two publicly available UAV datasets (UAVDT and VisDrone 2019) and show that it outperforms previous state-of-the-art methods while remaining real-time.
arXiv Detail & Related papers (2020-09-16T11:37:43Z)