A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data
- URL: http://arxiv.org/abs/2511.15312v1
- Date: Wed, 19 Nov 2025 10:22:29 GMT
- Title: A Multimodal Transformer Approach for UAV Detection and Aerial Object Recognition Using Radar, Audio, and Video Data
- Authors: Mauro Larrat, Claudomiro Sales
- Abstract summary: Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model. It integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio.
- Score: 0.3093890460224435
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unmanned aerial vehicle (UAV) detection and aerial object recognition are critical for modern surveillance and security, prompting a need for robust systems that overcome limitations of single-modality approaches. This research addresses these challenges by designing and rigorously evaluating a novel multimodal Transformer model that integrates diverse data streams: radar, visual band video (RGB), infrared (IR) video, and audio. The architecture effectively fuses distinct features from each modality, leveraging the Transformer's self-attention mechanisms to learn comprehensive, complementary, and highly discriminative representations for classification. The model demonstrated exceptional performance on an independent test set, achieving macro-averaged metrics of 0.9812 accuracy, 0.9873 recall, 0.9787 precision, 0.9826 F1-score, and 0.9954 specificity. Notably, it exhibited particularly high precision and recall in distinguishing drones from other aerial objects. Furthermore, computational analysis confirmed its efficiency, with 1.09 GFLOPs, 1.22 million parameters, and an inference speed of 41.11 FPS, highlighting its suitability for real-time applications. This study presents a significant advancement in aerial object classification, validating the efficacy of multimodal data fusion via a Transformer architecture for achieving state-of-the-art performance, thereby offering a highly accurate and resilient solution for UAV detection and monitoring in complex airspace.
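The macro-averaged figures quoted in the abstract (accuracy, recall, precision, F1-score, specificity) can be illustrated with a minimal sketch. It assumes a multi-class confusion matrix with rows as true classes and columns as predicted classes. The function name `macro_metrics`, the toy matrix, and the choice to report accuracy as overall accuracy are illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def macro_metrics(cm):
    """Macro-averaged metrics from a square confusion matrix.

    Assumption: rows are true classes, columns are predicted classes.
    """
    cm = np.asarray(cm, dtype=float)
    total = cm.sum()
    tp = np.diag(cm)              # correct predictions per class
    fp = cm.sum(axis=0) - tp      # predicted as class k, but another class
    fn = cm.sum(axis=1) - tp      # class k, but predicted as another class
    tn = total - tp - fp - fn     # everything not involving class k

    def safe_div(num, den):
        # Return 0 where the denominator is 0 instead of dividing by zero.
        return np.divide(num, den, out=np.zeros_like(num), where=den > 0)

    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    specificity = safe_div(tn, tn + fp)
    f1 = safe_div(2 * precision * recall, precision + recall)

    return {
        # Assumption: accuracy is reported as overall accuracy here.
        "accuracy": float(tp.sum() / total),
        "precision": float(precision.mean()),
        "recall": float(recall.mean()),
        "f1": float(f1.mean()),
        "specificity": float(specificity.mean()),
    }


# Hypothetical 3-class confusion matrix, for illustration only.
toy_cm = [[8, 1, 1],
          [0, 9, 1],
          [1, 0, 9]]

for name, value in macro_metrics(toy_cm).items():
    print(f"{name}: {value:.4f}")
```

Macro averaging computes each metric per class and then takes the unweighted mean, so every class counts equally regardless of how many samples it has.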
Related papers
- AUDRON: A Deep Learning Framework with Fused Acoustic Signatures for Drone Type Recognition [1.8665975431697428]
Unmanned aerial vehicles (UAVs) are increasingly used across diverse domains, including logistics, agriculture, surveillance, and defense. Acoustic sensing offers a low-cost and non-intrusive alternative to vision or radar-based detection. This study introduces AUDRON, a hybrid deep learning framework for drone sound detection.
arXiv Detail & Related papers (2025-12-23T14:55:08Z)
- A Tri-Modal Dataset and a Baseline System for Tracking Unmanned Aerial Vehicles [74.8162337823142]
MM-UAV is the first large-scale benchmark for Multi-Modal UAV Tracking. The dataset spans over 30 challenging scenarios, with 1,321 synchronised multi-modal sequences, and more than 2.8 million annotated frames. Accompanying the dataset, we provide a novel multi-modal multi-UAV tracking framework.
arXiv Detail & Related papers (2025-11-23T08:42:17Z)
- UAV Individual Identification via Distilled RF Fingerprints-Based LLM in ISAC Networks [60.16924915676577]
Unmanned aerial vehicle (UAV) individual (ID) identification is a critical security surveillance strategy in low-altitude integrated sensing and communication (ISAC) networks. We propose a novel dynamic knowledge distillation (KD)-enabled wireless radio frequency fingerprint large language model (RFF-LLM) framework for UAV ID identification. Experiment results show that the proposed framework achieves 98.38% ID identification accuracy with merely 0.15 million parameters and 2.74 ms response time.
arXiv Detail & Related papers (2025-08-18T03:14:44Z)
- SpectraSentinel: LightWeight Dual-Stream Real-Time Drone Detection, Tracking and Payload Identification [0.0903415485511869]
The proliferation of drones in civilian airspace has raised urgent security concerns. In response to the 2025 VIP Cup challenge tasks, we propose a dual-stream drone monitoring framework. Our approach deploys independent You Only Look Once v11-nano (YOLOv11n) object detectors on parallel infrared (thermal) and visible (RGB) data streams.
arXiv Detail & Related papers (2025-07-30T13:10:13Z)
- A Transformer-Based Conditional GAN with Multiple Instance Learning for UAV Signal Detection and Classification [17.586093539522327]
This paper proposes a novel framework that integrates a Transformer-based Generative Adversarial Network (GAN) with Multiple Instance Locally Explainable Learning (MILET). Experimental results show that the proposed method achieves superior accuracy of 96.5% on the DroneDetect dataset and 98.6% on the DroneRF dataset. The framework also demonstrates strong computational efficiency and robust generalization across diverse UAV platforms and flight states.
arXiv Detail & Related papers (2025-07-19T12:35:45Z)
- SlowFastVAD: Video Anomaly Detection via Integrating Simple Detector and RAG-Enhanced Vision-Language Model [52.47816604709358]
Video anomaly detection (VAD) aims to identify unexpected events in videos and has wide applications in safety-critical domains. Vision-language models (VLMs) have demonstrated strong multimodal reasoning capabilities, offering new opportunities for anomaly detection. We propose SlowFastVAD, a hybrid framework that integrates a fast anomaly detector with a slow anomaly detector.
arXiv Detail & Related papers (2025-04-14T15:30:03Z)
- Resource-Efficient Beam Prediction in mmWave Communications with Multimodal Realistic Simulation Framework [57.994965436344195]
Beamforming is a key technology in millimeter-wave (mmWave) communications that improves signal transmission by optimizing directionality and intensity. Multimodal sensing-aided beam prediction has gained significant attention, using various sensing data to predict user locations or network conditions. Despite its promising potential, the adoption of multimodal sensing-aided beam prediction is hindered by high computational complexity, high costs, and limited datasets.
arXiv Detail & Related papers (2025-04-07T15:38:25Z)
- A Multi-Sensor Fusion Approach for Rapid Orthoimage Generation in Large-Scale UAV Mapping [3.321306647655686]
A multi-sensor UAV system, integrating the Global Positioning System (GPS), Inertial Measurement Unit (IMU), 4D millimeter-wave radar and camera, can provide an effective solution to this problem. A prior-pose-optimized feature matching method is introduced to enhance matching speed and accuracy. Experiments show that our approach achieves accurate feature matching orthoimage generation in a short time.
arXiv Detail & Related papers (2025-03-03T05:55:30Z)
- Fast-COS: A Fast One-Stage Object Detector Based on Reparameterized Attention Vision Transformer for Autonomous Driving [3.617580194719686]
This paper introduces Fast-COS, a novel single-stage object detection framework crafted specifically for driving scenes. RAViT achieves 81.4% Top-1 accuracy on the ImageNet-1K dataset. It surpasses leading models in efficiency, delivering up to 75.9% faster GPU inference and 1.38 higher throughput on edge devices.
arXiv Detail & Related papers (2025-02-11T09:54:09Z)
- DiRecNetV2: A Transformer-Enhanced Network for Aerial Disaster Recognition [4.678150356894011]
The integration of Unmanned Aerial Vehicles with artificial intelligence (AI) models for aerial imagery processing in disaster assessment requires exceptional accuracy, computational efficiency, and real-time processing capabilities.
Traditionally, Convolutional Neural Networks (CNNs) demonstrate efficiency in local feature extraction but are limited in their potential for global context interpretation.
Vision Transformers (ViTs) show promise for improved global context interpretation through the use of attention mechanisms, although they remain underinvestigated in UAV-based disaster response applications.
arXiv Detail & Related papers (2024-10-17T15:25:13Z)
- Robust Semi-supervised Federated Learning for Images Automatic Recognition in Internet of Drones [57.468730437381076]
We present a Semi-supervised Federated Learning (SSFL) framework for privacy-preserving UAV image recognition.
There are significant differences in the number, features, and distribution of local data collected by UAVs using different camera modules.
We propose an aggregation rule based on the frequency of the client's participation in training, namely the FedFreq aggregation rule.
arXiv Detail & Related papers (2022-01-03T16:49:33Z)
- DAE : Discriminatory Auto-Encoder for multivariate time-series anomaly detection in air transportation [68.8204255655161]
We propose a novel anomaly detection model called Discriminatory Auto-Encoder (DAE).
It uses the baseline of a regular LSTM-based auto-encoder but with several decoders, each getting data of a specific flight phase.
Results show that the DAE achieves better results in both accuracy and speed of detection.
arXiv Detail & Related papers (2021-09-08T14:07:55Z)
- ASFD: Automatic and Scalable Face Detector [129.82350993748258]
We propose a novel Automatic and Scalable Face Detector (ASFD).
ASFD is based on a combination of neural architecture search techniques as well as a new loss design.
Our ASFD-D6 outperforms the prior strong competitors, and our lightweight ASFD-D0 runs at more than 120 FPS with Mobilenet for VGA-resolution images.
arXiv Detail & Related papers (2020-03-25T06:00:47Z)
This list is automatically generated from the titles and abstracts of the papers in this site.