PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation
- URL: http://arxiv.org/abs/2406.19632v2
- Date: Thu, 11 Jul 2024 10:15:27 GMT
- Title: PPTFormer: Pseudo Multi-Perspective Transformer for UAV Segmentation
- Authors: Deyi Ji, Wenwei Jin, Hongtao Lu, Feng Zhao
- Abstract summary: We introduce the PPTFormer, a novel Pseudo Multi-Perspective Transformer network.
Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning.
- Score: 18.585299793391748
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The ascension of Unmanned Aerial Vehicles (UAVs) in various fields necessitates effective UAV image segmentation, which faces challenges due to the dynamic perspectives of UAV-captured images. Traditional segmentation algorithms falter as they cannot accurately mimic the complexity of UAV perspectives, and the cost of obtaining multi-perspective labeled datasets is prohibitive. To address these issues, we introduce the PPTFormer, a novel Pseudo Multi-Perspective Transformer network that revolutionizes UAV image segmentation. Our approach circumvents the need for actual multi-perspective data by creating pseudo perspectives for enhanced multi-perspective learning. The PPTFormer network boasts Perspective Representation, novel Perspective Prototypes, and a specialized encoder and decoder that together achieve superior segmentation results through Pseudo Multi-Perspective Attention (PMP Attention) and fusion. Our experiments demonstrate that PPTFormer achieves state-of-the-art performance across five UAV segmentation datasets, confirming its capability to effectively simulate UAV flight perspectives and significantly advance segmentation precision. This work presents a pioneering leap in UAV scene understanding and sets a new benchmark for future developments in semantic segmentation.
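The abstract gives only a high-level picture of PMP Attention. As a loose conceptual sketch, and not the authors' released code, pseudo perspectives could be generated by randomly warping a feature map and then fused with attention across the warped views; every name and design choice below (make_pseudo_perspectives, PMPAttention, the affine jitter) is an assumption for illustration.

```python
# Conceptual sketch only: pseudo perspectives via random affine warps of a
# feature map, fused with cross-perspective attention. Not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_pseudo_perspectives(feat, n_views=4, max_scale=0.2):
    """Warp a feature map (B, C, H, W) with random affine grids to mimic
    UAV viewpoint changes. Hypothetical helper."""
    views = [feat]
    B = feat.size(0)
    for _ in range(n_views - 1):
        theta = torch.eye(2, 3).repeat(B, 1, 1)
        theta += max_scale * (torch.rand(B, 2, 3) - 0.5)  # jitter the transform
        grid = F.affine_grid(theta, feat.shape, align_corners=False)
        views.append(F.grid_sample(feat, grid, align_corners=False))
    return torch.stack(views, dim=1)  # (B, V, C, H, W)

class PMPAttention(nn.Module):
    """Attend across pseudo perspectives at each spatial location (assumed design)."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, views):                          # (B, V, C, H, W)
        B, V, C, H, W = views.shape
        tokens = views.permute(0, 3, 4, 1, 2).reshape(B * H * W, V, C)
        fused, _ = self.attn(tokens, tokens, tokens)   # cross-view attention
        fused = fused.mean(dim=1)                      # fuse the V perspectives
        return fused.view(B, H, W, C).permute(0, 3, 1, 2)

feat = torch.randn(2, 64, 32, 32)
out = PMPAttention(64)(make_pseudo_perspectives(feat))
print(out.shape)  # torch.Size([2, 64, 32, 32])
```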
Related papers
- Semantic Segmentation of Unmanned Aerial Vehicle Remote Sensing Images using SegFormer [0.14999444543328289]
This paper evaluates the effectiveness and efficiency of SegFormer, a semantic segmentation framework, for the semantic segmentation of UAV images.
SegFormer variants, ranging from the real-time B0 model to the high-performance B5 model, are assessed on the UAVid dataset, which is tailored for semantic segmentation tasks.
Experimental results showcase the models' performance on this benchmark dataset, highlighting their ability to accurately delineate objects and land cover features in diverse UAV scenarios.
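For reference, SegFormer is available through the Hugging Face transformers API. The minimal inference sketch below uses a publicly released ADE20K checkpoint for illustration; it is not the UAVid-tuned weights evaluated in the paper.

```python
# Minimal SegFormer inference sketch using Hugging Face transformers.
# The ADE20K checkpoint is illustrative; the paper evaluates UAVid-tuned
# B0-B5 variants, whose weights are not assumed here.
from PIL import Image
import torch
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

name = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(name)
model = SegformerForSemanticSegmentation.from_pretrained(name)

image = Image.open("uav_frame.jpg").convert("RGB")   # any RGB aerial image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                  # (1, num_classes, H/4, W/4)

# Upsample to the input resolution and take the per-pixel argmax.
mask = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False
).argmax(dim=1)
print(mask.shape)
```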
arXiv Detail & Related papers (2024-10-01T21:40:15Z)
- UAV (Unmanned Aerial Vehicles): Diverse Applications of UAV Datasets in Segmentation, Classification, Detection, and Tracking [0.0]
Unmanned Aerial Vehicles (UAVs) have revolutionized the process of gathering and analyzing data in diverse research domains.
UAV datasets consist of various types of data, such as satellite imagery, images captured by drones, and videos.
These datasets play a crucial role in disaster damage assessment, aerial surveillance, object recognition, and tracking.
arXiv Detail & Related papers (2024-09-05T04:47:36Z)
- UCDNet: Multi-UAV Collaborative 3D Object Detection Network by Reliable Feature Mapping [14.401624713578737]
Multi-UAV collaborative 3D object detection can perceive and comprehend complex environments.
We propose an unparalleled camera-based multi-UAV collaborative 3D object detection paradigm called UCDNet.
Experiments show the method improves mAP by 4.7% and 10%, respectively, on its two evaluation benchmarks compared to the baseline.
arXiv Detail & Related papers (2024-06-07T05:27:32Z)
- View-Centric Multi-Object Tracking with Homographic Matching in Moving UAV [43.37259596065606]
We address the challenge of multi-object tracking (MOT) in moving Unmanned Aerial Vehicle (UAV) scenarios.
Changes in the scene background not only render traditional frame-to-frame object IOU association methods ineffective but also introduce significant view shifts in the objects.
We propose a novel universal HomView-MOT framework, which for the first time harnesses the view Homography inherent in changing scenes to solve MOT challenges.
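The core step this summary describes, compensating camera motion with a scene homography before association, can be sketched with standard OpenCV calls. This is an illustration of the idea, not the HomView-MOT implementation.

```python
# Sketch: compensate UAV camera motion with a frame-to-frame homography before
# IoU association. Illustrative only, not the HomView-MOT code.
import cv2
import numpy as np

def estimate_homography(prev_gray, curr_gray):
    """Estimate the scene homography between two grayscale frames from ORB matches."""
    orb = cv2.ORB_create(1000)
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True).match(des1, des2)
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H

def warp_boxes(boxes, H):
    """Map (N, 4) xyxy boxes from the previous frame into the current one,
    so that plain IoU association works despite the moving camera."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    corners = np.stack([np.stack([x1, y1], 1), np.stack([x2, y1], 1),
                        np.stack([x2, y2], 1), np.stack([x1, y2], 1)], axis=1)
    warped = cv2.perspectiveTransform(
        corners.reshape(-1, 1, 2).astype(np.float32), H).reshape(-1, 4, 2)
    return np.concatenate([warped.min(axis=1), warped.max(axis=1)], axis=1)
```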
arXiv Detail & Related papers (2024-03-16T06:48:33Z)
- Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [83.85856356798531]
VistaLLM is a visual system that addresses coarse- and fine-grained vision-language tasks.
It employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences.
We also introduce a novel task, AttCoSeg, which boosts the model's reasoning and grounding capability over multiple input images.
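A rough approximation of the mask-as-sequence idea is sketched below: contour points are sampled more densely where the boundary bends most. VistaLLM's actual gradient-aware sampler is not reproduced here; the weighting scheme is an assumption.

```python
# Sketch of turning a binary mask into a fixed-length point sequence, sampling
# contour points adaptively where the boundary turns sharply. Approximates the
# adaptive-sampling idea; it is not VistaLLM's algorithm.
import cv2
import numpy as np

def mask_to_sequence(mask, n_points=32):
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    pts = max(contours, key=cv2.contourArea).squeeze(1).astype(np.float32)
    # Curvature proxy: change in direction between successive boundary points.
    d = np.gradient(pts, axis=0)
    angle = np.unwrap(np.arctan2(d[:, 1], d[:, 0]))
    turn = np.abs(np.diff(angle, append=angle[:1]))
    weights = turn + 1e-3                       # keep flat segments sampled too
    cdf = np.cumsum(weights) / weights.sum()
    idx = np.searchsorted(cdf, np.linspace(0, 1, n_points, endpoint=False))
    return pts[np.clip(idx, 0, len(pts) - 1)]   # (n_points, 2) xy sequence
```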
arXiv Detail & Related papers (2023-12-19T18:53:01Z)
- Joint Depth Prediction and Semantic Segmentation with Multi-View SAM [59.99496827912684]
We propose a Multi-View Stereo (MVS) technique for depth prediction that benefits from the rich semantic features of the Segment Anything Model (SAM).
This enhanced depth prediction, in turn, serves as a prompt to our Transformer-based semantic segmentation decoder.
arXiv Detail & Related papers (2023-10-31T20:15:40Z)
- CROVIA: Seeing Drone Scenes from Car Perspective via Cross-View Adaptation [20.476683921252867]
We propose a novel Cross-View Adaptation (CROVIA) approach to adapt the knowledge learned from on-road vehicle views to UAV views.
First, a novel geometry-based constraint to cross-view adaptation is introduced based on the geometry correlation between views.
Second, cross-view correlations from image space are effectively transferred to segmentation space without any requirement of paired on-road and UAV view data.
arXiv Detail & Related papers (2023-04-14T15:20:40Z)
- Rethinking Range View Representation for LiDAR Segmentation [66.73116059734788]
"Many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections.
We present RangeFormer, a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing.
We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks.
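For context, range view methods operate on a spherical projection of the point cloud. The standard projection below uses typical 64-beam sensor parameters, assumed for illustration, and also shows where the "many-to-one" mapping mentioned above arises.

```python
# Standard spherical ("range view") projection of a LiDAR point cloud into a
# 2D range image, the input representation range view methods build on. The
# field of view and image size are typical 64-beam values, not RangeFormer's.
import numpy as np

def range_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points[:, :3], axis=1)            # range per point
    yaw = np.arctan2(y, x)                               # azimuth in [-pi, pi]
    pitch = np.arcsin(z / np.maximum(r, 1e-8))           # elevation
    fov_up, fov_down = np.radians(fov_up), np.radians(fov_down)
    u = 0.5 * (1.0 - yaw / np.pi) * W                    # column from azimuth
    v = (1.0 - (pitch - fov_down) / (fov_up - fov_down)) * H  # row from elevation
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)
    image = np.full((H, W), -1.0, dtype=np.float32)
    order = np.argsort(-r)                 # draw far points first, near last
    image[v[order], u[order]] = r[order]   # nearer points overwrite: the
    return image                           # "many-to-one" mapping in action
```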
arXiv Detail & Related papers (2023-03-09T16:13:27Z)
- BEVerse: Unified Perception and Prediction in Birds-Eye-View for Vision-Centric Autonomous Driving [92.05963633802979]
We present BEVerse, a unified framework for 3D perception and prediction based on multi-camera systems.
We show that the multi-task BEVerse outperforms single-task methods on 3D object detection, semantic map construction, and motion prediction.
arXiv Detail & Related papers (2022-05-19T17:55:35Z)
- An Empirical Study of Training End-to-End Vision-and-Language Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
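The merged-vs-co-attention distinction is concrete enough to sketch schematically. The snippet below is a single layer, heavily simplified, and not METER's implementation.

```python
# Schematic contrast of the two fusion styles METER compares (single layer,
# greatly simplified; not the METER code).
import torch
import torch.nn as nn

dim, heads = 256, 8
img = torch.randn(2, 196, dim)   # vision tokens
txt = torch.randn(2, 32, dim)    # text tokens

# Merged attention: concatenate modalities and run one self-attention block.
merged_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
both = torch.cat([img, txt], dim=1)
merged, _ = merged_attn(both, both, both)

# Co-attention: each modality keeps its own stream and cross-attends to the other.
img2txt = nn.MultiheadAttention(dim, heads, batch_first=True)
txt2img = nn.MultiheadAttention(dim, heads, batch_first=True)
img_out, _ = img2txt(img, txt, txt)   # image queries attend to text
txt_out, _ = txt2img(txt, img, img)   # text queries attend to images
print(merged.shape, img_out.shape, txt_out.shape)
```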
arXiv Detail & Related papers (2021-11-03T17:55:36Z)
- PnP-DETR: Towards Efficient Visual Analysis with Transformers [146.55679348493587]
Recently, DETR pioneered the solution of vision tasks with transformers; it directly translates the image feature map into object detection results.
Applied to the recent transformer-based image recognition model ViT, the approach shows consistent efficiency gains.
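The efficiency gain comes from processing only a sparse subset of feature tokens. A toy version of the "poll" sampling step, learned saliency scoring plus top-k selection, might look like the sketch below; the actual PnP module also pools the remaining tokens, which is omitted here.

```python
# Toy version of PnP-DETR's "poll" sampler: score every feature token with a
# small learned head and keep only the top-k for the transformer. A sketch of
# the idea, not the paper's implementation (the "pool" branch is omitted).
import torch
import torch.nn as nn

class PollSampler(nn.Module):
    def __init__(self, dim, keep_ratio=0.33):
        super().__init__()
        self.score = nn.Linear(dim, 1)     # learned per-token saliency
        self.keep_ratio = keep_ratio

    def forward(self, tokens):             # (B, N, C) flattened feature map
        B, N, C = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        s = self.score(tokens).squeeze(-1)            # (B, N) saliency scores
        idx = s.topk(k, dim=1).indices                # indices of salient tokens
        gathered = tokens.gather(1, idx.unsqueeze(-1).expand(B, k, C))
        # Modulate by the selected scores so the scorer receives gradient.
        return gathered * s.gather(1, idx).sigmoid().unsqueeze(-1)

tokens = torch.randn(2, 1024, 256)         # e.g. a 32x32 feature map, C=256
sparse = PollSampler(256)(tokens)
print(sparse.shape)                        # torch.Size([2, 337, 256])
```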
arXiv Detail & Related papers (2021-09-15T01:10:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.