Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization
- URL: http://arxiv.org/abs/2411.13610v1
- Date: Wed, 20 Nov 2024 01:52:49 GMT
- Title: Video2BEV: Transforming Drone Videos to BEVs for Video-based Geo-localization
- Authors: Hao Ju, Zhedong Zheng
- Abstract summary: We formulate a new video-based drone geo-localization task and propose the Video2BEV paradigm.
This paradigm transforms the video into a Bird's Eye View (BEV), simplifying the subsequent matching process.
To validate our approach, we introduce UniV, a new video-based geo-localization dataset.
- Score: 19.170572975810497
- Abstract: Existing approaches to drone visual geo-localization predominantly adopt the image-based setting, where a single drone-view snapshot is matched with images from other platforms. Such task formulation, however, underutilizes the inherent video output of the drone and is sensitive to occlusions and environmental constraints. To address these limitations, we formulate a new video-based drone geo-localization task and propose the Video2BEV paradigm. This paradigm transforms the video into a Bird's Eye View (BEV), simplifying the subsequent matching process. In particular, we employ Gaussian Splatting to reconstruct a 3D scene and obtain the BEV projection. Different from the existing transform methods, e.g., polar transform, our BEVs preserve more fine-grained details without significant distortion. To further improve model scalability toward diverse BEVs and satellite figures, our Video2BEV paradigm also incorporates a diffusion-based module for generating hard negative samples, which facilitates discriminative feature learning. To validate our approach, we introduce UniV, a new video-based geo-localization dataset that extends the image-based University-1652 dataset. UniV features flight paths at $30^\circ$ and $45^\circ$ elevation angles with increased frame rates of up to 10 frames per second (FPS). Extensive experiments on the UniV dataset show that our Video2BEV paradigm achieves competitive recall rates and outperforms conventional video-based methods. Compared to other methods, our proposed approach exhibits robustness at lower elevations with more occlusions.
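As a rough illustration (not the authors' implementation) of how a BEV image can be obtained from a reconstructed 3D scene, the sketch below assumes the scene is available as a set of colored 3D points, e.g. Gaussian Splatting centers, and renders a top-down orthographic view by binning points onto a ground-plane grid and keeping the highest point per cell; the function name, grid size, and extent are hypothetical.

```python
# Illustrative sketch only: orthographic top-down (BEV) projection of a
# reconstructed scene stored as colored 3D points (e.g., Gaussian Splatting
# centers). The paper's actual BEV rendering pipeline may differ.
import numpy as np

def project_to_bev(points, colors, grid_size=512, extent=100.0):
    """points: (N, 3) xyz in metres with z up; colors: (N, 3) RGB in [0, 1].

    Returns a (grid_size, grid_size, 3) BEV image where each ground-plane
    cell keeps the color of its highest point (a simple z-buffer).
    """
    # Map x/y coordinates into integer grid cells covering [-extent/2, extent/2].
    cells = ((points[:, :2] + extent / 2) / extent * grid_size).astype(int)
    valid = np.all((cells >= 0) & (cells < grid_size), axis=1)
    cells, z, rgb = cells[valid], points[valid, 2], colors[valid]

    bev = np.zeros((grid_size, grid_size, 3), dtype=np.float32)
    height = np.full((grid_size, grid_size), -np.inf, dtype=np.float32)

    # Visit points from lowest to highest so higher points overwrite lower ones.
    for i in np.argsort(z):
        u, v = cells[i]
        if z[i] > height[v, u]:
            height[v, u] = z[i]
            bev[v, u] = rgb[i]
    return bev
```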
Related papers
- Robust Bird's Eye View Segmentation by Adapting DINOv2 [3.236198583140341]
We adapt a vision foundational model, DINOv2, to BEV estimation using Low Rank Adaptation (LoRA)
Our experiments show increased robustness of BEV perception under various corruptions.
We also showcase the effectiveness of the adapted representations in terms of fewer learnable parameters and faster convergence during training.
arXiv Detail & Related papers (2024-09-16T12:23:35Z) - GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers [53.80009458891537]
Cross-view video geo-localization aims to derive GPS trajectories from street-view videos by aligning them with aerial-view images.
Current CVGL methods use camera and odometry data, typically absent in real-world scenarios.
We propose GAReT, a fully transformer-based method for CVGL that does not require camera and odometry data.
arXiv Detail & Related papers (2024-08-05T21:29:33Z) - SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix [60.48666051245761]
We propose a pose-free and training-free approach for generating 3D stereoscopic videos.
Our method warps a generated monocular video into camera views on stereoscopic baseline using estimated video depth.
We develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting.
arXiv Detail & Related papers (2024-06-29T08:33:55Z) - GenDeF: Learning Generative Deformation Field for Video Generation [89.49567113452396]
We propose to render a video by warping one static image with a generative deformation field (GenDeF).
Such a pipeline enjoys three appealing advantages.
arXiv Detail & Related papers (2023-12-07T18:59:41Z) - FB-BEV: BEV Representation from Forward-Backward View Transformations [131.11787050205697]
We propose a novel View Transformation Module (VTM) for Bird-Eye-View (BEV) representation.
We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set.
arXiv Detail & Related papers (2023-08-04T10:26:55Z) - BEVControl: Accurately Controlling Street-view Elements with Multi-perspective Consistency via BEV Sketch Layout [17.389444754562252]
We propose a two-stage generative method, dubbed BEVControl, that can generate accurate foreground and background contents.
Our experiments show that our BEVControl surpasses the state-of-the-art method, BEVGen, by a significant margin.
arXiv Detail & Related papers (2023-08-03T09:56:31Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration [20.733451121484993]
We tackle a new problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration.
This is a very challenging problem since its only input is several RGB images from different first-person views (FPVs) for a multi-person scene.
We propose an end-to-end framework to solve this problem, whose main idea can be divided into the following parts.
arXiv Detail & Related papers (2022-12-19T08:31:08Z) - Vision-based Uneven BEV Representation Learning with Polar Rasterization and Surface Estimation [42.071461405587264]
We propose PolarBEV for vision-based uneven BEV representation learning.
PolarBEV keeps real-time inference speed on a single 2080Ti GPU.
arXiv Detail & Related papers (2022-07-05T08:20:36Z) - VideoGPT: Video Generation using VQ-VAE and Transformers [75.20543171520565]
VideoGPT is a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos.
VideoGPT uses VQ-VAE, which learns downsampled discrete latent representations by employing 3D convolutions and axial self-attention.
Our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset.
arXiv Detail & Related papers (2021-04-20T17:58:03Z)
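As a loose illustration of the vector-quantization step mentioned in the VideoGPT entry above, the following sketch snaps continuous encoder features to their nearest codebook entries with a straight-through gradient; the tensor shapes and function name are assumptions, not the paper's code.

```python
# Minimal vector-quantization sketch (VQ-VAE style), for illustration only.
import torch

def vector_quantize(latents, codebook):
    """latents: (B, T, H, W, D) continuous encoder outputs.
    codebook: (K, D) learnable embedding vectors.
    Returns quantized latents and the chosen code indices.
    """
    flat = latents.reshape(-1, latents.shape[-1])             # (N, D)
    # Euclidean distance from every latent vector to every codebook entry.
    dists = torch.cdist(flat, codebook)                        # (N, K)
    indices = dists.argmin(dim=1)                               # (N,)
    quantized = codebook[indices].reshape(latents.shape)        # snap to codes
    # Straight-through estimator so gradients still flow to the encoder.
    quantized = latents + (quantized - latents).detach()
    return quantized, indices.reshape(latents.shape[:-1])
```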