CVD-SfM: A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes
- URL: http://arxiv.org/abs/2508.01936v1
- Date: Sun, 03 Aug 2025 22:11:48 GMT
- Title: CVD-SfM: A Cross-View Deep Front-end Structure-from-Motion System for Sparse Localization in Multi-Altitude Scenes
- Authors: Yaxuan Li, Yewei Huang, Bijay Gaudel, Hamidreza Jafarnejadsani, Brendan Englot
- Abstract summary: We present a novel multi-altitude camera pose estimation system, addressing the challenges of robust and accurate localization across varied altitudes. The system effectively handles diverse environmental conditions and viewpoint variations by integrating the cross-view transformer, deep features, and structure-from-motion.
- Score: 0.7623023317942882
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: We present a novel multi-altitude camera pose estimation system that addresses the challenges of robust and accurate localization across varied altitudes given only sparse image input. The system effectively handles diverse environmental conditions and viewpoint variations by integrating the cross-view transformer, deep features, and structure-from-motion into a unified framework. To benchmark our method and foster further research, we introduce two newly collected datasets specifically tailored for multi-altitude camera pose estimation; datasets of this nature remain rare in the current literature. The proposed framework has been validated through extensive comparative analyses on these datasets, demonstrating that our system outperforms existing solutions in both accuracy and robustness for multi-altitude sparse pose estimation. This makes it well suited for real-world robotic applications such as aerial navigation, search and rescue, and automated inspection.
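The abstract describes the architecture only at a high level: a deep feature front-end with cross-view matching feeding a structure-from-motion back-end. The Python sketch below is purely illustrative of that kind of pipeline under stated assumptions; the helper functions `extract_deep_features`, `cross_view_match`, and `run_incremental_sfm` are hypothetical placeholders, not the CVD-SfM API or the authors' implementation.

```python
# Illustrative sketch only: a generic "deep front-end + SfM back-end" pipeline.
# The three helper functions are hypothetical placeholders, not the CVD-SfM API.
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class Pose:
    R: np.ndarray  # 3x3 rotation matrix
    t: np.ndarray  # 3-vector translation


def extract_deep_features(image: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Placeholder for a learned detector/descriptor (keypoints, descriptors)."""
    raise NotImplementedError


def cross_view_match(desc_a: np.ndarray, desc_b: np.ndarray) -> np.ndarray:
    """Placeholder for a matcher robust to large viewpoint/altitude changes."""
    raise NotImplementedError


def run_incremental_sfm(keypoints: List[np.ndarray],
                        matches: Dict[Tuple[int, int], np.ndarray]) -> Dict[int, Pose]:
    """Placeholder for an SfM back-end (triangulation + bundle adjustment)."""
    raise NotImplementedError


def sparse_multi_altitude_localization(images: List[np.ndarray]) -> Dict[int, Pose]:
    """Estimate a pose per image from a sparse, multi-altitude image set."""
    # 1. Deep front-end: per-image keypoints and descriptors.
    feats = [extract_deep_features(img) for img in images]
    keypoints = [kp for kp, _ in feats]
    descriptors = [desc for _, desc in feats]

    # 2. Cross-view matching over all image pairs (the input set is sparse,
    #    so exhaustive pairing stays cheap).
    matches: Dict[Tuple[int, int], np.ndarray] = {}
    for i in range(len(images)):
        for j in range(i + 1, len(images)):
            pair_matches = cross_view_match(descriptors[i], descriptors[j])
            if len(pair_matches) > 0:
                matches[(i, j)] = pair_matches

    # 3. SfM back-end recovers camera poses (and a sparse point cloud).
    return run_incremental_sfm(keypoints, matches)
```

In the paper's framing, the placeholders would correspond to learned deep features, a cross-view transformer matcher, and the structure-from-motion stage, respectively; the exact interfaces above are assumptions made for the sketch.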
Related papers
- Multi-modal Multi-platform Person Re-Identification: Benchmark and Method [58.59888754340054]
MP-ReID is a novel dataset designed specifically for multi-modality and multi-platform ReID. This benchmark compiles data from 1,930 identities across diverse modalities, including RGB, infrared, and thermal imaging. We introduce Uni-Prompt ReID, a framework with specifically designed prompts, tailored for cross-modality and cross-platform scenarios.
arXiv Detail & Related papers (2025-03-21T12:27:49Z)
- PFSD: A Multi-Modal Pedestrian-Focus Scene Dataset for Rich Tasks in Semi-Structured Environments [73.80718037070773]
We present the multi-modal Pedestrian-Focused Scene dataset, rigorously annotated in semi-structured scenes in the nuScenes format. We also propose a novel Hybrid Multi-Scale Fusion Network (HMFN) to detect pedestrians in densely populated and occluded scenarios.
arXiv Detail & Related papers (2025-02-21T09:57:53Z)
- HV-BEV: Decoupling Horizontal and Vertical Feature Sampling for Multi-View 3D Object Detection [34.72603963887331]
The value of vision-based multi-view environmental perception systems is increasingly recognized in autonomous driving. Current state-of-the-art solutions primarily encode image features from each camera view into the BEV space through explicit or implicit depth prediction. We propose a novel approach that decouples feature sampling in the BEV grid query paradigm into separate horizontal and vertical components.
arXiv Detail & Related papers (2024-12-25T11:49:14Z)
- GVDepth: Zero-Shot Monocular Depth Estimation for Ground Vehicles based on Probabilistic Cue Fusion [7.588468985212172]
Generalizing metric monocular depth estimation presents a significant challenge due to its ill-posed nature. We propose a novel canonical representation that maintains consistency across varied camera setups. We also propose a novel architecture that adaptively and probabilistically fuses depths estimated via object size and vertical image position cues.
arXiv Detail & Related papers (2024-12-08T22:04:34Z)
- A Novel Wide-Area Multiobject Detection System with High-Probability Region Searching [8.934161308155131]
This paper presents a hybrid system that incorporates a wide-angle camera, a high-speed search camera, and a galvano-mirror.
In this system, the wide-angle camera offers panoramic images as prior information, which helps the search camera capture detailed images of the targeted objects.
arXiv Detail & Related papers (2024-05-07T18:06:40Z)
- Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving [22.58849429006898]
Current multi-view depth estimation methods, as well as single-view and multi-view fusion methods, fail when pose estimates are noisy.
We propose a single-view and multi-view fused depth estimation system that adaptively integrates high-confidence multi-view and single-view results.
Our method outperforms state-of-the-art multi-view and fusion methods under robustness testing.
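As a rough illustration of the general idea of adaptive single-/multi-view fusion (not this paper's method), a confidence-gated combination might look like the sketch below; the per-pixel maps `mv_depth`, `sv_depth`, and `mv_confidence` and the threshold are assumptions made for the example.

```python
import numpy as np


def fuse_depths(mv_depth: np.ndarray,
                sv_depth: np.ndarray,
                mv_confidence: np.ndarray,
                threshold: float = 0.5) -> np.ndarray:
    """Generic confidence-gated fusion sketch (not the paper's architecture):
    keep the multi-view estimate where its confidence is high, and fall back
    to the single-view estimate elsewhere (e.g., where poses are unreliable)."""
    use_multi_view = mv_confidence >= threshold
    return np.where(use_multi_view, mv_depth, sv_depth)
```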
arXiv Detail & Related papers (2024-03-12T11:18:35Z)
- View Consistent Purification for Accurate Cross-View Localization [59.48131378244399]
This paper proposes a fine-grained self-localization method for outdoor robotics.
The proposed method addresses limitations in existing cross-view localization methods.
It is the first sparse visual-only method that enhances perception in dynamic environments.
arXiv Detail & Related papers (2023-08-16T02:51:52Z)
- Multi-Frame Self-Supervised Depth Estimation with Multi-Scale Feature Fusion in Dynamic Scenes [25.712707161201802]
Multi-frame methods improve monocular depth estimation over single-frame approaches.
Recent methods tend to propose complex architectures for feature matching and dynamic scenes.
We show that a simple learning framework, together with designed feature augmentation, leads to superior performance.
arXiv Detail & Related papers (2023-03-26T05:26:30Z)
- 6D Camera Relocalization in Visually Ambiguous Extreme Environments [79.68352435957266]
We propose a novel method to reliably estimate the pose of a camera given a sequence of images acquired in extreme environments such as deep seas or extraterrestrial terrains.
Our method achieves performance comparable to state-of-the-art methods on the indoor benchmark (7-Scenes dataset) using only 20% of the training data.
arXiv Detail & Related papers (2022-07-13T16:40:02Z)
- Self-supervised Human Detection and Segmentation via Multi-view Consensus [116.92405645348185]
We propose a multi-camera framework in which geometric constraints are embedded in the form of multi-view consistency during training.
We show that our approach outperforms state-of-the-art self-supervised person detection and segmentation techniques on images that visually depart from those of standard benchmarks.
arXiv Detail & Related papers (2020-12-09T15:47:21Z)
- AutoPose: Searching Multi-Scale Branch Aggregation for Pose Estimation [96.29533512606078]
We present AutoPose, a novel neural architecture search (NAS) framework.
It is capable of automatically discovering multiple parallel branches of cross-scale connections towards accurate and high-resolution 2D human pose estimation.
arXiv Detail & Related papers (2020-08-16T22:27:43Z)
- Contextual Encoder-Decoder Network for Visual Saliency Prediction [42.047816176307066]
We propose an approach based on a convolutional neural network pre-trained on a large-scale image classification task.
We combine the resulting representations with global scene information for accurately predicting visual saliency.
Compared to state-of-the-art approaches, the network is based on a lightweight image classification backbone.
arXiv Detail & Related papers (2019-02-18T16:15:25Z)