WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments
- URL: http://arxiv.org/abs/2603.01475v1
- Date: Mon, 02 Mar 2026 05:39:12 GMT
- Title: WildCross: A Cross-Modal Large Scale Benchmark for Place Recognition and Metric Depth Estimation in Natural Environments
- Authors: Joshua Knights, Joseph Reid, Kaushik Roy, David Hall, Mark Cox, Peyman Moghadam,
- Abstract summary: WildCross is a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments.<n>We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation.
- Score: 11.037873142796682
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent years have seen a significant increase in demand for robotic solutions in unstructured natural environments, alongside growing interest in bridging 2D and 3D scene understanding. However, existing robotics datasets are predominantly captured in structured urban environments, making them inadequate for addressing the challenges posed by complex, unstructured natural settings. To address this gap, we propose WildCross, a cross-modal benchmark for place recognition and metric depth estimation in large-scale natural environments. WildCross comprises over 476K sequential RGB frames with semi-dense depth and surface normal annotations, each aligned with accurate 6DoF poses and synchronized dense lidar submaps. We conduct comprehensive experiments on visual, lidar, and cross-modal place recognition, as well as metric depth estimation, demonstrating the value of WildCross as a challenging benchmark for multi-modal robotic perception tasks. We provide access to the code repository and dataset at https://csiro-robotics.github.io/WildCross.
Related papers
- Seeing the Unseen: Mask-Driven Positional Encoding and Strip-Convolution Context Modeling for Cross-View Object Geo-Localization [8.559240391514063]
Cross-view object geo-localization enables high-precision object localization through cross-view matching.<n>Existing methods rely on keypoint-based positional encoding, which captures only 2D coordinates while neglecting object shape information.<n>We propose a mask-based positional encoding scheme that leverages segmentation masks to capture both spatial coordinates and object silhouettes.<n>We present EDGeo, an end-to-end framework for robust cross-view object geo-localization.
arXiv Detail & Related papers (2025-10-23T06:07:07Z) - ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving [62.9051914830949]
We present ROVR, a large-scale, diverse, and cost-efficient depth dataset designed to capture the complexity of real-world driving.<n>A lightweight acquisition pipeline ensures scalable collection, while sparse but statistically sufficient ground truth supports robust training.<n> Benchmarking with state-of-the-art monocular depth models reveals severe cross-dataset generalization failures.
arXiv Detail & Related papers (2025-08-19T16:13:49Z) - HOTFormerLoc: Hierarchical Octree Transformer for Versatile Lidar Place Recognition Across Ground and Aerial Views [30.77381516091565]
We present HOTFormerLoc, a novel and versatile Hierarchical Octree-based TransFormer, for large-scale 3D place recognition.<n>We propose an octree-based multi-scale attention mechanism that captures spatial and semantic features across granularities.<n>We introduce CS-Wild-Places, a novel 3D cross-source dataset featuring point cloud data from aerial and ground lidar scans captured in dense forests.
arXiv Detail & Related papers (2025-03-11T07:59:45Z) - OpenEarthSensing: Large-Scale Fine-Grained Benchmark for Open-World Remote Sensing [57.050679160659705]
We introduce textbfOpenEarthSensing (OES), a large-scale fine-grained benchmark for open-world remote sensing.<n>OES includes 189 scene and object categories, covering the vast majority of potential semantic shifts that may occur in the real world.
arXiv Detail & Related papers (2025-02-28T02:49:52Z) - PFSD: A Multi-Modal Pedestrian-Focus Scene Dataset for Rich Tasks in Semi-Structured Environments [73.80718037070773]
We present the multi-modal Pedestrian-Focused Scene dataset, rigorously annotated in semi-structured scenes with the format of nuScenes.<n>We also propose a novel Hybrid Multi-Scale Fusion Network (HMFN) to detect pedestrians in densely populated and occluded scenarios.
arXiv Detail & Related papers (2025-02-21T09:57:53Z) - WildScenes: A Benchmark for 2D and 3D Semantic Segmentation in Large-scale Natural Environments [33.25040383298019]
$WildScenes$ is a bi-modal benchmark dataset consisting of high-resolution 2D images and dense 3D LiDAR point clouds.
The data is trajectory-centric with accurate localization and globally aligned point clouds.
Our 3D semantic labels are obtained via an efficient, automated process that transfers the human-annotated 2D labels from multiple views into 3D point cloud sequences.
arXiv Detail & Related papers (2023-12-23T22:27:40Z) - VoxelKP: A Voxel-based Network Architecture for Human Keypoint
Estimation in LiDAR Data [53.638818890966036]
textitVoxelKP is a novel fully sparse network architecture tailored for human keypoint estimation in LiDAR data.
We introduce sparse box-attention to focus on learning spatial correlations between keypoints within each human instance.
We incorporate a spatial encoding to leverage absolute 3D coordinates when projecting 3D voxels to a 2D grid encoding a bird's eye view.
arXiv Detail & Related papers (2023-12-11T23:50:14Z) - Unsupervised Spike Depth Estimation via Cross-modality Cross-domain Knowledge Transfer [53.413305467674434]
We introduce open-source RGB data to support spike depth estimation, leveraging its annotations and spatial information.
We propose a cross-modality cross-domain (BiCross) framework to realize unsupervised spike depth estimation.
Our method achieves state-of-the-art (SOTA) performances, compared with RGB-oriented unsupervised depth estimation methods.
arXiv Detail & Related papers (2022-08-26T09:35:20Z) - Semi-Perspective Decoupled Heatmaps for 3D Robot Pose Estimation from
Depth Maps [66.24554680709417]
Knowing the exact 3D location of workers and robots in a collaborative environment enables several real applications.
We propose a non-invasive framework based on depth devices and deep neural networks to estimate the 3D pose of robots from an external camera.
arXiv Detail & Related papers (2022-07-06T08:52:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.