Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation
- URL: http://arxiv.org/abs/2511.08258v1
- Date: Wed, 12 Nov 2025 01:49:25 GMT
- Title: Top2Ground: A Height-Aware Dual Conditioning Diffusion Model for Robust Aerial-to-Ground View Generation
- Authors: Jae Joong Lee, Bedrich Benes
- Abstract summary: Top2Ground is a novel diffusion-based method that directly generates ground-view images from aerial input images. We condition the denoising process on a joint representation of VAE-encoded spatial features and CLIP-based semantic embeddings. Top2Ground robustly handles both wide and narrow fields of view, highlighting its strong generalization capabilities.
- Score: 14.377332218510743
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating ground-level images from aerial views is a challenging task due to extreme viewpoint disparity, occlusions, and a limited field of view. We introduce Top2Ground, a novel diffusion-based method that directly generates photorealistic ground-view images from aerial input images without relying on intermediate representations such as depth maps or 3D voxels. Specifically, we condition the denoising process on a joint representation of VAE-encoded spatial features (derived from aerial RGB images and an estimated height map) and CLIP-based semantic embeddings. This design ensures the generation is both geometrically constrained by the scene's 3D structure and semantically consistent with its content. We evaluate Top2Ground on three diverse datasets: CVUSA, CVACT, and the Auto Arborist dataset. Our approach yields a 7.3% average improvement in SSIM across the three benchmarks, showing that Top2Ground robustly handles both wide and narrow fields of view and generalizes well.
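The dual-conditioning idea from the abstract can be sketched in a few lines. The following is a minimal, illustrative sketch only, not the authors' released code: `vae_encoder`, `clip_image_encoder`, and `denoiser` are hypothetical stand-ins, and it assumes the estimated height map is concatenated with the aerial RGB channels before VAE encoding while the CLIP embedding is injected as cross-attention context.

```python
import torch

def make_conditioning(aerial_rgb, height_map, vae_encoder, clip_image_encoder):
    """Build the joint spatial + semantic conditioning described in the abstract.

    aerial_rgb:  (B, 3, H, W) aerial image
    height_map:  (B, 1, H, W) estimated height map
    All callables are hypothetical stand-ins, not the paper's components.
    """
    spatial_in = torch.cat([aerial_rgb, height_map], dim=1)  # (B, 4, H, W)
    spatial_feats = vae_encoder(spatial_in)        # geometric conditioning (latents)
    semantic_emb = clip_image_encoder(aerial_rgb)  # semantic conditioning (CLIP)
    return spatial_feats, semantic_emb

def denoise_step(noisy_latent, t, conditioning, denoiser):
    """One noise-prediction step conditioned on both streams."""
    spatial_feats, semantic_emb = conditioning
    # One plausible wiring: concatenate the spatial latents channel-wise and
    # pass the CLIP embedding as cross-attention context (diffusers-style).
    x = torch.cat([noisy_latent, spatial_feats], dim=1)
    return denoiser(x, t, encoder_hidden_states=semantic_emb)
```

The two streams play complementary roles: the VAE latents pin the layout to the scene's 3D structure, while the CLIP embedding keeps the generated content semantically consistent with the aerial view.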
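For context on the reported metric, SSIM comparisons of this kind are typically computed per image pair and averaged over the test set. A minimal example with scikit-image, illustrative only and not the paper's evaluation pipeline:

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_ssim(generated, references):
    """Average SSIM over paired uint8 RGB images of equal shape."""
    scores = [
        structural_similarity(g, r, channel_axis=2, data_range=255)
        for g, r in zip(generated, references)
    ]
    return float(np.mean(scores))

# Random stand-in images; in practice, generated vs. ground-truth panoramas.
gen = [np.random.randint(0, 256, (128, 512, 3), dtype=np.uint8) for _ in range(4)]
ref = [np.random.randint(0, 256, (128, 512, 3), dtype=np.uint8) for _ in range(4)]
print(f"mean SSIM: {mean_ssim(gen, ref):.4f}")
```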
Related papers
- Splat-SAP: Feed-Forward Gaussian Splatting for Human-Centered Scene with Scale-Aware Point Map Reconstruction [39.835146541795986]
We present Splat-SAP, a feed-forward approach to render views of human-centered scenes from binocular cameras with large sparsity. We leverage pixel-wise point map reconstruction to represent geometry, which is robust to large sparsity due to its independent view modeling.
arXiv Detail & Related papers (2025-11-27T18:58:54Z)
- HD$^2$-SSC: High-Dimension High-Density Semantic Scene Completion for Autonomous Driving [52.959716866316604]
Camera-based 3D semantic scene completion (SSC) plays a crucial role in autonomous driving. Existing SSC methods suffer from an inherent input-output dimension gap and an annotation-reality density gap. We propose a High-Dimension High-Density Semantic Scene Completion framework with expanded pixel semantics and refined voxel occupancies.
arXiv Detail & Related papers (2025-11-11T07:24:35Z)
- Aerial-Ground Image Feature Matching via 3D Gaussian Splatting-based Intermediate View Rendering [7.454339483033969]
The integration of aerial and ground images has been a promising solution in 3D modeling of complex scenes. The primary contribution of this study is a feature matching algorithm for aerial and ground images.
arXiv Detail & Related papers (2025-09-24T08:50:13Z)
- FG$^2$: Fine-Grained Cross-View Localization by Fine-Grained Feature Matching [69.81167130510333]
We propose a novel fine-grained cross-view localization method that estimates the 3-DoF pose of a ground-level image within an aerial image of the surroundings. The pose is estimated by aligning a point plane generated from the ground image with a point plane sampled from the aerial image. Compared to the previous state of the art, our method reduces the mean localization error by 28% on the VIGOR cross-area test set.
arXiv Detail & Related papers (2025-03-24T14:34:20Z)
- Leveraging BEV Paradigm for Ground-to-Aerial Image Synthesis [39.43518544801439]
Ground-to-aerial image synthesis focuses on generating realistic aerial images from corresponding ground street-view images. We introduce SkyDiffusion, a novel cross-view generation method for synthesizing aerial images from street-view images. We also introduce Ground2Aerial-3, a novel dataset designed for diverse ground-to-aerial image synthesis applications.
arXiv Detail & Related papers (2024-08-03T15:43:56Z)
- GEOcc: Geometrically Enhanced 3D Occupancy Network with Implicit-Explicit Depth Fusion and Contextual Self-Supervision [49.839374549646884]
This paper presents GEOcc, a geometrically enhanced occupancy network tailored for vision-only surround-view perception. Our approach achieves state-of-the-art performance on the Occ3D-nuScenes dataset with the lowest required image resolution and the lightest image backbone.
arXiv Detail & Related papers (2024-05-17T07:31:20Z)
- Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery [51.73680703579997]
We present a neural radiance field method for urban-scale semantic and building-level instance segmentation from aerial images.
However, objects in urban aerial images exhibit substantial variations in size, including buildings, cars, and roads.
We introduce a scale-adaptive semantic label fusion strategy that enhances the segmentation of objects of varying sizes.
We then introduce a novel cross-view instance label grouping strategy to mitigate the multi-view inconsistency problem in the 2D instance labels.
arXiv Detail & Related papers (2024-03-18T14:15:39Z)
- Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion [77.34078223594686]
We propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques.
Specifically, our approach generates texture colors at the point level for a given geometry using a 3D diffusion model first, which is then transformed into a scene representation in a feed-forward manner.
Experiments on two city-scale datasets show that our model generates photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery.
arXiv Detail & Related papers (2024-01-19T16:15:37Z)
- Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of synthesizing immersive 3D indoor scenes from one or more images.
Our aim is to generate high-resolution images and videos from novel viewpoints.
We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
arXiv Detail & Related papers (2022-04-06T17:54:46Z)