Related papers: Unleashing the Power of Data Synthesis in Visual Localization

Unleashing the Power of Data Synthesis in Visual Localization

URL: http://arxiv.org/abs/2412.00138v1
Date: Thu, 28 Nov 2024 16:58:10 GMT
Title: Unleashing the Power of Data Synthesis in Visual Localization
Authors: Sihang Li, Siqi Tan, Bowen Chang, Jing Zhang, Chen Feng, Yiming Li,
Abstract summary: Methods that regress camera poses from query images have gained attention for fast inference.<n>We aim to unleash the power of data synthesis to promote the generalizability of pose regression.<n>We build a two-branch joint training pipeline, with an adversarial discriminator to bridge the syn-to-real gap.
Score: 17.159091187694884
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Visual localization, which estimates a camera's pose within a known scene, is a long-standing challenge in vision and robotics. Recent end-to-end methods that directly regress camera poses from query images have gained attention for fast inference. However, existing methods often struggle to generalize to unseen views. In this work, we aim to unleash the power of data synthesis to promote the generalizability of pose regression. Specifically, we lift real 2D images into 3D Gaussian Splats with varying appearance and deblurring abilities, which are then used as a data engine to synthesize more posed images. To fully leverage the synthetic data, we build a two-branch joint training pipeline, with an adversarial discriminator to bridge the syn-to-real gap. Experiments on established benchmarks show that our method outperforms state-of-the-art end-to-end approaches, reducing translation and rotation errors by 50% and 21.6% on indoor datasets, and 35.56% and 38.7% on outdoor datasets. We also validate the effectiveness of our method in dynamic driving scenarios under varying weather conditions. Notably, as data synthesis scales up, our method exhibits a growing ability to interpolate and extrapolate training data for localizing unseen views. Project Page: https://ai4ce.github.io/RAP/

Related papers

AerialMegaDepth: Learning Aerial-Ground Reconstruction and View Synthesis [57.249817395828174]
We propose a scalable framework combining pseudo-synthetic renderings from 3D city-wide meshes with real, ground-level crowd-sourced images. The pseudo-synthetic data simulates a wide range of aerial viewpoints, while the real, crowd-sourced images help improve visual fidelity for ground-level images. Using this hybrid dataset, we fine-tune several state-of-the-art algorithms and achieve significant improvements on real-world, zero-shot aerial-ground tasks.
arXiv Detail & Related papers (2025-04-17T17:57:05Z)
Drive-1-to-3: Enriching Diffusion Priors for Novel View Synthesis of Real Vehicles [81.29018359825872]
This paper consolidates a set of good practices to finetune large pretrained models for a real-world task. Specifically, we develop several strategies to account for discrepancies between the synthetic data and real driving data. Our insights lead to effective finetuning that results in a $68.8%$ reduction in FID for novel view synthesis over prior arts.
arXiv Detail & Related papers (2024-12-19T03:39:13Z)
GPS-Gaussian+: Generalizable Pixel-wise 3D Gaussian Splatting for Real-Time Human-Scene Rendering from Sparse Views [67.34073368933814]
We propose a generalizable Gaussian Splatting approach for high-resolution image rendering under a sparse-view camera setting. We train our Gaussian parameter regression module on human-only data or human-scene data, jointly with a depth estimation module to lift 2D parameter maps to 3D space. Experiments on several datasets demonstrate that our method outperforms state-of-the-art methods while achieving an exceeding rendering speed.
arXiv Detail & Related papers (2024-11-18T08:18:44Z)
No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images [100.80376573969045]
NoPoSplat is a feed-forward model capable of reconstructing 3D scenes parameterized by 3D Gaussians from multi-view images. Our model achieves real-time 3D Gaussian reconstruction during inference. This work makes significant advances in pose-free generalizable 3D reconstruction and demonstrates its applicability to real-world scenarios.
arXiv Detail & Related papers (2024-10-31T17:58:22Z)
FaVoR: Features via Voxel Rendering for Camera Relocalization [23.7893950095252]
Camera relocalization methods range from dense image alignment to direct camera pose regression from a query image.<n>We propose a novel approach that leverages a globally sparse yet locally dense 3D representation of 2D features.<n>By tracking and triangulating landmarks over a sequence of frames, we construct a sparse voxel map optimized to render image patch descriptors observed during tracking.
arXiv Detail & Related papers (2024-09-11T18:58:16Z)
Synfeal: A Data-Driven Simulator for End-to-End Camera Localization [0.9749560288448114]
We propose a framework that synthesizes large localization datasets based on realistic 3D reconstructions of the real world. Our framework, Synfeal, is an open-source, data-driven simulator that synthesizes RGB images by moving a virtual camera through a realistic 3D textured mesh. The results validate that the training of camera localization algorithms on datasets generated by Synfeal leads to better results when compared to datasets generated by state-of-the-art methods.
arXiv Detail & Related papers (2023-05-29T17:29:02Z)
Real-Time Radiance Fields for Single-Image Portrait View Synthesis [85.32826349697972]
We present a one-shot method to infer and render a 3D representation from a single unposed image in real-time. Given a single RGB input, our image encoder directly predicts a canonical triplane representation of a neural radiance field for 3D-aware novel view synthesis via volume rendering. Our method is fast (24 fps) on consumer hardware, and produces higher quality results than strong GAN-inversion baselines that require test-time optimization.
arXiv Detail & Related papers (2023-05-03T17:56:01Z)
Benchmarking the Robustness of LiDAR-Camera Fusion for 3D Object Detection [58.81316192862618]
Two critical sensors for 3D perception in autonomous driving are the camera and the LiDAR. fusing these two modalities can significantly boost the performance of 3D perception models. We benchmark the state-of-the-art fusion methods for the first time.
arXiv Detail & Related papers (2022-05-30T09:35:37Z)
Simple and Effective Synthesis of Indoor 3D Scenes [78.95697556834536]
We study the problem of immersive 3D indoor scenes from one or more images. Our aim is to generate high-resolution images and videos from novel viewpoints. We propose an image-to-image GAN that maps directly from reprojections of incomplete point clouds to full high-resolution RGB-D images.
arXiv Detail & Related papers (2022-04-06T17:54:46Z)
DFNet: Enhance Absolute Pose Regression with Direct Feature Matching [16.96571417692014]
We introduce a camera relocalization pipeline that combines absolute pose regression (APR) and direct feature matching. We show that our method achieves a state-of-the-art accuracy by outperforming existing single-image APR methods by as much as 56%, comparable to 3D structure-based methods.
arXiv Detail & Related papers (2022-04-01T16:39:16Z)
CrossLoc: Scalable Aerial Localization Assisted by Multimodal Synthetic Data [2.554905387213586]
We present a visual localization system that learns to estimate camera poses in the real world with the help of synthetic data. To mitigate the data scarcity issue, we introduce TOPO-DataGen, a versatile synthetic data generation tool. We also introduce CrossLoc, a cross-modal visual representation learning approach to pose estimation.
arXiv Detail & Related papers (2021-12-16T18:05:48Z)
LENS: Localization enhanced by NeRF synthesis [3.4386226615580107]
We demonstrate improvement of camera pose regression thanks to an additional synthetic dataset rendered by the NeRF class of algorithm. We further improved localization accuracy of pose regressors using synthesized realistic and geometry consistent images as data augmentation during training.
arXiv Detail & Related papers (2021-10-13T08:15:08Z)
Unsupervised Feature Learning for Event Data: Direct vs Inverse Problem Formulation [53.850686395708905]
Event-based cameras record an asynchronous stream of per-pixel brightness changes. In this paper, we focus on single-layer architectures for representation learning from event data. We show improvements of up to 9 % in the recognition accuracy compared to the state-of-the-art methods.
arXiv Detail & Related papers (2020-09-23T10:40:03Z)

This list is automatically generated from the titles and abstracts of the papers in this site.