Visual Geometry Grounded Deep Structure From Motion
- URL: http://arxiv.org/abs/2312.04563v1
- Date: Thu, 7 Dec 2023 18:59:52 GMT
- Title: Visual Geometry Grounded Deep Structure From Motion
- Authors: Jianyuan Wang, Nikita Karaev, Christian Rupprecht, David Novotny
- Abstract summary: We propose a new deep pipeline VGGSfM, where each component is fully differentiable and can be trained in an end-to-end manner.
First, we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks, which eliminates the need for chaining pairwise matches.
We attain state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism, and ETH3D.
- Score: 20.203320509695306
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Structure-from-motion (SfM) is a long-standing problem in the computer vision
community, which aims to reconstruct the camera poses and 3D structure of a
scene from a set of unconstrained 2D images. Classical frameworks solve this
problem in an incremental manner by detecting and matching keypoints,
registering images, triangulating 3D points, and conducting bundle adjustment.
Recent research efforts have predominantly revolved around harnessing the power
of deep learning techniques to enhance specific elements (e.g., keypoint
matching), but are still based on the original, non-differentiable pipeline.
Instead, we propose a new deep pipeline VGGSfM, where each component is fully
differentiable and thus can be trained in an end-to-end manner. To this end, we
introduce new mechanisms and simplifications. First, we build on recent
advances in deep 2D point tracking to extract reliable pixel-accurate tracks,
which eliminates the need for chaining pairwise matches. Furthermore, we
recover all cameras simultaneously based on the image and track features
instead of gradually registering cameras. Finally, we optimise the cameras and
triangulate 3D points via a differentiable bundle adjustment layer. We attain
state-of-the-art performance on three popular datasets, CO3D, IMC Phototourism,
and ETH3D.
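To make the last step concrete, below is a minimal PyTorch sketch of a differentiable bundle-adjustment layer. It is not the VGGSfM implementation: plain unrolled gradient descent on the reprojection error stands in for the second-order solvers used in practice, rotations are kept as raw 3x3 matrices rather than a proper SO(3) parameterisation, and all tensor shapes and names are illustrative assumptions.

import torch

def project(points3d, rotations, translations, focal=1.0):
    # Pinhole projection of P world points into F cameras.
    # points3d:     (P, 3)    world-space 3D points
    # rotations:    (F, 3, 3) world-to-camera rotations (raw matrices here;
    #               a real system would use an SO(3)/Lie-algebra parameterisation)
    # translations: (F, 3)    world-to-camera translations
    # Returns (F, P, 2) pixel coordinates, assuming points sit in front of
    # the cameras and the principal point is at the image origin.
    cam = torch.einsum("fij,pj->fpi", rotations, points3d) + translations[:, None, :]
    return focal * cam[..., :2] / cam[..., 2:3].clamp(min=1e-6)

def bundle_adjust(tracks, visibility, rotations, translations, points3d,
                  iters=20, step=1e-2):
    # Refine cameras and 3D points by minimising reprojection error
    # against the predicted tracks:
    #   tracks:     (F, P, 2) predicted 2D track locations
    #   visibility: (F, P)    1.0 where a point is visible in a frame, else 0.0
    # create_graph=True keeps the unrolled loop differentiable, so a loss on
    # the refined outputs backpropagates into whatever networks produced the
    # inputs -- the property that makes the pipeline trainable end to end.
    rot, t, pts = rotations, translations, points3d
    for _ in range(iters):
        residual = (project(pts, rot, t) - tracks) * visibility[..., None]
        loss = residual.square().mean()
        g_rot, g_t, g_pts = torch.autograd.grad(loss, (rot, t, pts), create_graph=True)
        rot, t, pts = rot - step * g_rot, t - step * g_t, pts - step * g_pts
    return rot, t, pts

# Toy usage: in VGGSfM these inputs would come from the track and camera
# predictors; here they are random tensors that require gradients.
num_frames, num_points = 4, 32
tracks = torch.randn(num_frames, num_points, 2)
visibility = torch.ones(num_frames, num_points)
rot0 = torch.eye(3).repeat(num_frames, 1, 1).requires_grad_(True)
t0 = torch.zeros(num_frames, 3, requires_grad=True)
pts0 = torch.randn(num_points, 3, requires_grad=True) + torch.tensor([0.0, 0.0, 5.0])
rot, t, pts = bundle_adjust(tracks, visibility, rot0, t0, pts0)

First-order unrolling like this converges slowly; differentiable second-order solvers (e.g., Levenberg-Marquardt, as provided by libraries such as Theseus) are the usual choice in practice, but the end-to-end gradient flow they enable is the same idea the abstract describes.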
Related papers
- DynOMo: Online Point Tracking by Dynamic Online Monocular Gaussian Reconstruction [65.46359561104867]
We target the challenge of online 2D and 3D point tracking from unposed monocular camera input.
We leverage 3D Gaussian splatting to reconstruct dynamic scenes in an online fashion.
We aim to inspire the community to advance online point tracking and reconstruction, expanding their applicability to diverse real-world scenarios.
arXiv Detail & Related papers (2024-09-03T17:58:03Z)
- Scaling Multi-Camera 3D Object Detection through Weak-to-Strong Eliciting [32.66151412557986]
We present a weak-to-strong eliciting framework aimed at enhancing surround refinement while maintaining robust monocular perception.
Our framework employs weakly tuned experts trained on distinct subsets, and each is inherently biased toward specific camera configurations and scenarios.
For MC3D-Det joint training, an elaborate dataset-merging strategy is designed to resolve inconsistent camera counts and camera parameters across datasets.
arXiv Detail & Related papers (2024-04-10T03:11:10Z)
- DUSt3R: Geometric 3D Vision Made Easy [8.471330244002564]
We introduce DUSt3R, a novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections.
We show that this formulation smoothly unifies the monocular and binocular reconstruction cases.
Our formulation directly provides a 3D model of the scene as well as depth information and, interestingly, we can seamlessly recover pixel matches and relative and absolute camera poses from it.
arXiv Detail & Related papers (2023-12-21T18:52:14Z)
- R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras [106.52409577316389]
R3D3 is a multi-camera system for dense 3D reconstruction and ego-motion estimation.
Our approach exploits spatio-temporal information from multiple cameras together with monocular depth refinement.
We show that this design enables a dense, consistent 3D reconstruction of challenging, dynamic outdoor environments.
arXiv Detail & Related papers (2023-08-28T17:13:49Z)
- FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models [67.96827539201071]
We propose a novel test-time optimization approach for 3D scene reconstruction.
Our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets.
arXiv Detail & Related papers (2023-08-10T17:55:02Z)
- Learning Multi-View Aggregation In the Wild for Large-Scale 3D Semantic Segmentation [3.5939555573102853]
Recent works on 3D semantic segmentation propose to exploit the synergy between images and point clouds by processing each modality with a dedicated network.
We propose an end-to-end trainable multi-view aggregation model leveraging the viewing conditions of 3D points to merge features from images taken at arbitrary positions.
Our method can combine standard 2D and 3D networks and outperforms both 3D models operating on colorized point clouds and hybrid 2D/3D networks.
arXiv Detail & Related papers (2022-04-15T17:10:48Z)
- Unsupervised Learning of Visual 3D Keypoints for Control [104.92063943162896]
Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations.
We propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner.
These discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a consistent manner across both time and 3D space.
arXiv Detail & Related papers (2021-06-14T17:59:59Z)
- Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.