AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
- URL: http://arxiv.org/abs/2503.23282v1
- Date: Sun, 30 Mar 2025 02:22:11 GMT
- Title: AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos
- Authors: Felix Wimbauer, Weirong Chen, Dominik Muhle, Christian Rupprecht, Daniel Cremers
- Abstract summary: We propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. By combining camera information, uncertainty, and depth, our model can produce high-quality 4D pointclouds.
- Score: 52.726585508669686
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating camera motion and intrinsics from casual videos is a core challenge in computer vision. Traditional bundle-adjustment based methods, such as SfM and SLAM, struggle to perform reliably on arbitrary data. Although specialized SfM approaches have been developed for handling dynamic scenes, they either require intrinsics or computationally expensive test-time optimization and often fall short in performance. Recently, methods like Dust3r have reformulated the SfM problem in a more data-driven way. While such techniques show promising results, they are still 1) not robust to dynamic objects and 2) require labeled data for supervised training. As an alternative, we propose AnyCam, a fast transformer model that directly estimates camera poses and intrinsics from a dynamic video sequence in a feed-forward fashion. Our intuition is that such a network can learn strong priors over realistic camera poses. To scale up our training, we rely on an uncertainty-based loss formulation and pre-trained depth and flow networks instead of motion or trajectory supervision. This allows us to use diverse, unlabeled video datasets obtained mostly from YouTube. Additionally, we ensure that the predicted trajectory does not accumulate drift over time through a lightweight trajectory refinement step. We test AnyCam on established datasets, where it delivers accurate camera poses and intrinsics both qualitatively and quantitatively. Furthermore, even with trajectory refinement, AnyCam is significantly faster than existing works for SfM in dynamic settings. Finally, by combining camera information, uncertainty, and depth, our model can produce high-quality 4D pointclouds.
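The abstract's last point, lifting a video into a 4D pointcloud from predicted intrinsics, per-frame poses, monocular depth, and an uncertainty map, follows the standard unprojection recipe. The sketch below is a minimal, generic illustration of that recipe, not the authors' released code; the array names, the camera-to-world pose convention, and the uncertainty threshold are assumptions made for the example.

```python
import numpy as np

def unproject_frame(depth, K, pose_c2w, uncertainty=None, max_uncertainty=0.5):
    """Lift one frame into a world-frame pointcloud.

    depth:        (H, W) depth map from a monocular depth network
    K:            (3, 3) predicted camera intrinsics
    pose_c2w:     (4, 4) predicted camera-to-world pose for this frame
    uncertainty:  optional (H, W) per-pixel uncertainty; high values are dropped
    """
    H, W = depth.shape
    # Pixel grid in homogeneous coordinates: one (u, v, 1) row per pixel.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project: X_cam = depth * K^{-1} [u, v, 1]^T
    rays = pix @ np.linalg.inv(K).T
    pts_cam = rays * depth.reshape(-1, 1)

    # Move to the world frame: X_world = R X_cam + t
    R, t = pose_c2w[:3, :3], pose_c2w[:3, 3]
    pts_world = pts_cam @ R.T + t

    # Keep only pixels the model is confident about (e.g. static, well-textured regions).
    if uncertainty is not None:
        keep = uncertainty.reshape(-1) < max_uncertainty
        pts_world = pts_world[keep]
    return pts_world

# Repeating unproject_frame over all frames, each with its own pose and timestamp,
# yields a 4D (space + time) pointcloud of the kind described in the abstract.
```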
Related papers
- Joint Optimization of Neural Radiance Fields and Continuous Camera Motion from a Monocular Video [22.760823792026056]
We propose a novel method that eliminates prior dependencies by modeling continuous camera motions as time-dependent angular velocity and velocity.
Our approach achieves superior camera pose and depth estimation and comparable novel-view synthesis performance compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-04-28T14:22:04Z) - Towards Understanding Camera Motions in Any Video [80.223048294482]
We introduce CameraBench, a large-scale dataset and benchmark designed to assess and improve camera motion understanding.
CameraBench consists of 3,000 diverse internet videos annotated by experts through a rigorous quality control process.
One of our contributions is a taxonomy of camera motion primitives, designed in collaboration with cinematographers.
arXiv Detail & Related papers (2025-04-21T18:34:57Z) - ReCamMaster: Camera-Controlled Generative Rendering from A Single Video [72.42376733537925]
ReCamMaster is a camera-controlled generative video re-rendering framework. It reproduces the dynamic scene of an input video at novel camera trajectories. Our method also finds promising applications in video stabilization, super-resolution, and outpainting.
arXiv Detail & Related papers (2025-03-14T17:59:31Z) - Learning Camera Movement Control from Real-World Drone Videos [25.10006841389459]
Existing AI videography methods struggle with limited appearance diversity in simulation training. We propose a scalable method that involves collecting real-world training data. We show that our system effectively learns to perform challenging camera movements.
arXiv Detail & Related papers (2024-12-12T18:59:54Z) - MegaSaM: Accurate, Fast, and Robust Structure and Motion from Casual Dynamic Videos [104.1338295060383]
We present a system that allows for accurate, fast, and robust estimation of camera parameters and depth maps from casual monocular videos of dynamic scenes. Our system is significantly more accurate and robust at camera pose and depth estimation when compared with prior and concurrent work.
arXiv Detail & Related papers (2024-12-05T18:59:42Z) - AC3D: Analyzing and Improving 3D Camera Control in Video Diffusion Transformers [66.29824750770389]
We analyze camera motion from a first-principles perspective, uncovering insights that enable precise 3D camera manipulation. We compound these findings to design the Advanced 3D Camera Control (AC3D) architecture.
arXiv Detail & Related papers (2024-11-27T18:49:13Z) - RoMo: Robust Motion Segmentation Improves Structure from Motion [46.77236343300953]
We propose a novel approach to video-based motion segmentation to identify the components of a scene that are moving w.r.t. a fixed world frame. Our simple but effective iterative method, RoMo, combines optical flow and epipolar cues with a pre-trained video segmentation model (a rough sketch of the epipolar-cue idea follows this list). More importantly, the combination of an off-the-shelf SfM pipeline with our segmentation masks establishes a new state-of-the-art on camera calibration for scenes with dynamic content, outperforming existing methods by a substantial margin.
arXiv Detail & Related papers (2024-11-27T01:09:56Z) - FlowCam: Training Generalizable 3D Radiance Fields without Camera Poses via Pixel-Aligned Scene Flow [26.528667940013598]
Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning.
A key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion.
We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass.
arXiv Detail & Related papers (2023-05-31T20:58:46Z) - RM-Depth: Unsupervised Learning of Recurrent Monocular Depth in Dynamic Scenes [7.81768535871051]
An unsupervised learning framework is proposed to jointly predict monocular depth and complete 3D motion.
Recurrent modulation units are used to adaptively and iteratively fuse encoder and decoder features.
A warping-based network is used to estimate a motion field of moving objects without using semantic priors.
arXiv Detail & Related papers (2023-03-08T09:11:50Z) - ParticleSfM: Exploiting Dense Point Trajectories for Localizing Moving Cameras in the Wild [57.37891682117178]
We present a robust dense indirect structure-from-motion method for videos that is based on dense correspondence from pairwise optical flow.
A novel neural network architecture is proposed for processing irregular point trajectory data.
Experiments on the MPI Sintel dataset show that our system produces significantly more accurate camera trajectories.
arXiv Detail & Related papers (2022-07-19T09:19:45Z)
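The RoMo entry above mentions combining optical flow with epipolar cues. As a rough illustration of the epipolar-cue idea only, and not the authors' actual pipeline, the sketch below fits a fundamental matrix to flow correspondences with RANSAC and flags points whose Sampson error stays high as likely dynamic; the function name and thresholds are assumptions for the example.

```python
import cv2
import numpy as np

def dynamic_mask_from_flow(pts0, pts1, sampson_thresh=2.0):
    """Flag correspondences that violate the epipolar constraint of the dominant
    (camera-induced) motion, i.e. points that likely belong to moving objects.

    pts0, pts1: (N, 2) matched pixel coordinates in two frames, e.g. sampled
                from a dense optical flow field.
    Returns a boolean array of shape (N,), True = likely dynamic.
    """
    # Robustly fit the fundamental matrix; RANSAC treats the static background
    # as the dominant model and moving objects as outliers.
    F, inlier_mask = cv2.findFundamentalMat(pts0, pts1, cv2.FM_RANSAC, 1.0, 0.999)
    if F is None:
        return np.zeros(len(pts0), dtype=bool)

    # Sampson distance of each correspondence to the fitted epipolar geometry.
    x0 = np.hstack([pts0, np.ones((len(pts0), 1))])  # homogeneous points, frame 0
    x1 = np.hstack([pts1, np.ones((len(pts1), 1))])  # homogeneous points, frame 1
    Fx0 = x0 @ F.T            # epipolar lines in frame 1
    Ftx1 = x1 @ F             # epipolar lines in frame 0
    num = np.sum(x1 * Fx0, axis=1) ** 2
    den = Fx0[:, 0] ** 2 + Fx0[:, 1] ** 2 + Ftx1[:, 0] ** 2 + Ftx1[:, 1] ** 2
    sampson = num / np.maximum(den, 1e-8)

    # Large residuals cannot be explained by camera motion alone.
    return sampson > sampson_thresh
```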