Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning
- URL: http://arxiv.org/abs/2602.20157v1
- Date: Mon, 23 Feb 2026 18:59:30 GMT
- Title: Flow3r: Factored Flow Prediction for Scalable Visual Geometry Learning
- Authors: Zhongxiao Cong, Qitao Zhao, Minsik Jeon, Shubham Tulsiani
- Abstract summary: Flow3r is a framework that augments visual geometry learning with dense 2D correspondences ('flow') as supervision. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other.
- Score: 28.722572714606112
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Current feed-forward 3D/4D reconstruction systems rely on dense geometry and pose supervision -- expensive to obtain at scale and particularly scarce for dynamic real-world scenes. We present Flow3r, a framework that augments visual geometry learning with dense 2D correspondences ('flow') as supervision, enabling scalable training from unlabeled monocular videos. Our key insight is that the flow prediction module should be factored: predicting flow between two images using geometry latents from one and pose latents from the other. This factorization directly guides the learning of both scene geometry and camera motion, and naturally extends to dynamic scenes. In controlled experiments, we show that factored flow prediction outperforms alternative designs and that performance scales consistently with unlabeled data. Integrating factored flow into existing visual geometry architectures and training with ${\sim}800$K unlabeled videos, Flow3r achieves state-of-the-art results across eight benchmarks spanning static and dynamic scenes, with its largest gains on in-the-wild dynamic videos where labeled data is most scarce.
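To make the factorization in the abstract concrete, the sketch below shows the classical geometric fact behind it: in a static scene, the 2D flow of a pixel is determined by the depth (geometry) seen in the source image together with the relative camera pose of the target image. This is only an illustration with explicit depth and pose. Flow3r itself predicts flow from learned latents, and nothing here is taken from its implementation. The function name `induced_flow`, the pinhole intrinsics tuple, and the example values are all assumptions made for this sketch.

```python
def induced_flow(pixel, depth, intrinsics, rotation, translation):
    """Flow of one source pixel induced by static geometry and relative pose.

    pixel:       (u, v) coordinates in the source image
    depth:       depth of that pixel in the source camera frame
    intrinsics:  (fx, fy, cx, cy) of an assumed shared pinhole camera
    rotation:    3x3 nested list mapping source-frame points to the target frame
    translation: length-3 sequence, target-frame offset
    """
    u, v = pixel
    fx, fy, cx, cy = intrinsics

    # Geometry of the source view: back-project the pixel to a 3D point.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)

    # Pose of the target view: X' = R X + t.
    xt, yt, zt = (
        sum(r * c for r, c in zip(row, point)) + t
        for row, t in zip(rotation, translation)
    )
    if zt <= 0:
        raise ValueError("point falls behind the target camera")

    # Re-project into the target image; flow is the pixel displacement.
    u2 = fx * xt / zt + cx
    v2 = fy * yt / zt + cy
    return (u2 - u, v2 - v)


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    # Hypothetical values: camera shifted sideways, pixel at the principal point.
    print(induced_flow((50.0, 50.0), 2.0, (100.0, 100.0, 50.0, 50.0),
                       identity, (0.5, 0.0, 0.0)))
```

Changing only `depth` or only the pose changes the resulting flow, which is one way to see why flow can act as a supervision signal for both quantities. Dynamic scenes add per-point motion on top of this and are not covered by the sketch.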
Related papers
- Flow4R: Unifying 4D Reconstruction and Tracking with Scene Flow [61.297800738187355]
Flow4R predicts a minimal per-pixel property set (3D point position, scene flow, pose weight, and confidence) from two-view inputs using a Vision Transformer. Trained jointly on static and dynamic datasets, Flow4R achieves state-of-the-art performance on 4D reconstruction and tracking tasks.
arXiv Detail & Related papers (2026-02-15T06:58:08Z)
- Scalable Adaptation of 3D Geometric Foundation Models via Weak Supervision from Internet Video [76.32954467706581]
We propose SAGE, a framework for Scalable Adaptation of GEometric foundation models from raw video streams. We use a hierarchical mining pipeline to transform videos into training trajectories and hybrid supervision. Experiments show that SAGE significantly enhances zero-shot generalization, reducing Chamfer Distance by 20-42% on unseen benchmarks.
arXiv Detail & Related papers (2026-02-08T09:53:21Z)
- Flow-Anything: Learning Real-World Optical Flow Estimation from Large-Scale Single-view Images [23.731451842621933]
We develop a large-scale data generation framework designed to learn optical flow estimation from any single-view images in the real world. For the first time, we demonstrate the benefits of generating optical flow training data from large-scale real-world images. Our models serve as a foundation model and enhance the performance of various downstream video tasks.
arXiv Detail & Related papers (2025-06-09T13:23:44Z)
- VoxelSplat: Dynamic Gaussian Splatting as an Effective Loss for Occupancy and Flow Prediction [46.31516096522758]
Recent advancements in camera-based occupancy prediction have focused on the simultaneous prediction of 3D semantics and scene flow. We propose a novel regularization framework called VoxelSplat to address these challenges and their underlying causes. Our framework uses the predicted scene flow to model the motion of Gaussians, and is thus able to learn the scene flow of moving objects in a self-supervised manner.
arXiv Detail & Related papers (2025-06-05T20:19:35Z)
- MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion [118.74385965694694]
We present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. By simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation, previously only used for static scenes, to dynamic scenes. We show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics.
arXiv Detail & Related papers (2024-10-04T18:00:07Z)
- Semantic Flow: Learning Semantic Field of Dynamic Scenes from Monocular Videos [23.275595857385884]
We pioneer Semantic Flow, a neural semantic representation of dynamic scenes from monocular videos.
We first learn a flow network to predict flows in the dynamic scene, and propose a flow feature aggregation module to extract flow features from video frames.
Then, we propose a flow attention module to extract motion information from flow features, which is followed by a semantic network to output semantic logits of flows.
arXiv Detail & Related papers (2024-04-08T03:06:19Z)
- AutoDecoding Latent 3D Diffusion Models [95.7279510847827]
We present a novel approach to the generation of static and articulated 3D assets that has a 3D autodecoder at its core.
The 3D autodecoder framework embeds properties learned from the target dataset in the latent space.
We then identify the appropriate intermediate volumetric latent space, and introduce robust normalization and de-normalization operations.
arXiv Detail & Related papers (2023-07-07T17:59:14Z)
- Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency [114.02182755620784]
We present an end-to-end joint training framework that explicitly models 6-DoF motion of multiple dynamic objects, ego-motion and depth in a monocular camera setup without supervision.
Our framework is shown to outperform the state-of-the-art depth and motion estimation methods.
arXiv Detail & Related papers (2021-02-04T14:26:42Z)
- Learning to Segment Rigid Motions from Two Frames [72.14906744113125]
We propose a modular network, motivated by a geometric analysis of what independent object motions can be recovered from an egomotion field.
It takes two consecutive frames as input and predicts segmentation masks for the background and multiple rigidly moving objects, which are then parameterized by 3D rigid transformations.
Our method achieves state-of-the-art performance for rigid motion segmentation on KITTI and Sintel.
arXiv Detail & Related papers (2021-01-11T04:20:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.