Track Everything Everywhere Fast and Robustly
- URL: http://arxiv.org/abs/2403.17931v1
- Date: Tue, 26 Mar 2024 17:58:22 GMT
- Title: Track Everything Everywhere Fast and Robustly
- Authors: Yunzhou Song, Jiahui Lei, Ziyun Wang, Lingjie Liu, Kostas Daniilidis
- Abstract summary: We propose a novel test-time optimization approach for efficiently tracking any pixel in a video.
We introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid.
Our experiments demonstrate a substantial improvement in training speed (more than 10 times faster), robustness, and accuracy in tracking over the SoTA optimization-based method OmniMotion.
- Score: 46.362962852140015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications. Moreover, OmniMotion is sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by the vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than 10 times faster), robustness, and accuracy in tracking over the SoTA optimization-based method OmniMotion.
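To make the CaDeX++ idea concrete, below is a minimal PyTorch sketch of an invertible coupling block whose scale/shift parameters are conditioned on a learnable local spatio-temporal feature grid. All class and parameter names here are illustrative assumptions; the released CaDeX++ architecture (including its specific non-linear coupling functions) differs.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GridCoupling(nn.Module):
    """Sketch of an invertible coupling block conditioned on a local
    spatio-temporal feature grid (illustrative only, not the CaDeX++ release).
    A 3D point is split into (x1, x2); x1 passes through unchanged and,
    together with a feature sampled from a (time, x1) grid, predicts a
    scale/shift for x2, so the map is invertible in closed form."""

    def __init__(self, feat_dim=16, hidden=64, grid_t=8, grid_x=32):
        super().__init__()
        # Learnable 2D feature grid indexed by (normalized time, normalized x1).
        self.grid = nn.Parameter(0.01 * torch.randn(1, feat_dim, grid_t, grid_x))
        self.mlp = nn.Sequential(
            nn.Linear(1 + feat_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, 4),  # 2-dim log-scale + 2-dim shift for x2
        )

    def _params(self, x1, t):
        # Sample local features at (t, x1); both normalized to [-1, 1].
        # x1: (N, 1), t: (N,) frame times.
        uv = torch.stack([x1.squeeze(-1), t], dim=-1).view(1, -1, 1, 2)
        feat = F.grid_sample(self.grid, uv, align_corners=True)  # (1, C, N, 1)
        feat = feat.squeeze(0).squeeze(-1).t()                   # (N, C)
        s, b = self.mlp(torch.cat([x1, feat], dim=-1)).chunk(2, dim=-1)
        return torch.tanh(s), b  # bounded log-scale keeps the inverse stable

    def forward(self, x, t):
        x1, x2 = x[..., :1], x[..., 1:]
        s, b = self._params(x1, t)
        return torch.cat([x1, x2 * s.exp() + b], dim=-1)

    def inverse(self, y, t):
        y1, y2 = y[..., :1], y[..., 1:]
        s, b = self._params(y1, t)
        return torch.cat([y1, (y2 - b) * (-s).exp()], dim=-1)
```

Round-tripping `inverse(forward(x, t), t)` recovers `x` up to floating-point error, which is the bijectivity property a deformation network of this kind relies on.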
Related papers
- Universal Online Temporal Calibration for Optimization-based Visual-Inertial Navigation Systems [13.416013522770905]
We propose a universal online temporal calibration strategy for optimization-based visual-inertial navigation systems.
We use the time offset td as a state parameter in the optimization residual model to align the IMU state to the corresponding image timestamp.
Our approach provides more accurate time offset estimation and faster convergence, particularly in the presence of noisy sensor data.
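A toy NumPy sketch of the core idea, treating the time offset td as a variable inside a reprojection residual; the paper's actual residual model and state parameterization are assumptions here.

```python
import numpy as np

def reprojection_residual(td, p, v, R, omega, X_w, z_obs, K):
    """Toy residual with the time offset td as an optimization variable
    (a sketch of the idea, not the paper's exact residual model).

    p, v: position/velocity at the image timestamp; R: camera-to-world
    rotation; omega: body angular rate. The state is propagated forward
    by td to first order before projecting landmark X_w.
    """
    p_td = p + v * td                       # first-order position correction
    w = omega * td
    th = np.linalg.norm(w) + 1e-12
    k = w / th
    Kx = np.array([[0, -k[2], k[1]],        # Rodrigues formula for Exp(w)
                   [k[2], 0, -k[0]],
                   [-k[1], k[0], 0]])
    dR = np.eye(3) + np.sin(th) * Kx + (1 - np.cos(th)) * Kx @ Kx
    R_td = R @ dR                           # right-multiply: body-frame rate
    X_c = R_td.T @ (X_w - p_td)             # world point into camera frame
    u = K @ (X_c / X_c[2])                  # pinhole projection
    return z_obs - u[:2]                    # 2D residual fed to the optimizer
```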
arXiv Detail & Related papers (2025-01-03T12:41:25Z) - Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - HGSLoc: 3DGS-based Heuristic Camera Pose Refinement [13.393035855468428]
Visual localization refers to the process of determining camera position and orientation within a known scene representation.
In this paper, we propose HGSLoc, which integrates 3D reconstruction with a refinement strategy to achieve higher pose estimation accuracy.
Our method demonstrates a faster rendering speed and higher localization accuracy compared to NeRF-based neural rendering approaches.
arXiv Detail & Related papers (2024-09-17T06:48:48Z) - D-NPC: Dynamic Neural Point Clouds for Non-Rigid View Synthesis from Monocular Video [53.83936023443193]
This paper contributes to the field by introducing a new method for dynamic novel view synthesis from monocular video, such as smartphone captures.
Our approach represents the scene as a dynamic neural point cloud, an implicit time-conditioned point cloud that encodes local geometry and appearance in separate hash-encoded neural feature grids.
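As a rough illustration of a hash-encoded feature grid, the sketch below quantizes (x, y, z, t) coordinates and hashes them into an embedding table, Instant-NGP style; D-NPC's actual encoding (resolution levels, interpolation, separate geometry/appearance grids) is assumed to differ.

```python
import torch
import torch.nn as nn

class HashGrid4D(nn.Module):
    """Toy hash-encoded feature grid over (x, y, z, t). A sketch only:
    nearest-cell lookup is used instead of multilinear interpolation to
    keep the example short."""

    def __init__(self, table_size=2**16, feat_dim=8, resolution=64):
        super().__init__()
        self.table = nn.Embedding(table_size, feat_dim)
        nn.init.uniform_(self.table.weight, -1e-4, 1e-4)
        self.table_size = table_size
        self.resolution = resolution

    def forward(self, xyzt):                        # (N, 4) in [0, 1]
        c = (xyzt * self.resolution).long()         # quantize to grid cells
        # Spatial hash with large odd constants (Instant-NGP style).
        h = c[:, 0] ^ (c[:, 1] * 2654435761) \
                    ^ (c[:, 2] * 805459861) \
                    ^ (c[:, 3] * 3674653429)
        idx = h % self.table_size                   # non-negative in PyTorch
        return self.table(idx)                      # (N, feat_dim) features
```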
arXiv Detail & Related papers (2024-06-14T14:35:44Z) - OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts.
Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module.
Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also achieves an optimization speed significantly greater than existing methods.
arXiv Detail & Related papers (2024-06-14T13:16:18Z) - PNeRFLoc: Visual Localization with Point-based Neural Radiance Fields [54.8553158441296]
We propose a novel visual localization framework, i.e., PNeRFLoc, based on a unified point-based representation.
On the one hand, PNeRFLoc supports the initial pose estimation by matching 2D and 3D feature points.
On the other hand, it also enables pose refinement with novel view synthesis using rendering-based optimization.
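The initial-pose stage can be pictured with a standard PnP-with-RANSAC call on the 2D-3D matches, as in this hedged OpenCV sketch; the inputs and the helper name are assumptions, not PNeRFLoc's code.

```python
import cv2
import numpy as np

def initial_pose(pts3d, pts2d, K):
    """Initial pose from 2D-3D matches via RANSAC PnP (sketch of the idea).

    pts3d: (N, 3) scene points matched to pts2d: (N, 2) image keypoints;
    K: 3x3 camera intrinsics."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float64), pts2d.astype(np.float64), K,
        distCoeffs=None, reprojectionError=3.0, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP failed: too few or degenerate matches")
    R, _ = cv2.Rodrigues(rvec)  # world -> camera rotation
    # Per the paper, this estimate would then be refined by minimizing a
    # rendering-based (photometric) loss against synthesized views.
    return R, tvec
```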
arXiv Detail & Related papers (2023-12-17T08:30:00Z) - Break a Lag: Triple Exponential Moving Average for Enhanced Optimization [2.0199251985015434]
We introduce Fast Adaptive Moment Estimation (FAME), a novel optimization technique that leverages the power of the Triple Exponential Moving Average (TEMA).
FAME enhances responsiveness to data dynamics, mitigates trend identification lag, and optimizes learning efficiency.
Our comprehensive evaluation encompasses different computer vision tasks including image classification, object detection, and semantic segmentation, integrating FAME into 30 distinct architectures.
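A minimal sketch of a TEMA-based gradient step, assuming the classic formula TEMA = 3*EMA1 - 3*EMA2 + EMA3 applied to gradients; FAME's published update rule and hyper-parameters may differ.

```python
import torch

def tema_sgd_step(params, grads, state, lr=0.01, beta=0.9):
    """One descent step using a Triple Exponential Moving Average of the
    gradients (a sketch of the TEMA idea behind FAME, not its exact update)."""
    for p, g in zip(params, grads):
        st = state.setdefault(p, [torch.zeros_like(g) for _ in range(3)])
        e1, e2, e3 = st
        e1.mul_(beta).add_(g,  alpha=1 - beta)   # EMA of gradients
        e2.mul_(beta).add_(e1, alpha=1 - beta)   # EMA of the EMA
        e3.mul_(beta).add_(e2, alpha=1 - beta)   # EMA of the double EMA
        tema = 3 * e1 - 3 * e2 + e3              # lag-reduced trend estimate
        p.data.add_(tema, alpha=-lr)
```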
arXiv Detail & Related papers (2023-06-02T10:29:33Z) - Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport [18.717832661972896]
A new approach is proposed based on, for the first time, an interplay between thoughtfully designed continuous and discrete dynamics.
The method exactly preserves the manifold structure without requiring the commonly used projection or retraction operations.
Its generalization to adaptive learning rates is also demonstrated.
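For contrast, here is the conventional baseline the paper avoids: a momentum step that projects the Euclidean gradient onto the tangent space and retracts back to the Stiefel manifold with a QR factorization. A sketch only; a complete method would also parallel-transport the momentum.

```python
import torch

def stiefel_momentum_step(W, G, M, lr=0.1, mu=0.9):
    """Projection + QR-retraction momentum step on {W : W^T W = I}.
    This is the standard baseline the paper contrasts with; its proposed
    dynamics dispense with projection and retraction entirely."""
    # Project the Euclidean gradient G onto the tangent space at W.
    WtG = W.t() @ G
    rgrad = G - W @ (WtG + WtG.t()) / 2
    M.mul_(mu).add_(rgrad)                  # momentum in the tangent space
    Q, R = torch.linalg.qr(W - lr * M)      # retract back to the manifold
    # Fix the QR sign ambiguity so the retraction is a smooth map.
    Q = Q * torch.sign(torch.diagonal(R)).unsqueeze(0)
    # (A full method would also transport M to the tangent space at Q.)
    return Q, M
```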
arXiv Detail & Related papers (2022-05-27T18:01:45Z) - DiffSkill: Skill Abstraction from Differentiable Physics for Deformable Object Manipulations with Tools [96.38972082580294]
DiffSkill is a novel framework that uses a differentiable physics simulator for skill abstraction to solve deformable object manipulation tasks.
In particular, we first obtain short-horizon skills using individual tools from a gradient-based simulator.
We then learn a neural skill abstractor from the demonstration trajectories which takes RGBD images as input.
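The short-horizon stage can be illustrated by differentiating through a rollout: the toy sketch below backpropagates a goal loss through a trivially differentiable dynamics model, standing in for the far richer differentiable soft-body simulator DiffSkill actually uses.

```python
import torch

def optimize_actions(x0, goal, horizon=20, iters=200, dt=0.1, lr=0.05):
    """Gradient-based trajectory optimization through a differentiable
    rollout (toy stand-in for a real differentiable physics simulator)."""
    actions = torch.zeros(horizon, x0.shape[0], requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(iters):
        x = x0
        for a in actions:                 # differentiable rollout
            x = x + dt * a                # stand-in for soft-body dynamics
        loss = (x - goal).pow(2).sum() + 1e-3 * actions.pow(2).sum()
        opt.zero_grad()
        loss.backward()                   # gradients flow through the rollout
        opt.step()
    return actions.detach()
```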
arXiv Detail & Related papers (2022-03-31T17:59:38Z) - Efficient Global Optimization of Non-differentiable, Symmetric Objectives for Multi Camera Placement [0.0]
We propose a novel iterative method for optimally placing and orienting multiple cameras in a 3D scene.
Sample applications include improving the accuracy of 3D reconstruction, maximizing the covered area for surveillance, or improving the coverage in multi-viewpoint pedestrian tracking.
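As a baseline illustration only (the paper's iterative method, and its handling of non-differentiability and symmetry, are more sophisticated), here is a derivative-free random search over camera poses that maximizes a simple coverage objective.

```python
import numpy as np

def coverage(cams, points, fov_cos=0.5, max_dist=10.0):
    """Fraction of scene points seen by at least one camera (pos, unit dir)."""
    seen = np.zeros(len(points), dtype=bool)
    for pos, d in cams:
        v = points - pos
        dist = np.linalg.norm(v, axis=1)
        cos_ang = (v @ d) / np.maximum(dist, 1e-9)
        seen |= (cos_ang > fov_cos) & (dist < max_dist)  # inside the view cone
    return seen.mean()

def place_cameras(points, n_cams=3, iters=2000, seed=0):
    """Random-restart search: a non-differentiable objective permits only
    derivative-free optimization like this."""
    rng = np.random.default_rng(seed)
    best, best_val = None, -1.0
    for _ in range(iters):
        cams = []
        for _ in range(n_cams):
            pos = rng.uniform(-10.0, 10.0, size=3)
            d = rng.normal(size=3)
            cams.append((pos, d / np.linalg.norm(d)))
        val = coverage(cams, points)
        if val > best_val:
            best, best_val = cams, val
    return best, best_val
```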
arXiv Detail & Related papers (2021-03-20T17:01:15Z)