Track Everything Everywhere Fast and Robustly
- URL: http://arxiv.org/abs/2403.17931v1
- Date: Tue, 26 Mar 2024 17:58:22 GMT
- Title: Track Everything Everywhere Fast and Robustly
- Authors: Yunzhou Song, Jiahui Lei, Ziyun Wang, Lingjie Liu, Kostas Daniilidis,
- Abstract summary: We propose a novel test-time optimization approach for efficiently tracking any pixel in a video.
We introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid.
Our experiments demonstrate a substantial improvement in training speed (more than textbf10 times faster), robustness, and accuracy in tracking over the SoTA optimization-based method OmniMotion.
- Score: 46.362962852140015
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a novel test-time optimization approach for efficiently and robustly tracking any pixel at any time in a video. The latest state-of-the-art optimization-based tracking technique, OmniMotion, requires a prohibitively long optimization time, rendering it impractical for downstream applications. OmniMotion is sensitive to the choice of random seeds, leading to unstable convergence. To improve efficiency and robustness, we introduce a novel invertible deformation network, CaDeX++, which factorizes the function representation into a local spatial-temporal feature grid and enhances the expressivity of the coupling blocks with non-linear functions. While CaDeX++ incorporates a stronger geometric bias within its architectural design, it also takes advantage of the inductive bias provided by the vision foundation models. Our system utilizes monocular depth estimation to represent scene geometry and enhances the objective by incorporating DINOv2 long-term semantics to regulate the optimization process. Our experiments demonstrate a substantial improvement in training speed (more than \textbf{10 times} faster), robustness, and accuracy in tracking over the SoTA optimization-based method OmniMotion.
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses this by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - HGSLoc: 3DGS-based Heuristic Camera Pose Refinement [13.393035855468428]
Visual localization refers to the process of determining camera poses and orientation within a known scene representation.
In this paper, we propose HGSLoc, which integrates 3D reconstruction with a refinement strategy to achieve higher pose estimation accuracy.
Our method demonstrates a faster rendering speed and higher localization accuracy compared to NeRF-based neural rendering approaches.
arXiv Detail & Related papers (2024-09-17T06:48:48Z) - D-NPC: Dynamic Neural Point Clouds for Non-Rigid View Synthesis from Monocular Video [53.83936023443193]
This paper contributes to the field by introducing a new synthesis method for dynamic novel view from monocular video, such as smartphone captures.
Our approach represents the as a $textitdynamic neural point cloud$, an implicit time-conditioned point cloud that encodes local geometry and appearance in separate hash-encoded neural feature grids.
arXiv Detail & Related papers (2024-06-14T14:35:44Z) - OrientDream: Streamlining Text-to-3D Generation with Explicit Orientation Control [66.03885917320189]
OrientDream is a camera orientation conditioned framework for efficient and multi-view consistent 3D generation from textual prompts.
Our strategy emphasizes the implementation of an explicit camera orientation conditioned feature in the pre-training of a 2D text-to-image diffusion module.
Our experiments reveal that our method not only produces high-quality NeRF models with consistent multi-view properties but also achieves an optimization speed significantly greater than existing methods.
arXiv Detail & Related papers (2024-06-14T13:16:18Z) - VoxNeRF: Bridging Voxel Representation and Neural Radiance Fields for
Enhanced Indoor View Synthesis [51.49008959209671]
We introduce VoxNeRF, a novel approach that leverages volumetric representations to enhance the quality and efficiency of indoor view synthesis.
We employ multi-resolution hash grids to adaptively capture spatial features, effectively managing occlusions and the intricate geometry of indoor scenes.
We validate our approach against three public indoor datasets and demonstrate that VoxNeRF outperforms state-of-the-art methods.
arXiv Detail & Related papers (2023-11-09T11:32:49Z) - Transformer-Based Learned Optimization [37.84626515073609]
We propose a new approach to learned optimization where we represent the computation's update step using a neural network.
Our innovation is a new neural network architecture inspired by the classic BFGS algorithm.
We demonstrate the advantages of our approach on a benchmark composed of objective functions traditionally used for the evaluation of optimization algorithms.
arXiv Detail & Related papers (2022-12-02T09:47:08Z) - Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal
Attention, and Optimal Transport [18.717832661972896]
New approach is proposed based on, for the first time, an interplay between thoughtfully designed continuous and discrete dynamics.
Method exactly preserves the manifold structure but does not require commonly used projection or retraction.
Its generalization to adaptive learning rates is also demonstrated.
arXiv Detail & Related papers (2022-05-27T18:01:45Z) - DiffSkill: Skill Abstraction from Differentiable Physics for Deformable
Object Manipulations with Tools [96.38972082580294]
DiffSkill is a novel framework that uses a differentiable physics simulator for skill abstraction to solve deformable object manipulation tasks.
In particular, we first obtain short-horizon skills using individual tools from a gradient-based simulator.
We then learn a neural skill abstractor from the demonstration trajectories which takes RGBD images as input.
arXiv Detail & Related papers (2022-03-31T17:59:38Z) - Efficient Global Optimization of Non-differentiable, Symmetric
Objectives for Multi Camera Placement [0.0]
We propose a novel iterative method for optimally placing and orienting multiple cameras in a 3D scene.
Sample applications include improving the accuracy of 3D reconstruction, maximizing the covered area for surveillance, or improving the coverage in multi-viewpoint pedestrian tracking.
arXiv Detail & Related papers (2021-03-20T17:01:15Z) - Enhanced data efficiency using deep neural networks and Gaussian
processes for aerodynamic design optimization [0.0]
Adjoint-based optimization methods are attractive for aerodynamic shape design.
They can become prohibitively expensive when multiple optimization problems are being solved.
We propose a machine learning enabled, surrogate-based framework that replaces the expensive adjoint solver.
arXiv Detail & Related papers (2020-08-15T15:09:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.