Can Visual Foundation Models Achieve Long-term Point Tracking?
- URL: http://arxiv.org/abs/2408.13575v1
- Date: Sat, 24 Aug 2024 12:58:08 GMT
- Title: Can Visual Foundation Models Achieve Long-term Point Tracking?
- Authors: Görkay Aydemir, Weidi Xie, Fatma Güney
- Abstract summary: We evaluate the geometric awareness of visual foundation models in the context of point tracking.
Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings.
- Score: 37.95592121632532
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large-scale vision foundation models have demonstrated remarkable success across various tasks, underscoring their robust generalization capabilities. While their proficiency in two-view correspondence has been explored, their effectiveness in long-term correspondence within complex environments remains unexplored. To address this, we evaluate the geometric awareness of visual foundation models in the context of point tracking: (i) in zero-shot settings, without any training; (ii) by probing with low-capacity layers; (iii) by fine-tuning with Low Rank Adaptation (LoRA). Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings. Furthermore, DINOv2 achieves performance comparable to supervised models in adaptation settings, demonstrating its potential as a strong initialization for correspondence learning.
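As a rough illustration of the zero-shot setting (i), the sketch below tracks a query point by nearest-neighbor matching in a frozen feature space. It is a generic matcher under assumed shapes, not the paper's evaluation code; in practice the feature maps would come from a frozen backbone such as DINOv2 or Stable Diffusion, and the probing (ii) and LoRA (iii) settings would add a small trainable head or low-rank adapters on top of the same features.

```python
import torch
import torch.nn.functional as F

def track_point_zero_shot(feats, query_xy):
    """Track one query point through a video by nearest-neighbor matching
    in a frozen feature space (zero-shot: no training involved).

    feats:    [T, C, H, W] feature maps from a frozen backbone
              (e.g., DINOv2 patch tokens reshaped into a grid).
    query_xy: (x, y) of the query point in frame 0, in [0, 1] coordinates.
    Returns:  [T, 2] predicted (x, y) per frame, in [0, 1] coordinates.
    """
    T, C, H, W = feats.shape
    feats = F.normalize(feats, dim=1)  # unit-norm features -> dot product = cosine

    # Bilinearly sample the query point's feature vector in frame 0.
    grid = torch.tensor(query_xy).view(1, 1, 1, 2) * 2 - 1    # to [-1, 1]
    q = F.grid_sample(feats[0:1], grid, align_corners=False)  # [1, C, 1, 1]

    # Similarity of the query against every location in every frame.
    sim = (feats * q).sum(dim=1)        # [T, H, W]
    idx = sim.flatten(1).argmax(dim=1)  # best match per frame
    ys, xs = idx // W, idx % W
    return torch.stack([(xs + 0.5) / W, (ys + 0.5) / H], dim=1)

# Random stand-in features; a real run would use frozen backbone features.
tracks = track_point_zero_shot(torch.randn(8, 384, 32, 32), (0.5, 0.5))
print(tracks.shape)  # torch.Size([8, 2])
```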
Related papers
- Unsupervised Non-Rigid Point Cloud Matching through Large Vision Models [1.3030624795284795]
We propose a learning-based framework for non-rigid point cloud matching.
The key insight is to incorporate semantic features derived from large vision models (LVMs).
Our framework effectively leverages the structural information contained in the semantic features to address ambiguities arising from self-similarities among local geometries.
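A loose sketch of that insight (the actual framework is learned; the fixed-weight fusion and all names below are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def match_points(geo_a, sem_a, geo_b, sem_b, w_sem=0.5):
    """Match each point of cloud A to cloud B by nearest neighbor in a
    combined geometric + semantic space; semantic features (e.g., rendered
    from a large vision model onto the cloud) disambiguate locally
    self-similar geometry.

    geo_*: [N, Dg] geometric descriptors; sem_*: [N, Ds] semantic features.
    Returns: [Na] index into cloud B for every point of cloud A.
    """
    fa = torch.cat([F.normalize(geo_a, dim=1) * (1 - w_sem),
                    F.normalize(sem_a, dim=1) * w_sem], dim=1)
    fb = torch.cat([F.normalize(geo_b, dim=1) * (1 - w_sem),
                    F.normalize(sem_b, dim=1) * w_sem], dim=1)
    return (fa @ fb.t()).argmax(dim=1)  # per-point best match

idx = match_points(torch.randn(100, 32), torch.randn(100, 384),
                   torch.randn(120, 32), torch.randn(120, 384))
```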
arXiv Detail & Related papers (2024-08-16T07:02:19Z)
- 4D Contrastive Superflows are Dense 3D Representation Learners [62.433137130087445]
We introduce SuperFlow, a novel framework designed to harness consecutive LiDAR-camera pairs for establishing pretraining objectives.
To further boost learning efficiency, we incorporate a plug-and-play view consistency module that enhances alignment of the knowledge distilled from camera views.
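The abstract points to contrastive pretraining over LiDAR-camera pairs; below is a generic point-to-pixel InfoNCE loss of the kind such pipelines typically use (SuperFlow's actual objectives, including the temporal and view-consistency terms, are not specified here):

```python
import torch
import torch.nn.functional as F

def point_pixel_infonce(point_feats, pixel_feats, tau=0.07):
    """Generic InfoNCE between N LiDAR-point features and the N camera-pixel
    features they project onto; row i of both tensors is assumed to describe
    the same physical point. Shapes: [N, D] each.
    """
    p = F.normalize(point_feats, dim=1)
    q = F.normalize(pixel_feats, dim=1)
    logits = p @ q.t() / tau        # [N, N] similarities
    targets = torch.arange(len(p))  # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

loss = point_pixel_infonce(torch.randn(256, 128), torch.randn(256, 128))
```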
arXiv Detail & Related papers (2024-07-08T17:59:54Z)
- EffoVPR: Effective Foundation Model Utilization for Visual Place Recognition [6.996304653818122]
We propose a simple yet powerful approach to better exploit the potential of a foundation model for Visual Place Recognition.
We first demonstrate that features extracted from self-attention layers can serve as a powerful re-ranker for VPR.
We then demonstrate that a single-stage method leveraging internal ViT layers for pooling can generate global features that achieve state-of-the-art results.
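The abstract does not say how the internal layers are pooled, so the sketch below uses generalized-mean (GeM) pooling, a common choice in place recognition, as a stand-in for aggregating internal ViT tokens into a global descriptor:

```python
import torch
import torch.nn.functional as F

def gem_pool(patch_tokens, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling of ViT patch tokens [B, N, D] into an
    L2-normalized global descriptor [B, D]; a stand-in, since EffoVPR's
    exact pooling is not given in the abstract.
    """
    x = patch_tokens.clamp(min=eps).pow(p).mean(dim=1).pow(1.0 / p)
    return F.normalize(x, dim=1)

desc = gem_pool(torch.rand(2, 196, 768))  # e.g., tokens from an internal ViT layer
```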
arXiv Detail & Related papers (2024-05-28T11:24:41Z)
- Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models [55.919653720979824]
This paper focuses on the concept of weak-to-strong generalization, which involves using a weaker model to supervise a stronger one.
We introduce a novel and adaptively adjustable loss function for weak-to-strong supervision.
Our approach not only exceeds the performance benchmarks set by strong-to-strong generalization but also surpasses the outcomes of fine-tuning strong models with whole datasets.
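The loss itself is described only as adaptively adjustable; one hypothetical instantiation weights the weak teacher's supervision by its per-sample confidence and otherwise lets the strong model follow its own hardened prediction:

```python
import torch
import torch.nn.functional as F

def weak_to_strong_loss(strong_logits, weak_logits):
    """Hypothetical adaptive weak-to-strong loss (not the paper's exact form):
    trust the weak teacher in proportion to its confidence; elsewhere the
    strong student sharpens its own prediction.
    """
    weak_prob = weak_logits.softmax(dim=1)
    conf = weak_prob.max(dim=1).values.detach()  # teacher confidence in [0, 1]
    ce_weak = F.cross_entropy(strong_logits, weak_prob.argmax(1),
                              reduction="none")
    ce_self = F.cross_entropy(strong_logits, strong_logits.argmax(1).detach(),
                              reduction="none")
    return (conf * ce_weak + (1 - conf) * ce_self).mean()

loss = weak_to_strong_loss(torch.randn(32, 10), torch.randn(32, 10))
```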
arXiv Detail & Related papers (2024-02-06T06:30:34Z)
- Supervised Fine-tuning in turn Improves Visual Foundation Models [74.1760864718129]
A two-stage method, ViSFT (Vision SFT), is proposed to unleash the fine-grained knowledge of vision foundation models.
A vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks.
arXiv Detail & Related papers (2024-01-18T18:58:54Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on CityScapes.
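A minimal sketch of the meta-prompt idea, assuming the learnable prompts query frozen diffusion features through cross-attention (the dimensions and interaction form are assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class MetaPrompts(nn.Module):
    """Learnable prompt embeddings that cross-attend to a frozen diffusion
    feature map and emit task-oriented features for a perception head."""
    def __init__(self, num_prompts=16, dim=768):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, diff_feats):  # [B, H*W, dim] frozen diffusion features
        q = self.prompts.expand(diff_feats.size(0), -1, -1)
        out, _ = self.attn(q, diff_feats, diff_feats)
        return out                  # [B, num_prompts, dim] task features

feats = MetaPrompts()(torch.randn(2, 64 * 64, 768))
```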
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
- Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps [39.00415825387414]
We propose a new approach for semantic correspondence estimation that supplements discriminative features with 3D understanding via a weak geometric spherical prior.
Compared to more involved 3D pipelines, our model only requires weak viewpoint information, and the simplicity of our spherical representation enables us to inject informative geometric priors into the model during training.
We present results on the challenging SPair-71k dataset, where our approach demonstrates that it is capable of distinguishing between symmetric views and repeated parts across many object categories.
arXiv Detail & Related papers (2023-12-20T17:35:24Z)
- Higher-Order Implicit Fairing Networks for 3D Human Pose Estimation [1.1501261942096426]
We introduce a higher-order graph convolutional framework with initial residual connections for 2D-to-3D pose estimation.
Our model is able to capture the long-range dependencies between body joints.
Experiments and ablation studies conducted on two standard benchmarks demonstrate the effectiveness of our model.
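A toy version of one such layer, mixing one- and two-hop neighborhoods over the joint graph and adding an initial residual connection so distant joints can interact (illustrative, not the paper's exact formulation):

```python
import torch
import torch.nn as nn

class HigherOrderGCNLayer(nn.Module):
    """One higher-order graph-convolution layer over body joints: aggregates
    over A (one hop) and A @ A (two hops), plus an 'initial residual' back to
    the first layer's input."""
    def __init__(self, dim, alpha=0.1):
        super().__init__()
        self.w1 = nn.Linear(dim, dim)  # one-hop transform
        self.w2 = nn.Linear(dim, dim)  # two-hop transform
        self.alpha = alpha             # weight of the initial residual

    def forward(self, x, x0, adj):     # x, x0: [J, dim]; adj: [J, J], normalized
        h = adj @ self.w1(x) + (adj @ adj) @ self.w2(x)
        return torch.relu((1 - self.alpha) * h + self.alpha * x0)

J, dim = 17, 128                       # e.g., 17 body joints
x0 = torch.randn(J, dim)
out = HigherOrderGCNLayer(dim)(x0, x0, torch.eye(J))  # identity adjacency to run
```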
arXiv Detail & Related papers (2021-11-01T13:48:55Z)
- One-Shot Object Detection without Fine-Tuning [62.39210447209698]
We introduce a two-stage model consisting of a first-stage Matching-FCOS network and a second-stage Structure-Aware Relation Module.
We also propose novel training strategies that effectively improve detection performance.
Our method exceeds the state-of-the-art one-shot performance consistently on multiple datasets.
arXiv Detail & Related papers (2020-05-08T01:59:23Z)