BootsTAP: Bootstrapped Training for Tracking-Any-Point
- URL: http://arxiv.org/abs/2402.00847v2
- Date: Thu, 23 May 2024 15:00:26 GMT
- Title: BootsTAP: Bootstrapped Training for Tracking-Any-Point
- Authors: Carl Doersch, Pauline Luc, Yi Yang, Dilara Gokay, Skanda Koppula, Ankush Gupta, Joseph Heyward, Ignacio Rocco, Ross Goroshin, João Carreira, Andrew Zisserman
- Abstract summary: Tracking-Any-Point (TAP) formalizes the task of tracking any point on solid surfaces in a video.
We show how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes.
We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin.
- Score: 62.585297341343505
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To endow models with greater understanding of physics and motion, it is useful to enable them to perceive how solid surfaces move and deform in real scenes. This can be formalized as Tracking-Any-Point (TAP), which requires the algorithm to track any point on solid surfaces in a video, potentially densely in space and time. Large-scale ground-truth training data for TAP is only available in simulation, which currently has a limited variety of objects and motion. In this work, we demonstrate how large-scale, unlabeled, uncurated real-world data can improve a TAP model with minimal architectural changes, using a self-supervised student-teacher setup. We demonstrate state-of-the-art performance on the TAP-Vid benchmark, surpassing previous results by a wide margin: for example, TAP-Vid-DAVIS performance improves from 61.3% to 67.4%, and TAP-Vid-Kinetics from 57.2% to 62.5%. For visualizations, see our project webpage at https://bootstap.github.io/
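The abstract's key technical idea is a self-supervised student-teacher (bootstrapping) setup that turns unlabeled, uncurated video into a training signal. The sketch below illustrates one plausible form of such a step, purely as an illustration under stated assumptions: a gradient-free teacher predicts tracks on the raw clip, a student is trained to reproduce those tracks on a spatially and photometrically augmented view of the same clip, and the teacher follows the student as an exponential moving average. The call signature `model(video, queries) -> (tracks, visibility_logits)`, the tensor shapes, and all names here are hypothetical placeholders, not the paper's actual interface.

```python
import torch
import torch.nn.functional as F

def warp_points(xy, affine):
    """Map (x, y) coordinates through a batched 2x3 affine matrix.

    xy:     (B, ..., 2) points
    affine: (B, 2, 3) transform relating the original and augmented views
    """
    ones = torch.ones_like(xy[..., :1])
    return torch.einsum('bij,b...j->b...i', affine, torch.cat([xy, ones], dim=-1))

def bootstrap_step(student, teacher, video, aug_video, queries, affine,
                   optimizer, ema_decay=0.999):
    """One hypothetical student-teacher update on an unlabeled clip.

    video:     (B, T, C, H, W) original clip
    aug_video: the same clip after photometric + spatial (affine) augmentation
    queries:   (B, N, 3) query points as (t, x, y) in original-clip coordinates
    """
    # Teacher produces pseudo-label tracks and visibility on the clean clip.
    with torch.no_grad():
        t_tracks, t_vis_logits = teacher(video, queries)   # (B, T, N, 2), (B, T, N)
        t_vis = torch.sigmoid(t_vis_logits)

    # Student sees the augmented clip, with queries mapped into that view.
    aug_queries = torch.cat(
        [queries[..., :1], warp_points(queries[..., 1:], affine)], dim=-1)
    s_tracks, s_vis_logits = student(aug_video, aug_queries)

    # Consistency: student tracks should match the teacher's tracks mapped through
    # the affine, weighted by where the teacher believes the point is visible.
    target = warp_points(t_tracks, affine)
    track_loss = (t_vis.unsqueeze(-1)
                  * F.huber_loss(s_tracks, target, reduction='none')).mean()
    vis_loss = F.binary_cross_entropy_with_logits(
        s_vis_logits, (t_vis > 0.5).float())
    loss = track_loss + vis_loss

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Teacher follows the student as an exponential moving average.
    with torch.no_grad():
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(ema_decay).add_(p_s.detach(), alpha=1.0 - ema_decay)
    return loss.item()
```

In practice such a self-supervised branch would presumably be combined with continued supervised training on the simulated data mentioned in the abstract; that part is omitted from the sketch.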
Related papers
- Keypoint Aware Masked Image Modelling [0.34530027457862006]
KAMIM improves top-1 linear probing accuracy from 16.12% to 33.97% and fine-tuning accuracy from 76.78% to 77.3% on the ImageNet-1K dataset with a ViT-B trained for the same number of epochs.
We also analyze the representations learned by a ViT-B trained with KAMIM and observe that they behave similarly to those from contrastive learning, with longer attention distances and homogeneous self-attention across layers.
arXiv Detail & Related papers (2024-07-18T19:41:46Z)
- An Experimental Study on Exploring Strong Lightweight Vision Transformers via Masked Image Modeling Pre-Training [51.622652121580394]
Masked image modeling (MIM) pre-training for large-scale vision transformers (ViTs) has enabled promising downstream performance on top of the learned self-supervised ViT features.
In this paper, we question whether the fine-tuning performance of extremely simple lightweight ViTs can also benefit from this pre-training paradigm.
Our pre-training with distillation on pure lightweight ViTs with vanilla/hierarchical design (5.7M/6.5M parameters) achieves 79.4%/78.9% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2024-04-18T14:14:44Z)
- Effective Whole-body Pose Estimation with Two-stages Distillation [52.92064408970796]
Whole-body pose estimation localizes the human body, hand, face, and foot keypoints in an image.
We present a two-stage pose Distillation for Whole-body Pose estimators, named DWPose, to improve their effectiveness and efficiency.
arXiv Detail & Related papers (2023-07-29T03:49:28Z)
- HomE: Homography-Equivariant Video Representation Learning [62.89516761473129]
We propose a novel method for representation learning of multi-view videos.
Our method learns an implicit mapping between different views, culminating in a representation space that maintains the homography relationship between neighboring views.
On action classification, our method obtains 96.4% 3-fold accuracy on the UCF101 dataset, better than most state-of-the-art self-supervised learning methods.
arXiv Detail & Related papers (2023-06-02T15:37:43Z)
- OVRL-V2: A simple state-of-art baseline for ImageNav and ObjectNav [62.32806118504701]
We present a single neural network architecture that achieves state-of-the-art results on both the ImageNav and ObjectNav tasks.
Such general-purpose methods offer advantages of simplicity in design, positive scaling with available compute, and versatile applicability to multiple tasks.
arXiv Detail & Related papers (2023-03-14T11:15:37Z)
- Learning Online for Unified Segmentation and Tracking Models [30.146300294418516]
TrackMLP is a novel meta-learning method optimized to learn from only partial information.
We show that our model achieves state-of-the-art performance and tangible improvement over competing models.
arXiv Detail & Related papers (2021-11-12T23:52:59Z)
- VM-MODNet: Vehicle Motion aware Moving Object Detection for Autonomous Driving [3.6550372593827887]
Moving object Detection (MOD) is a critical task in autonomous driving.
We aim to leverage the vehicle motion information and feed it into the model to have an adaptation mechanism based on ego-motion.
The proposed model using the Vehicle Motion Tensor (VMT) achieves an absolute improvement of 5.6% in mIoU over the baseline architecture.
arXiv Detail & Related papers (2021-04-22T10:46:55Z)
- Self-Supervised Pretraining of 3D Features on any Point-Cloud [40.26575888582241]
We present a simple self-supervised pretraining method that can work with any 3D data without requiring 3D registration.
We evaluate our models on 9 benchmarks for object detection, semantic segmentation, and object classification, where they achieve state-of-the-art results and can outperform supervised pretraining.
arXiv Detail & Related papers (2021-01-07T18:55:21Z)
- Weakly Supervised 3D Object Detection from Lidar Point Cloud [182.67704224113862]
It is laborious to manually label point cloud data for training high-quality 3D object detectors.
This work proposes a weakly supervised approach for 3D object detection, only requiring a small set of weakly annotated scenes.
Using only 500 weakly annotated scenes and 534 precisely labeled vehicle instances, our method achieves 85-95% of the performance of current top-leading, fully supervised detectors.
arXiv Detail & Related papers (2020-07-23T10:12:46Z)