Learning Tracking Representations from Single Point Annotations
- URL: http://arxiv.org/abs/2404.09504v1
- Date: Mon, 15 Apr 2024 06:50:58 GMT
- Title: Learning Tracking Representations from Single Point Annotations
- Authors: Qiangqiang Wu, Antoni B. Chan
- Abstract summary: We propose to learn tracking representations from single point annotations in a weakly supervised manner.
Specifically, we propose a soft contrastive learning framework that incorporates target objectness prior into end-to-end contrastive learning.
- Score: 49.47550029470299
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing deep trackers are typically trained with large-scale video frames with annotated bounding boxes. However, these bounding boxes are expensive and time-consuming to annotate, in particular for large-scale datasets. In this paper, we propose to learn tracking representations from single point annotations (i.e., 4.5x faster to annotate than the traditional bounding box) in a weakly supervised manner. Specifically, we propose a soft contrastive learning (SoCL) framework that incorporates target objectness prior into end-to-end contrastive learning. Our SoCL consists of adaptive positive and negative sample generation, which is memory-efficient and effective for learning tracking representations. We apply the learned representation of SoCL to visual tracking and show that our method can 1) achieve better performance than the fully supervised baseline trained with box annotations under the same annotation time cost; 2) achieve performance comparable to the fully supervised baseline by using the same number of training frames while reducing annotation time cost by 78% and total fees by 85%; 3) be robust to annotation noise.
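The two annotation-time figures in the abstract are consistent with each other, which a short calculation makes visible. The sketch below assumes only the 4.5x speedup stated in the abstract; the function names and the 1000-frame budget are hypothetical, chosen for illustration. The 85% fee reduction cannot be derived from the speedup alone and is not modeled here.

```python
def time_reduction(speedup: float) -> float:
    """Fraction of annotation time saved when each label is `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return 1.0 - 1.0 / speedup


def frames_within_budget(box_frames: int, speedup: float) -> int:
    """Point-annotated frames affordable in the time needed to box-annotate `box_frames`."""
    return int(box_frames * speedup)


POINT_SPEEDUP = 4.5  # from the abstract: point labels are 4.5x faster than boxes

# Same number of frames, cheaper labels: fraction of time saved.
saving = time_reduction(POINT_SPEEDUP)
print(f"time saved at equal frame count: {saving:.1%}")

# Same time budget, more frames (1000 is an illustrative number, not from the paper).
print("point frames per 1000 box frames:", frames_within_budget(1000, POINT_SPEEDUP))
```

With a 4.5x speedup, the saving at an equal frame count rounds to the 78% quoted in result 2), while result 1) corresponds to the equal-budget case, where the same annotation time covers 4.5 times as many point-annotated frames.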
Related papers
- One-bit Supervision for Image Classification: Problem, Solution, and Beyond [114.95815360508395]
This paper presents one-bit supervision, a novel setting of learning with fewer labels, for image classification.
We propose a multi-stage training paradigm and incorporate negative label suppression into an off-the-shelf semi-supervised learning algorithm.
In multiple benchmarks, the learning efficiency of the proposed approach surpasses that using full-bit, semi-supervised supervision.
arXiv Detail & Related papers (2023-11-26T07:39:00Z)
- FOCAL: A Cost-Aware Video Dataset for Active Learning [13.886774655927875]
Annotation-cost refers to the time it takes an annotator to label and quality-assure a given video sequence.
We introduce a set of conformal active learning algorithms that take advantage of the sequential structure of video data.
We show that the best conformal active learning method is cheaper than the best traditional active learning method by 113 hours.
arXiv Detail & Related papers (2023-11-17T15:46:09Z)
- Active Self-Training for Weakly Supervised 3D Scene Semantic Segmentation [17.27850877649498]
We introduce a method for weakly supervised segmentation of 3D scenes that combines self-training and active learning.
We demonstrate that our approach leads to an effective method that provides improvements in scene segmentation over previous works and baselines.
arXiv Detail & Related papers (2022-09-15T06:00:25Z)
- Video Annotation for Visual Tracking via Selection and Refinement [74.08109740917122]
We present a new framework to facilitate bounding box annotations for video sequences.
A temporal assessment network is proposed which is able to capture the temporal coherence of target locations.
A visual-geometry refinement network is also designed to further enhance the selected tracking results.
arXiv Detail & Related papers (2021-08-09T05:56:47Z)
- Learning to Track Instances without Video Annotations [85.9865889886669]
We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
arXiv Detail & Related papers (2021-04-01T06:47:41Z)
- Improving Semantic Segmentation via Self-Training [75.07114899941095]
We show that we can obtain state-of-the-art results using a semi-supervised approach, specifically a self-training paradigm.
We first train a teacher model on labeled data, and then generate pseudo labels on a large set of unlabeled data.
Our robust training framework can digest human-annotated and pseudo labels jointly and achieve top performances on Cityscapes, CamVid and KITTI datasets.
arXiv Detail & Related papers (2020-04-30T17:09:17Z)
- Towards Using Count-level Weak Supervision for Crowd Counting [55.58468947486247]
This paper studies the problem of weakly-supervised crowd counting which learns a model from only a small amount of location-level annotations (fully-supervised) but a large amount of count-level annotations (weakly-supervised).
We devise a simple-yet-effective training strategy, namely Multiple Auxiliary Tasks Training (MATT), to construct regularizers for restricting the freedom of the generated density maps.
arXiv Detail & Related papers (2020-02-29T02:58:36Z)