Learning to Track Instances without Video Annotations
- URL: http://arxiv.org/abs/2104.00287v1
- Date: Thu, 1 Apr 2021 06:47:41 GMT
- Title: Learning to Track Instances without Video Annotations
- Authors: Yang Fu, Sifei Liu, Umar Iqbal, Shalini De Mello, Humphrey Shi, Jan Kautz
- Abstract summary: We introduce a novel semi-supervised framework by learning instance tracking networks with only a labeled image dataset and unlabeled video sequences.
We show that even when only trained with images, the learned feature representation is robust to instance appearance variations.
In addition, we integrate this module into single-stage instance segmentation and pose estimation frameworks.
- Score: 85.9865889886669
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Tracking segmentation masks of multiple instances has been intensively
studied, but still faces two fundamental challenges: 1) the requirement of
large-scale, frame-wise annotation, and 2) the complexity of two-stage
approaches. To resolve these challenges, we introduce a novel semi-supervised
framework by learning instance tracking networks with only a labeled image
dataset and unlabeled video sequences. With an instance contrastive objective,
we learn an embedding to discriminate each instance from the others. We show
that even when only trained with images, the learned feature representation is
robust to instance appearance variations, and is thus able to track objects
steadily across frames. We further enhance the tracking capability of the
embedding by learning correspondence from unlabeled videos in a self-supervised
manner. In addition, we integrate this module into single-stage instance
segmentation and pose estimation frameworks, which significantly reduce the
computational complexity of tracking compared to two-stage networks. We conduct
experiments on the YouTube-VIS and PoseTrack datasets. Without any video
annotation efforts, our proposed method can achieve comparable or even better
performance than most fully-supervised methods.
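The abstract does not spell out the instance contrastive objective, so the following is only a minimal sketch of the general idea, assuming an InfoNCE-style formulation over cosine similarities. It is not the authors' implementation. The function names (`instance_contrastive_loss`, `match_instances`), the temperature value, and the greedy matching rule are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)  # epsilon guards against zero vectors


def instance_contrastive_loss(embeddings, instance_ids, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    For each anchor embedding, embeddings sharing its instance id are
    positives and all other embeddings are negatives.
    """
    n = len(embeddings)
    losses = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        sims = [cosine(embeddings[i], embeddings[j]) / temperature for j in others]
        positives = [s for s, j in zip(sims, others)
                     if instance_ids[j] == instance_ids[i]]
        if not positives:
            continue  # anchor has no positive pair; skip it
        # Numerically stable log-sum-exp over all candidates.
        top = max(sims)
        log_denom = top + math.log(sum(math.exp(s - top) for s in sims))
        losses.append(-sum(p - log_denom for p in positives) / len(positives))
    return sum(losses) / len(losses) if losses else 0.0


def match_instances(prev_embeddings, curr_embeddings):
    """Greedy association: each current instance takes the index of the
    most similar instance in the previous frame (hypothetical rule)."""
    return [
        max(range(len(prev_embeddings)), key=lambda j: cosine(c, prev_embeddings[j]))
        for c in curr_embeddings
    ]


if __name__ == "__main__":
    feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
    print(instance_contrastive_loss(feats, [0, 0, 1, 1]))
    print(match_instances([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.8, 0.2]]))
```

With an embedding that separates instances well, the loss is close to zero, and associating instances across frames reduces to a nearest-neighbor lookup in embedding space, which is why a discriminative embedding alone can already support tracking.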
Related papers
- Integrated Image-Text Based on Semi-supervised Learning for Small Sample Instance Segmentation [1.3157419797035321]
The article proposes a novel small sample instance segmentation solution from the perspective of maximizing the utilization of existing information.
First, it helps the model fully utilize unlabeled data by learning to generate pseudo labels, increasing the number of available samples.
Second, by integrating the features of text and image, more accurate classification results can be obtained.
arXiv Detail & Related papers (2024-10-21T14:44:08Z)
- Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method achieves significant improvements on the MOT17 and MOT20 datasets while reaching state-of-the-art performance on the DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z)
- Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration [13.284951215948052]
We present a novel weakly supervised framework with Spatio-Temporal Collaboration for instance Segmentation in videos.
Our method achieves strong performance and even outperforms fully supervised TrackR-CNN and MaskTrack R-CNN.
arXiv Detail & Related papers (2022-12-15T02:44:13Z)
- Online Deep Clustering with Video Track Consistency [85.8868194550978]
We propose an unsupervised clustering-based approach to learn visual features from video object tracks.
We show that exploiting an unsupervised, class-agnostic, yet noisy track generator yields better accuracy than relying on costly and precise track annotations.
arXiv Detail & Related papers (2022-06-07T08:11:00Z)
- Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation [83.13610762450703]
Video instance segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence.
We introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at pixel-level granularity, instead of the typical region-proposal-based approach.
Our method provides competitive results on the YouTube-VIS and DAVIS-19 datasets, and has the minimum run-time compared to other contemporary state-of-the-art methods.
arXiv Detail & Related papers (2022-04-22T15:32:46Z)
- Semi-TCL: Semi-Supervised Track Contrastive Representation Learning [40.31083437957288]
We design a new instance-to-track matching objective to learn appearance embedding.
It compares a candidate detection to the embedding of the tracks persisted in the tracker.
We implement this learning objective in a unified form following the spirit of contrastive loss.
arXiv Detail & Related papers (2021-07-06T05:23:30Z)
- Crop-Transform-Paste: Self-Supervised Learning for Visual Tracking [137.26381337333552]
In this work, we develop the Crop-Transform-Paste operation, which is able to synthesize sufficient training data.
Since the object state is known in all synthesized data, existing deep trackers can be trained in routine ways without human annotation.
arXiv Detail & Related papers (2021-06-21T07:40:34Z)
- ASCNet: Self-supervised Video Representation Learning with Appearance-Speed Consistency [62.38914747727636]
We study self-supervised video representation learning, which is a challenging task due to 1) a lack of labels for explicit supervision and 2) unstructured and noisy visual information.
Existing methods mainly use contrastive loss with video clips as the instances and learn visual representation by discriminating instances from each other.
In this paper, we observe that the consistency between positive samples is the key to learning robust video representations.
arXiv Detail & Related papers (2021-06-04T08:44:50Z)
- Train a One-Million-Way Instance Classifier for Unsupervised Visual Representation Learning [45.510042484456854]
This paper presents a simple unsupervised visual representation learning method with a pretext task of discriminating all images in a dataset using a parametric, instance-level computation.
The overall framework is a replica of a supervised classification model, where semantic classes (e.g., dog, bird, and ship) are replaced by instance IDs.
Scaling up the classification task from thousands of semantic labels to millions of instance labels brings specific challenges, including 1) the large-scale softmax classifier; 2) slow convergence due to the infrequent visiting of instance samples; and 3) the massive number of negative classes, which can be noisy.
arXiv Detail & Related papers (2021-02-09T14:44:18Z)
This list is automatically generated from the titles and abstracts of the papers in this site.