Labelling unlabelled videos from scratch with multi-modal self-supervision
- URL: http://arxiv.org/abs/2006.13662v3
- Date: Sun, 28 Feb 2021 14:45:24 GMT
- Title: Labelling unlabelled videos from scratch with multi-modal self-supervision
- Authors: Yuki M. Asano, Mandela Patrick, Christian Rupprecht, Andrea Vedaldi
- Abstract summary: Unsupervised labelling of a video dataset does not come for free from strong feature encoders.
We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations.
An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels.
- Score: 82.60652426371936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large part of the current success of deep learning lies in the
effectiveness of data -- more precisely, of labelled data. Yet labelling a
dataset with human annotation continues to carry high costs, especially for
videos. While recent methods in the image domain have made it possible to
generate meaningful (pseudo-)labels for unlabelled datasets without
supervision, this development is missing in the video domain, where the
current focus is on learning feature representations. In this work, we a)
show that unsupervised labelling of a video dataset does not come for free
from strong feature encoders and b) propose a novel clustering method that
allows pseudo-labelling of a video dataset without any human annotations, by
leveraging the natural correspondence between the audio and visual
modalities. An extensive analysis shows that the resulting clusters have
high semantic overlap with ground-truth human labels. We further introduce
the first benchmarking results on unsupervised labelling of the common video
datasets Kinetics, Kinetics-Sound, VGG-Sound, and AVE.
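The clustering step can be made concrete with a small sketch. The paper builds on optimal-transport self-labelling, where samples are assigned to clusters under an equal-size constraint computed with a Sinkhorn-Knopp iteration. The snippet below is a minimal illustration of that idea, not the authors' implementation; in particular, fusing the two modalities by averaging their cluster scores is a simplification assumed here for brevity, and all shapes and values are placeholders.

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=50, eps=0.05):
    """Balanced soft cluster assignment for N samples and K clusters.

    scores: (N, K) similarities between sample embeddings and cluster
    prototypes. Alternating row/column normalisation pushes the
    assignment matrix towards equally sized clusters.
    """
    Q = np.exp(scores / eps)  # temperature-scaled affinities
    N, K = Q.shape
    Q /= Q.sum()
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # each cluster gets total mass 1/K
        Q /= K
        Q /= Q.sum(axis=1, keepdims=True)  # each sample gets total mass 1/N
        Q /= N
    return Q * N  # rows now sum to 1: per-sample assignment distributions

rng = np.random.default_rng(0)
visual_scores = rng.normal(size=(512, 64))  # placeholder visual head logits
audio_scores = rng.normal(size=(512, 64))   # placeholder audio head logits

# Assumed fusion: average the two heads so a cluster must explain both
# modalities, then read off hard pseudo-labels.
Q = sinkhorn_knopp((visual_scores + audio_scores) / 2.0)
pseudo_labels = Q.argmax(axis=1)
```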
Related papers
- Query-based Video Summarization with Pseudo Label Supervision [19.229722872058055]
Existing manually labelled datasets for query-based video summarization are costly and thus small.
Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels.
Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-04T22:28:17Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement over state-of-the-art methods in the mean average precision of matching human-annotated highlights.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- Large-Scale Unsupervised Person Re-Identification with Contrastive Learning [17.04597303816259]
Most existing unsupervised and domain adaptation ReID methods utilize only the public datasets in their experiments.
Inspired by the recent progress of large-scale self-supervised image classification using contrastive learning, we propose to learn a ReID representation from large-scale unlabeled surveillance video alone (a generic contrastive-loss sketch follows this list).
arXiv Detail & Related papers (2021-05-17T14:55:08Z)
- Cleaning Label Noise with Clusters for Minimally Supervised Anomaly Detection [26.062659852373653]
We formulate a weakly supervised anomaly detection method that is trained using only video-level labels.
The proposed method yields 78.27% and 84.16% frame-level AUC on the UCF-Crime and ShanghaiTech datasets, respectively (a standard ranking-loss sketch for this weakly supervised setting follows this list).
arXiv Detail & Related papers (2021-04-30T06:03:24Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and that self-supervised models trained on our automatically constructed data achieve downstream performance similar to that of models trained on existing video datasets of similar scale (a loose subset-selection sketch follows this list).
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
- Reducing the Annotation Effort for Video Object Segmentation Datasets [50.893073670389164]
Densely labeling every frame with pixel masks does not scale to large datasets.
We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations.
We obtain the new TAO-VOS benchmark, which we make publicly available at www.vision.rwth-aachen.de/page/taovos.
arXiv Detail & Related papers (2020-11-02T17:34:45Z)
- Semantics through Time: Semi-supervised Segmentation of Aerial Videos with Iterative Label Propagation [16.478668565965243]
This paper makes an important step towards automatic annotation by introducing SegProp.
SegProp is a novel iterative flow-based method, with a direct connection to spectral clustering in space and time.
We introduce Ruralscapes, a new dataset with high-resolution (4K) images and manually annotated dense labels every 50 frames.
SegProp automatically annotates the remaining 98% of frames with an accuracy exceeding 90% (an elementary flow-warping sketch follows this list).
arXiv Detail & Related papers (2020-10-02T15:15:50Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
- Evolving Losses for Unsupervised Video Representation Learning [91.2683362199263]
We present a new method to learn video representations from large-scale unlabeled video data.
The proposed unsupervised representation learning yields a single RGB network that outperforms previous methods.
arXiv Detail & Related papers (2020-02-26T16:56:07Z)
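A few of the entries above lend themselves to short illustrative sketches. For the contrastive ReID entry, the core objective is typically an InfoNCE-style loss. The numpy version below is generic (two augmented views of the same person track form a positive pair, other tracks in the batch serve as negatives) and is not claimed to match that paper's exact formulation.

```python
import numpy as np

def info_nce(view_a, view_b, temperature=0.1):
    """Generic InfoNCE: row i of view_a and row i of view_b are two
    augmented views of the same track; all other rows are negatives."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature               # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # matching pairs on diagonal
```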
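For the label-noise-cleaning entry, training proceeds from video-level labels only. A common baseline objective in this weakly supervised setting is the multiple-instance ranking loss sketched below; it is standard in the anomaly detection literature but is not that paper's specific clustering-based cleaning step.

```python
def mil_ranking_loss(anom_scores, norm_scores, margin=1.0):
    """Video-level supervision: push the highest-scoring segment of an
    anomalous video above the highest-scoring segment of a normal video.

    anom_scores, norm_scores: sequences of per-segment anomaly scores.
    """
    return max(0.0, margin - max(anom_scores) + max(norm_scores))
```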
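The audio-visual curation entry frames dataset construction as subset optimization. As a loose illustration only, one could rank clips by how well their audio and visual embeddings agree and keep the best-matching subset; the paired embeddings, cosine agreement score, and fixed budget k below are all assumptions, not the paper's algorithm.

```python
import numpy as np

def select_by_av_agreement(video_emb, audio_emb, k):
    """Keep the k clips whose audio and visual embeddings agree most."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    agreement = (v * a).sum(axis=1)    # per-clip cosine similarity
    return np.argsort(-agreement)[:k]  # indices of the selected subset
```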
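Finally, for the SegProp entry: the paper's method is iterative, with a direct connection to spectral clustering, which the snippet below does not reproduce. It shows only the elementary step that any flow-based propagation builds on, warping a label map one frame forward along a dense optical-flow field; nearest-neighbour sampling is an arbitrary choice made here.

```python
import numpy as np

def warp_labels(labels, flow):
    """Backward-warp an (H, W) integer label map one frame forward
    using a dense (H, W, 2) flow field given as (dx, dy) per pixel."""
    H, W = labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, H - 1)
    return labels[src_y, src_x]  # nearest-neighbour pull along the flow
```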