Labelling unlabelled videos from scratch with multi-modal self-supervision
- URL: http://arxiv.org/abs/2006.13662v3
- Date: Sun, 28 Feb 2021 14:45:24 GMT
- Title: Labelling unlabelled videos from scratch with multi-modal self-supervision
- Authors: Yuki M. Asano, Mandela Patrick, Christian Rupprecht, Andrea Vedaldi
- Abstract summary: Unsupervised labelling of a video dataset does not come for free from strong feature encoders.
We propose a novel clustering method that allows pseudo-labelling of a video dataset without any human annotations.
An extensive analysis shows that the resulting clusters have high semantic overlap with ground-truth human labels.
- Score: 82.60652426371936
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large part of the current success of deep learning lies in the
effectiveness of data -- more precisely, of labelled data. Yet labelling a
dataset with human annotation continues to carry high costs, especially for
videos. While recent methods in the image domain have made it possible to
generate meaningful (pseudo-)labels for unlabelled datasets without
supervision, this development is missing in the video domain, where the
current focus is on learning feature representations. In this work, we a)
show that unsupervised labelling of a video dataset does not come for free
from strong feature encoders and b) propose a novel clustering method that
allows pseudo-labelling of a video dataset without any human annotations, by
leveraging the natural correspondence between the audio and visual
modalities. An extensive analysis shows that the resulting clusters have
high semantic overlap with ground-truth human labels. We further introduce
the first benchmarking results on unsupervised labelling of the common video
datasets Kinetics, Kinetics-Sound, VGG-Sound, and AVE.
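The clustering step can be made concrete with a small sketch. The paper builds on optimal-transport self-labelling, where samples are assigned to clusters under an equal-size constraint computed with a Sinkhorn-Knopp iteration. The snippet below is a minimal illustration of that idea, not the authors' implementation; in particular, fusing the two modalities by averaging their cluster scores is a simplification assumed here for brevity, and all shapes and values are placeholders.

```python
import numpy as np

def sinkhorn_knopp(scores, n_iters=50, eps=0.05):
    """Balanced soft cluster assignment for N samples and K clusters.

    scores: (N, K) similarities between sample embeddings and cluster
    prototypes. Alternating row/column normalisation pushes the
    assignment matrix towards equally sized clusters.
    """
    Q = np.exp(scores / eps)  # temperature-scaled affinities
    N, K = Q.shape
    Q /= Q.sum()
    for _ in range(n_iters):
        Q /= Q.sum(axis=0, keepdims=True)  # each cluster gets total mass 1/K
        Q /= K
        Q /= Q.sum(axis=1, keepdims=True)  # each sample gets total mass 1/N
        Q /= N
    return Q * N  # rows now sum to 1: per-sample assignment distributions

rng = np.random.default_rng(0)
visual_scores = rng.normal(size=(512, 64))  # placeholder visual head logits
audio_scores = rng.normal(size=(512, 64))   # placeholder audio head logits

# Assumed fusion: average the two heads so a cluster must explain both
# modalities, then read off hard pseudo-labels.
Q = sinkhorn_knopp((visual_scores + audio_scores) / 2.0)
pseudo_labels = Q.argmax(axis=1)
```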
Related papers
- Query-based Video Summarization with Pseudo Label Supervision [19.229722872058055]
Existing manually labelled datasets for query-based video summarization are costly and thus small.
Self-supervision can address the data sparsity challenge by using a pretext task and defining a method to acquire extra data with pseudo labels.
Experimental results show that the proposed video summarization algorithm achieves state-of-the-art performance.
arXiv Detail & Related papers (2023-07-04T22:28:17Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement over state-of-the-art methods in the mean average precision of matching human-annotated highlights.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- Large-Scale Unsupervised Person Re-Identification with Contrastive Learning [17.04597303816259]
Most existing unsupervised and domain adaptation ReID methods utilize only the public datasets in their experiments.
Inspired by the recent progress of large-scale self-supervised image classification using contrastive learning, we propose to learn a ReID representation from large-scale unlabeled surveillance video alone (a generic contrastive-loss sketch follows this list).
arXiv Detail & Related papers (2021-05-17T14:55:08Z)
- Cleaning Label Noise with Clusters for Minimally Supervised Anomaly Detection [26.062659852373653]
We formulate a weakly supervised anomaly detection method that is trained using only video-level labels.
The proposed method yields 78.27% and 84.16% frame-level AUC on the UCF-Crime and ShanghaiTech datasets, respectively (a standard ranking-loss sketch for this weakly supervised setting follows this list).
arXiv Detail & Related papers (2021-04-30T06:03:24Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence, and that self-supervised models trained on our automatically constructed data achieve downstream performance similar to that of models trained on existing video datasets of similar scale (a loose subset-selection sketch follows this list).
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
- Reducing the Annotation Effort for Video Object Segmentation Datasets [50.893073670389164]
Densely labeling every frame with pixel masks does not scale to large datasets.
We use a deep convolutional network to automatically create pseudo-labels on a pixel level from much cheaper bounding box annotations.
We obtain the new TAO-VOS benchmark, which we make publicly available at www.vision.rwth-aachen.de/page/taovos.
arXiv Detail & Related papers (2020-11-02T17:34:45Z)
- Semantics through Time: Semi-supervised Segmentation of Aerial Videos with Iterative Label Propagation [16.478668565965243]
This paper makes an important step towards automatic annotation by introducing SegProp.
SegProp is a novel iterative flow-based method, with a direct connection to spectral clustering in space and time.
We introduce Ruralscapes, a new dataset with high-resolution (4K) images and manually annotated dense labels every 50 frames.
SegProp automatically annotates the remaining 98% of frames with an accuracy exceeding 90% (an elementary flow-warping sketch follows this list).
arXiv Detail & Related papers (2020-10-02T15:15:50Z)
- Adversarial Knowledge Transfer from Unlabeled Data [62.97253639100014]
We present a novel Adversarial Knowledge Transfer framework for transferring knowledge from internet-scale unlabeled data to improve the performance of a classifier.
An important novel aspect of our method is that the unlabeled source data can be of different classes from those of the labeled target data, and there is no need to define a separate pretext task.
arXiv Detail & Related papers (2020-08-13T08:04:27Z)
- Evolving Losses for Unsupervised Video Representation Learning [91.2683362199263]
We present a new method to learn video representations from large-scale unlabeled video data.
The proposed unsupervised representation learning yields a single RGB network that outperforms previous methods.
arXiv Detail & Related papers (2020-02-26T16:56:07Z)
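A few of the entries above lend themselves to short illustrative sketches. For the contrastive ReID entry, the core objective is typically an InfoNCE-style loss. The numpy version below is generic (two augmented views of the same person track form a positive pair, other tracks in the batch serve as negatives) and is not claimed to match that paper's exact formulation.

```python
import numpy as np

def info_nce(view_a, view_b, temperature=0.1):
    """Generic InfoNCE: row i of view_a and row i of view_b are two
    augmented views of the same track; all other rows are negatives."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature               # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # matching pairs on diagonal
```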
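For the label-noise-cleaning entry, training proceeds from video-level labels only. A common baseline objective in this weakly supervised setting is the multiple-instance ranking loss sketched below; it is standard in the anomaly detection literature but is not that paper's specific clustering-based cleaning step.

```python
def mil_ranking_loss(anom_scores, norm_scores, margin=1.0):
    """Video-level supervision: push the highest-scoring segment of an
    anomalous video above the highest-scoring segment of a normal video.

    anom_scores, norm_scores: sequences of per-segment anomaly scores.
    """
    return max(0.0, margin - max(anom_scores) + max(norm_scores))
```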
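The audio-visual curation entry frames dataset construction as subset optimization. As a loose illustration only, one could rank clips by how well their audio and visual embeddings agree and keep the best-matching subset; the paired embeddings, cosine agreement score, and fixed budget k below are all assumptions, not the paper's algorithm.

```python
import numpy as np

def select_by_av_agreement(video_emb, audio_emb, k):
    """Keep the k clips whose audio and visual embeddings agree most."""
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    agreement = (v * a).sum(axis=1)    # per-clip cosine similarity
    return np.argsort(-agreement)[:k]  # indices of the selected subset
```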
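Finally, for the SegProp entry: the paper's method is iterative, with a direct connection to spectral clustering, which the snippet below does not reproduce. It shows only the elementary step that any flow-based propagation builds on, warping a label map one frame forward along a dense optical-flow field; nearest-neighbour sampling is an arbitrary choice made here.

```python
import numpy as np

def warp_labels(labels, flow):
    """Backward-warp an (H, W) integer label map one frame forward
    using a dense (H, W, 2) flow field given as (dx, dy) per pixel."""
    H, W = labels.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.rint(xs - flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys - flow[..., 1]).astype(int), 0, H - 1)
    return labels[src_y, src_x]  # nearest-neighbour pull along the flow
```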