Omni-sourced Webly-supervised Learning for Video Recognition
- URL: http://arxiv.org/abs/2003.13042v2
- Date: Tue, 25 Aug 2020 06:36:16 GMT
- Title: Omni-sourced Webly-supervised Learning for Video Recognition
- Authors: Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, Dahua Lin
- Abstract summary: We introduce OmniSource, a framework for leveraging web data to train video recognition models.
Experiments show that by utilizing data from multiple sources and formats, OmniSource is more data-efficient in training.
- Score: 74.3637061856504
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce OmniSource, a novel framework for leveraging web data to train
video recognition models. OmniSource overcomes the barriers between data
formats, such as images, short videos, and long untrimmed videos for
webly-supervised learning. First, data samples with multiple formats, curated
by task-specific data collection and automatically filtered by a teacher model,
are transformed into a unified form. Then a joint-training strategy is proposed
to deal with the domain gaps between multiple data sources and formats in
webly-supervised learning. Several good practices, including data balancing,
resampling, and cross-dataset mixup are adopted in joint training. Experiments
show that by utilizing data from multiple sources and formats, OmniSource is
more data-efficient in training. With only 3.5M images and 800K minutes of
videos crawled from the internet without human labeling (less than 2% of prior
works), our models learned with OmniSource improve the Top-1 accuracy of 2D-
and 3D-ConvNet baseline models by 3.0% and 3.9%, respectively, on the
Kinetics-400 benchmark.
With OmniSource, we establish new records with different pretraining strategies
for video recognition. Our best models achieve 80.4%, 80.5%, and 83.6% Top-1
accuracies on the Kinetics-400 benchmark respectively for
training-from-scratch, ImageNet pre-training and IG-65M pre-training.
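The abstract names two concrete mechanisms: teacher-model filtering of crawled web data, and cross-dataset mixup during joint training. Below is a minimal PyTorch-style sketch of both, written under my own assumptions rather than taken from the paper's released code: `teacher`, `model`, the 0.5 confidence threshold, and the mixup parameter `alpha=0.2` are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def filter_by_teacher(teacher, web_clips, pseudo_labels, threshold=0.5):
    """Keep only web samples whose teacher confidence on their crawled
    (pseudo) label exceeds a threshold -- a simple stand-in for the
    teacher-based filtering step described in the abstract."""
    probs = F.softmax(teacher(web_clips), dim=1)
    conf = probs.gather(1, pseudo_labels.unsqueeze(1)).squeeze(1)
    keep = conf > threshold
    return web_clips[keep], pseudo_labels[keep]

def cross_dataset_mixup_loss(model, target_clips, target_labels,
                             web_clips, web_labels, alpha=0.2):
    """Mix a batch from the target dataset with a batch of filtered web
    clips (already converted to the same clip format) and train on the
    interpolated labels, i.e. mixup applied across data sources."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * target_clips + (1.0 - lam) * web_clips
    logits = model(mixed)
    return lam * F.cross_entropy(logits, target_labels) + \
           (1.0 - lam) * F.cross_entropy(logits, web_labels)
```

The data balancing and resampling practices also mentioned in the abstract would sit in the data-loading pipeline (e.g. per-source sampling weights) and are omitted from this sketch.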
Related papers
- Building an Open-Vocabulary Video CLIP Model with Better Architectures,
Optimization and Data [102.0069667710562]
This paper presents Open-VCLIP++, a framework that adapts CLIP to a strong zero-shot video classifier.
We demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data.
Our approach is evaluated on three widely used action recognition datasets.
arXiv Detail & Related papers (2023-10-08T04:46:43Z) - Identifying Misinformation on YouTube through Transcript Contextual
Analysis with Transformer Models [1.749935196721634]
We introduce a novel methodology for video classification, focusing on the veracity of the content.
We employ advanced machine learning techniques like transfer learning to solve the classification challenge.
We apply the trained models to three datasets: (a) YouTube Vaccine-misinformation related videos, (b) YouTube Pseudoscience videos, and (c) Fake-News dataset.
arXiv Detail & Related papers (2023-07-22T19:59:16Z) - VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking [57.552798046137646]
Video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models.
We successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance.
arXiv Detail & Related papers (2023-03-29T14:28:41Z) - Revisiting Classifier: Transferring Vision-Language Models for Video
Recognition [102.93524173258487]
Transferring knowledge from task-agnostic pre-trained deep models for downstream tasks is an important topic in computer vision research.
In this study, we focus on transferring knowledge for video classification tasks.
We utilize a well-pretrained language model to generate good semantic targets for efficient transfer learning.
arXiv Detail & Related papers (2022-07-04T10:00:47Z) - Jigsaw Clustering for Unsupervised Visual Representation Learning [68.09280490213399]
We propose a new jigsaw clustering pretext task in this paper.
Our method makes use of information from both intra- and inter-images.
It is even comparable to contrastive learning methods when only half of the training batches are used.
arXiv Detail & Related papers (2021-04-01T08:09:26Z) - Spatiotemporal Contrastive Video Representation Learning [87.56145031149869]
We present a self-supervised Contrastive Video Representation Learning (CVRL) method to learn visual representations from unlabeled videos.
Our representations are learned using a contrastive loss, where two augmented clips from the same short video are pulled together in the embedding space.
We study what makes for good data augmentations for video self-supervised learning and find that both spatial and temporal information are crucial.
arXiv Detail & Related papers (2020-08-09T19:58:45Z) - Creating a Large-scale Synthetic Dataset for Human Activity Recognition [0.8250374560598496]
We use 3D rendering tools to generate a synthetic dataset of videos, and show that a classifier trained on these videos can generalise to real videos.
We fine-tune a pre-trained I3D model on our videos, and find that it achieves a high accuracy of 73% on three classes of the HMDB51 dataset.
arXiv Detail & Related papers (2020-07-21T22:20:21Z) - Unified Image and Video Saliency Modeling [21.701431656717112]
We ask: Can image and video saliency modeling be approached via a unified model?
We propose four novel domain adaptation techniques and an improved formulation of learned Gaussian priors.
We integrate these techniques into a simple and lightweight encoder-RNN-decoder-style network, UNISAL, and train it jointly with image and video saliency data.
We evaluate our method on the video saliency datasets DHF1K, Hollywood-2 and UCF-Sports, and the image saliency datasets SALICON and MIT300.
arXiv Detail & Related papers (2020-03-11T18:28:29Z) - Rethinking Zero-shot Video Classification: End-to-end Training for
Realistic Applications [26.955001807330497]
Zero-shot learning (ZSL) trains a model once and generalizes to new tasks whose classes are not present in the training dataset.
We propose the first end-to-end algorithm for ZSL in video classification.
Our training procedure builds on insights from recent video classification literature and uses a trainable 3D CNN to learn the visual features.
arXiv Detail & Related papers (2020-03-03T11:09:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences.