NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy
Labels
- URL: http://arxiv.org/abs/2110.06827v1
- Date: Wed, 13 Oct 2021 16:12:18 GMT
- Title: NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy
Labels
- Authors: Mohit Sharma, Raj Patra, Harshal Desai, Shruti Vyas, Yogesh Rawat and
Rajiv Ratn Shah
- Abstract summary: We create a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information.
We show how a network pretrained on the proposed dataset can help against video corruption and label noise in downstream datasets.
- Score: 33.659146748289444
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Deep learning has shown remarkable progress in a wide range of problems.
However, efficient training of such models requires large-scale datasets, and
getting annotations for such datasets can be challenging and costly. In this
work, we explore the use of user-generated freely available labels from web
videos for video understanding. We create a benchmark dataset consisting of
around 2 million videos with associated user-generated annotations and other
meta information. We utilize the collected dataset for action classification
and demonstrate its usefulness with existing small-scale annotated datasets,
UCF101 and HMDB51. We study several loss functions and two pretraining
strategies, simple and self-supervised learning. We also show that a network
pretrained on the proposed dataset improves robustness to video corruption and
label noise in downstream datasets. We present this as a benchmark dataset for
noisy-label learning in video understanding. The dataset, code, and trained models will be
publicly available for future research.
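The abstract mentions studying different loss functions for training on noisy user-generated labels, but does not name them here. As a minimal illustration of the kind of noise-robust objective typically compared in this setting, the sketch below implements the generalized cross-entropy loss (Zhang and Sabuncu, NeurIPS 2018) in PyTorch; the function name, the 25-class toy example, and the value of q are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F


def generalized_cross_entropy(logits, targets, q=0.7):
    """Generalized cross-entropy loss (Zhang & Sabuncu, NeurIPS 2018).

    Interpolates between standard cross-entropy (q -> 0) and the more
    noise-tolerant MAE loss (q = 1), which makes it a common choice when
    the training labels are noisy.

    logits:  (batch, num_classes) raw model outputs
    targets: (batch,) integer class indices (possibly noisy)
    """
    probs = F.softmax(logits, dim=1)
    # Probability the model assigns to the (possibly noisy) target class.
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1).clamp_min(1e-7)
    return ((1.0 - p_y.pow(q)) / q).mean()


# Toy usage: a batch of 4 clip-level predictions over 25 hypothetical action classes.
logits = torch.randn(4, 25)
labels = torch.tensor([3, 7, 7, 12])
print(generalized_cross_entropy(logits, labels).item())
```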
Related papers
- Towards Student Actions in Classroom Scenes: New Dataset and Baseline [43.268586725768465]
We present a new multi-label student action video (SAV) dataset for complex classroom scenes.
The dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, each labeled with 15 different actions displayed by students in classrooms.
arXiv Detail & Related papers (2024-09-02T03:44:24Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information [68.76707843019886]
infoVerse is a universal framework for dataset characterization.
infoVerse captures multidimensional characteristics of datasets by incorporating various model-driven meta-information.
In three real-world applications (data pruning, active learning, and data annotation), the samples chosen on infoVerse space consistently outperform strong baselines.
arXiv Detail & Related papers (2023-05-30T18:12:48Z)
- Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language [38.02396786726476]
We propose to learn multi-modal representations from audio-visual data using cross-modal attention.
In our generalised audio-visual zero-shot learning setting, we include all the training classes in the test-time search space.
Due to the lack of a unified benchmark in this domain, we introduce a (generalised) zero-shot learning benchmark on three audio-visual datasets.
arXiv Detail & Related papers (2022-03-07T18:52:13Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)
- Automatic Curation of Large-Scale Datasets for Audio-Visual Representation Learning [62.47593143542552]
We describe a subset optimization approach for automatic dataset curation.
We demonstrate that our approach finds videos with high audio-visual correspondence and show that self-supervised models trained on our data, despite it being automatically constructed, achieve downstream performance comparable to models trained on existing video datasets of similar scale.
arXiv Detail & Related papers (2021-01-26T14:27:47Z)
- Comprehensive Instructional Video Analysis: The COIN Dataset and Performance Evaluation [100.68317848808327]
We present a large-scale dataset named as "COIN" for COmprehensive INstructional video analysis.
The COIN dataset contains 11,827 videos of 180 tasks in 12 domains related to daily life.
With a newly developed toolbox, all videos are efficiently annotated with a series of step labels and the corresponding temporal boundaries.
arXiv Detail & Related papers (2020-03-20T16:59:44Z)
- VideoSSL: Semi-Supervised Learning for Video Classification [30.348819309923098]
We propose VideoSSL, a semi-supervised learning approach for video classification using convolutional neural networks (CNNs).
To minimize the dependence on a large annotated dataset, our proposed method trains from a small number of labeled examples.
We show that, under the supervision of guiding signals derived from unlabeled examples, a video classification CNN can achieve impressive performance.
arXiv Detail & Related papers (2020-02-29T07:13:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences arising from its use.