Comprehensive Instructional Video Analysis: The COIN Dataset and
Performance Evaluation
- URL: http://arxiv.org/abs/2003.09392v1
- Date: Fri, 20 Mar 2020 16:59:44 GMT
- Title: Comprehensive Instructional Video Analysis: The COIN Dataset and
Performance Evaluation
- Authors: Yansong Tang and Jiwen Lu and Jie Zhou
- Abstract summary: We present a large-scale dataset named "COIN" for COmprehensive INstructional video analysis.
The COIN dataset contains 11,827 videos of 180 tasks in 12 domains related to our daily life.
With a newly developed toolbox, all videos are efficiently annotated with a series of step labels and the corresponding temporal boundaries.
- Score: 100.68317848808327
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Thanks to the substantial and explosively increasing number of
instructional videos on the Internet, novices are able to acquire knowledge for
completing various tasks. Over the past decade, growing efforts have been
devoted to investigating the problem of instructional video analysis. However,
most existing datasets in this area are limited in diversity and scale, which
keeps them far from many real-world applications where more diverse activities
occur. To address this, we present a large-scale dataset named "COIN" for
COmprehensive INstructional video analysis. Organized with a hierarchical
structure, the COIN dataset contains 11,827 videos of 180 tasks in 12 domains
(e.g., vehicles, gadgets, etc.) related to our daily life. With a newly
developed toolbox, all videos are efficiently annotated with a series of step
labels and the corresponding temporal boundaries. To provide a benchmark for
instructional video analysis, we evaluate a variety of approaches on the COIN
dataset under five different settings. Furthermore, we exploit two important
characteristics (i.e., task-consistency and ordering-dependency) for localizing
important steps in instructional videos. Accordingly, we propose two simple yet
effective methods, which can be easily plugged into conventional proposal-based
action detection models. We believe the introduction of the COIN dataset will
promote future in-depth research on instructional video analysis in the
community. Our dataset, annotation toolbox and source code are available at
http://coin-dataset.github.io.
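
To give a concrete picture of the kind of annotations described above (per-video step labels with temporal boundaries) and of the task-consistency idea, here is a minimal Python sketch. The file name, field names ("task", "steps", "label", "segment"), and the filtering heuristic are illustrative assumptions, not the official COIN schema or the paper's exact method.

```python
# Minimal sketch (not the official COIN toolbox): load hypothetical
# COIN-style annotations and apply a simple task-consistency filter to
# step proposals. All field names below are assumptions for illustration.
import json
from collections import Counter
from typing import Dict, List, Tuple


def load_annotations(path: str) -> Dict[str, dict]:
    """Load a hypothetical annotation file keyed by video id."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def steps_of(video_ann: dict) -> List[Tuple[str, float, float]]:
    """Return (step_label, start_sec, end_sec) triples for one video."""
    return [(s["label"], s["segment"][0], s["segment"][1])
            for s in video_ann["steps"]]


def task_consistent_proposals(proposals: List[dict],
                              step_to_task: Dict[str, str]) -> List[dict]:
    """Keep only proposals whose step label belongs to the video's dominant
    task -- a simple stand-in for the task-consistency constraint, since all
    steps in one instructional video should come from the same task."""
    tasks = Counter(step_to_task[p["label"]] for p in proposals)
    dominant_task, _ = tasks.most_common(1)[0]
    return [p for p in proposals if step_to_task[p["label"]] == dominant_task]


if __name__ == "__main__":
    ann = load_annotations("coin_annotations.json")  # hypothetical path
    for vid, video_ann in list(ann.items())[:3]:
        print(vid, video_ann.get("task"), steps_of(video_ann)[:2])
```

In this sketch the task-consistency step is a hard filter; in a proposal-based detector it could instead be used to re-score proposals before non-maximum suppression.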
Related papers
- HAVANA: Hierarchical stochastic neighbor embedding for Accelerated Video ANnotAtions [59.71751978599567]
This paper presents a novel annotation pipeline that uses pre-extracted features and dimensionality reduction to accelerate the temporal video annotation process.
We demonstrate significant improvements in annotation effort compared to traditional linear methods, achieving more than a 10x reduction in clicks required for annotating over 12 hours of video.
arXiv Detail & Related papers (2024-09-16T18:15:38Z)
- Towards Student Actions in Classroom Scenes: New Dataset and Baseline [43.268586725768465]
We present a new multi-label student action video (SAV) dataset for complex classroom scenes.
The dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, each labeled with 15 different actions displayed by students in classrooms.
arXiv Detail & Related papers (2024-09-02T03:44:24Z)
- CinePile: A Long Video Question Answering Dataset and Benchmark [55.30860239555001]
We present a novel dataset and benchmark, CinePile, specifically designed for authentic long-form video understanding.
Our comprehensive dataset comprises 305,000 multiple-choice questions (MCQs), covering various visual and multimodal aspects.
We fine-tuned open-source Video-LLMs on the training split and evaluated both open-source and proprietary video-centric LLMs on the test split of our dataset.
arXiv Detail & Related papers (2024-05-14T17:59:02Z)
- A Large-Scale Analysis on Self-Supervised Video Representation Learning [15.205738030787673]
We study five different aspects of self-supervised learning important for videos: 1) dataset size, 2) complexity, 3) data distribution, 4) data noise, and 5) feature analysis.
We present several interesting insights from this study which span across different properties of pretraining and target datasets, pretext-tasks, and model architectures.
We propose an approach that requires a limited amount of training data and outperforms existing state-of-the-art approaches which use 10x pretraining data.
arXiv Detail & Related papers (2023-06-09T16:27:14Z)
- NoisyActions2M: A Multimedia Dataset for Video Understanding from Noisy Labels [33.659146748289444]
We create a benchmark dataset consisting of around 2 million videos with associated user-generated annotations and other meta information.
We show how a network pretrained on the proposed dataset can help against video corruption and label noise in downstream datasets.
arXiv Detail & Related papers (2021-10-13T16:12:18Z)
- VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation [124.02278735049235]
The VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels.
We evaluate various baseline methods with and without large-scale VidL pre-training.
The significant gap between our best model and human performance calls for future study of advanced VidL models.
arXiv Detail & Related papers (2021-06-08T18:34:21Z)
- STEP: Segmenting and Tracking Every Pixel [107.23184053133636]
We present a new benchmark: Segmenting and Tracking Every Pixel (STEP).
Our work is the first that targets this task in a real-world setting that requires dense interpretation in both spatial and temporal domains.
To measure performance, we propose a novel evaluation metric, Segmentation and Tracking Quality (STQ).
arXiv Detail & Related papers (2021-02-23T18:43:02Z)
- VLEngagement: A Dataset of Scientific Video Lectures for Evaluating Population-based Engagement [23.078055803229912]
Video lectures have become one of the primary modalities to impart knowledge to the masses in the current digital age.
There is still an important need for data and research aimed at understanding learner engagement with scientific video lectures.
This paper introduces VLEngagement, a novel dataset that consists of content-based and video-specific features extracted from publicly available scientific video lectures.
arXiv Detail & Related papers (2020-11-02T14:20:19Z)
- TAO: A Large-Scale Benchmark for Tracking Any Object [95.87310116010185]
The Tracking Any Object (TAO) dataset consists of 2,907 high-resolution videos, captured in diverse environments, which are half a minute long on average.
We ask annotators to label objects that move at any point in the video, and give names to them post factum.
Our vocabulary is both significantly larger and qualitatively different from existing tracking datasets.
arXiv Detail & Related papers (2020-05-20T21:07:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.