The AVA-Kinetics Localized Human Actions Video Dataset
- URL: http://arxiv.org/abs/2005.00214v2
- Date: Wed, 20 May 2020 17:40:28 GMT
- Title: The AVA-Kinetics Localized Human Actions Video Dataset
- Authors: Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman
- Abstract summary: This paper describes the AVA-Kinetics localized human actions video dataset.
The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol.
The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames.
- Score: 124.41706958756049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the AVA-Kinetics localized human actions video dataset.
The dataset is collected by annotating videos from the Kinetics-700 dataset
using the AVA annotation protocol, and extending the original AVA dataset with
these new AVA annotated Kinetics clips. The dataset contains over 230k clips
annotated with the 80 AVA action classes for each of the humans in key-frames.
We describe the annotation process and provide statistics about the new
dataset. We also include a baseline evaluation using the Video Action
Transformer Network on the AVA-Kinetics dataset, demonstrating improved
performance for action classification on the AVA test set. The dataset can be
downloaded from https://research.google.com/ava/
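Because each key-frame carries per-person boxes and AVA action labels, the released annotations can be grouped by key-frame with a few lines of Python. The sketch below assumes the standard AVA CSV layout (video id, key-frame timestamp, normalized box corners, action id, person id); the column names and the example file name are illustrative assumptions, not taken from the paper.

# Minimal sketch: group AVA-style annotation rows by key-frame.
# Assumes the usual AVA CSV column order; adjust if the released files differ.
import csv
from collections import defaultdict

COLUMNS = ["video_id", "timestamp", "x1", "y1", "x2", "y2", "action_id", "person_id"]

def load_annotations(csv_path):
    """Return {(video_id, timestamp): [annotation dicts]} for each key-frame."""
    keyframes = defaultdict(list)
    with open(csv_path, newline="") as f:
        for row in csv.reader(f):
            record = dict(zip(COLUMNS, row))
            box = tuple(float(record[c]) for c in ("x1", "y1", "x2", "y2"))  # normalized to [0, 1]
            keyframes[(record["video_id"], record["timestamp"])].append(
                {"box": box,
                 "action_id": int(record["action_id"]),
                 "person_id": int(record["person_id"])})
    return keyframes

# Example usage (file name is hypothetical):
# anns = load_annotations("ava_kinetics_train.csv")
# actions = {a["action_id"] for boxes in anns.values() for a in boxes}
# print(len(anns), "key-frames,", len(actions), "distinct action classes (at most 80)")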
Related papers
- PHEVA: A Privacy-preserving Human-centric Video Anomaly Detection Dataset [2.473948454680334]
PHEVA safeguards personally identifiable information by removing pixel information and providing only de-identified human annotations.
This study benchmarks state-of-the-art methods on PHEVA using a comprehensive set of metrics, including the 10% Error Rate (10ER).
As the first of its kind, PHEVA bridges the gap between conventional training and real-world deployment by introducing continual learning benchmarks.
arXiv Detail & Related papers (2024-08-26T14:55:23Z)
- OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos [58.5538620720541]
The dataset, OVR, contains annotations for over 72K videos.
OVR is almost an order of magnitude larger than previous datasets for video repetition.
We propose a baseline transformer-based counting model, OVRCounter, that can count repetitions in videos up to 320 frames long.
arXiv Detail & Related papers (2024-07-24T08:22:49Z)
- TK-Planes: Tiered K-Planes with High Dimensional Feature Vectors for Dynamic UAV-based Scenes [58.180556221044235]
We present a new approach to bridge the domain gap between synthetic and real-world data for unmanned aerial vehicle (UAV)-based perception.
Our formulation is designed for dynamic scenes, consisting of small moving objects or human actions.
We evaluate its performance on challenging datasets, including Okutama Action and UG2.
arXiv Detail & Related papers (2024-05-04T21:55:33Z)
- DAM: Dynamic Adapter Merging for Continual Video QA Learning [66.43360542692355]
We present a parameter-efficient method for continual video question-answering (VidQA) learning.
Our method uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, and (iii) enable knowledge sharing across similar dataset domains.
Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.
arXiv Detail & Related papers (2024-03-13T17:53:47Z)
- Learning the What and How of Annotation in Video Object Segmentation [11.012995995497029]
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation.
The traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects in each video frame.
We propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation.
arXiv Detail & Related papers (2023-11-08T00:56:31Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- MITFAS: Mutual Information based Temporal Feature Alignment and Sampling for Aerial Video Action Recognition [59.905048445296906]
We present a novel approach for action recognition in UAV videos.
We use the concept of mutual information to compute and align the regions corresponding to human action or motion in the temporal domain (the standard definition of mutual information is recalled after this list).
In practice, we achieve 18.9% improvement in Top-1 accuracy over current state-of-the-art methods.
arXiv Detail & Related papers (2023-03-05T04:05:17Z)
- A Short Note on the Kinetics-700-2020 Human Action Dataset [0.0]
We describe the 2020 edition of the DeepMind Kinetics human action dataset.
In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes.
arXiv Detail & Related papers (2020-10-21T09:47:09Z)
- Learning Visual Voice Activity Detection with an Automatically Annotated Dataset [20.725871972294236]
Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not.
We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow.
We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD.
arXiv Detail & Related papers (2020-09-23T15:12:24Z)
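For reference on the MITFAS entry above, the standard definition of mutual information between two discrete random variables X and Y is recalled below; this is the general quantity the method builds on, not the paper's specific temporal-alignment objective.

I(X;Y) = \sum_{x}\sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} = H(X) - H(X \mid Y)

Here H(X) is the entropy of X and H(X \mid Y) the conditional entropy; larger values indicate stronger statistical dependence between the two variables.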
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.