The AVA-Kinetics Localized Human Actions Video Dataset
- URL: http://arxiv.org/abs/2005.00214v2
- Date: Wed, 20 May 2020 17:40:28 GMT
- Title: The AVA-Kinetics Localized Human Actions Video Dataset
- Authors: Ang Li, Meghana Thotakuri, David A. Ross, João Carreira, Alexander Vostrikov, Andrew Zisserman
- Abstract summary: This paper describes the AVA-Kinetics localized human actions video dataset.
The dataset is collected by annotating videos from the Kinetics-700 dataset using the AVA annotation protocol.
The dataset contains over 230k clips annotated with the 80 AVA action classes for each of the humans in key-frames.
- Score: 124.41706958756049
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper describes the AVA-Kinetics localized human actions video dataset.
The dataset is collected by annotating videos from the Kinetics-700 dataset
using the AVA annotation protocol, and extending the original AVA dataset with
these new AVA annotated Kinetics clips. The dataset contains over 230k clips
annotated with the 80 AVA action classes for each of the humans in key-frames.
We describe the annotation process and provide statistics about the new
dataset. We also include a baseline evaluation using the Video Action
Transformer Network on the AVA-Kinetics dataset, demonstrating improved
performance for action classification on the AVA test set. The dataset can be
downloaded from https://research.google.com/ava/
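As background for working with the dataset, AVA-style annotations are distributed as CSV rows giving a video id, key-frame timestamp, box corners normalized to [0, 1], an action id, and a person id. The sketch below parses such rows; the sample rows and helper are illustrative, so consult the official AVA download page for the exact schema.

```python
import csv
import io

# AVA-style annotation rows (illustrative sample): video_id,
# middle_frame_timestamp, x1, y1, x2, y2, action_id, person_id.
sample = """-5KQ66BBWC4,0902,0.077,0.151,0.283,0.811,80,1
-5KQ66BBWC4,0902,0.226,0.032,0.366,0.497,12,0
"""

def parse_ava_rows(text):
    """Parse AVA-style CSV annotation rows into dictionaries."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        video_id, ts, x1, y1, x2, y2, action, person = rec
        rows.append({
            "video_id": video_id,
            "timestamp": int(ts),                        # seconds into the video
            "box": tuple(map(float, (x1, y1, x2, y2))),  # normalized corners
            "action_id": int(action),                    # one of the 80 AVA classes
            "person_id": int(person),
        })
    return rows

rows = parse_ava_rows(sample)
print(len(rows), rows[0]["action_id"])  # 2 80
```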
Related papers
- OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos [58.5538620720541]
The dataset, OVR, contains annotations for over 72K videos.
OVR is almost an order of magnitude larger than previous datasets for video repetition.
We propose a baseline transformer-based counting model, OVRCounter, that can count repetitions in videos up to 320 frames long.
arXiv Detail & Related papers (2024-07-24T08:22:49Z)
- DAM: Dynamic Adapter Merging for Continual Video QA Learning [66.43360542692355]
We present a parameter-efficient method for continual video question-answering (VidQA) learning.
Our method uses the proposed Dynamic Adapter Merging to (i) mitigate catastrophic forgetting, (ii) enable efficient adaptation to continually arriving datasets, and (iii) enable knowledge sharing across similar dataset domains.
Our DAM model outperforms prior state-of-the-art continual learning approaches by 9.1% while exhibiting 1.9% less forgetting on 6 VidQA datasets spanning various domains.
arXiv Detail & Related papers (2024-03-13T17:53:47Z)
- Learning the What and How of Annotation in Video Object Segmentation [11.012995995497029]
Video Object Segmentation (VOS) is crucial for several applications, from video editing to video data generation.
The traditional way of annotating objects requires humans to draw detailed segmentation masks on the target objects at each video frame.
We propose EVA-VOS, a human-in-the-loop annotation framework for video object segmentation.
arXiv Detail & Related papers (2023-11-08T00:56:31Z)
- VEATIC: Video-based Emotion and Affect Tracking in Context Dataset [34.77364955121413]
We introduce a new large-scale dataset, the Video-based Emotion and Affect Tracking in Context dataset (VEATIC).
VEATIC has 124 video clips from Hollywood movies, documentaries, and home videos with continuous valence and arousal ratings of each frame via real-time annotation.
Along with the dataset, we propose a new computer vision task to infer the affect of the selected character via both context and character information in each video frame.
arXiv Detail & Related papers (2023-09-13T06:31:35Z)
- MITFAS: Mutual Information based Temporal Feature Alignment and Sampling for Aerial Video Action Recognition [59.905048445296906]
We present a novel approach for action recognition in UAV videos.
We use the concept of mutual information to compute and align the regions corresponding to human action or motion in the temporal domain.
In practice, we achieve 18.9% improvement in Top-1 accuracy over current state-of-the-art methods.
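MITFAS's alignment and sampling procedure is detailed in the paper itself; as a generic illustration of the quantity it builds on, mutual information between two discretized 1-D feature signals can be estimated from a joint histogram. The bin count and variable names below are arbitrary choices, not taken from the paper.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Estimate I(X; Y) in nats for two 1-D signals via a joint histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()                 # joint distribution over bins
    px = pxy.sum(axis=1, keepdims=True)       # marginal of x
    py = pxy.sum(axis=0, keepdims=True)       # marginal of y
    nz = pxy > 0                              # avoid log(0) on empty bins
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

rng = np.random.default_rng(0)
a = rng.normal(size=1000)
b = rng.normal(size=1000)
# A signal shares far more information with itself than with independent noise.
print(mutual_information(a, a), mutual_information(a, b))
```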
arXiv Detail & Related papers (2023-03-05T04:05:17Z)
- HighlightMe: Detecting Highlights from Human-Centric Videos [62.265410865423]
We present a domain- and user-preference-agnostic approach to detect highlightable excerpts from human-centric videos.
We use an autoencoder network equipped with spatial-temporal graph convolutions to detect human activities and interactions.
We observe a 4-12% improvement in the mean average precision of matching the human-annotated highlights over state-of-the-art methods.
arXiv Detail & Related papers (2021-10-05T01:18:15Z)
- The Role of the Input in Natural Language Video Description [60.03448250024277]
Natural Language Video Description (NLVD) has recently received strong interest in the Computer Vision, Natural Language Processing, Multimedia, and Autonomous Robotics communities.
This work presents an extensive study of the role of the visual input, evaluated with respect to overall NLP performance.
A t-SNE based analysis is proposed to evaluate the effects of the considered transformations on the overall visual data distribution.
arXiv Detail & Related papers (2021-02-09T19:00:35Z)
- A Short Note on the Kinetics-700-2020 Human Action Dataset [0.0]
We describe the 2020 edition of the DeepMind Kinetics human action dataset.
In this new version, there are at least 700 video clips from different YouTube videos for each of the 700 classes.
arXiv Detail & Related papers (2020-10-21T09:47:09Z)
- Learning Visual Voice Activity Detection with an Automatically Annotated Dataset [20.725871972294236]
Visual voice activity detection (V-VAD) uses visual features to predict whether a person is speaking or not.
We propose two deep architectures for V-VAD, one based on facial landmarks and one based on optical flow.
We introduce a novel methodology to automatically create and annotate very large datasets in-the-wild -- WildVVAD.
arXiv Detail & Related papers (2020-09-23T15:12:24Z)
- q-VAE for Disentangled Representation Learning and Latent Dynamical Systems [8.071506311915396]
A variational autoencoder (VAE) derived from Tsallis statistics, called q-VAE, is proposed.
In the proposed method, a standard VAE is employed to statistically extract latent space hidden in sampled data.
arXiv Detail & Related papers (2020-03-04T01:38:39Z)
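For background on the entry above: the q-VAE deforms the standard VAE objective using Tsallis statistics. As a point of reference only (this is the standard VAE, not the q-VAE), the Gaussian KL regularizer in the usual objective has the closed form sketched below.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, exp(log_var)) || N(0, 1) ), summed over latent dimensions.

    Closed form per dimension: 0.5 * (sigma^2 + mu^2 - 1 - log sigma^2).
    """
    return 0.5 * float(np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var))

# The divergence vanishes when the approximate posterior equals the prior.
print(gaussian_kl(np.zeros(4), np.zeros(4)))  # 0.0
```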
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.