OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
- URL: http://arxiv.org/abs/2407.17085v1
- Date: Wed, 24 Jul 2024 08:22:49 GMT
- Title: OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
- Authors: Debidatta Dwibedi, Yusuf Aytar, Jonathan Tompson, Andrew Zisserman
- Abstract summary: The dataset, OVR, contains annotations for over 72K videos.
OVR is almost an order of magnitude larger than previous datasets for video repetition.
We propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos up to 320 frames long.
- Score: 58.5538620720541
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce a dataset of annotations of temporal repetitions in videos. The dataset, OVR (pronounced as over), contains annotations for over 72K videos, with each annotation specifying the number of repetitions, the start and end time of the repetitions, and also a free-form description of what is repeating. The annotations are provided for videos sourced from Kinetics and Ego4D, and consequently cover both Exo and Ego viewing conditions, with a huge variety of actions and activities. Moreover, OVR is almost an order of magnitude larger than previous datasets for video repetition. We also propose a baseline transformer-based counting model, OVRCounter, that can localise and count repetitions in videos that are up to 320 frames long. The model is trained and evaluated on the OVR dataset, and its performance assessed with and without using text to specify the target class to count. The performance is also compared to a prior repetition counting model. The dataset is available for download at: https://sites.google.com/view/openvocabreps/
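Each OVR annotation bundles a repetition count, a temporal extent, and a free-form description of what is repeating. The sketch below shows one plausible way to represent such a record in Python; the field names and example values are assumptions for illustration only, not the dataset's actual schema (the released files at the project page define the real format).

```python
from dataclasses import dataclass

@dataclass
class RepetitionAnnotation:
    """One OVR-style annotation for a single video (hypothetical field names)."""
    video_id: str       # source clip identifier (Kinetics or Ego4D)
    count: int          # number of repetitions observed
    start_time: float   # start of the repeating segment, in seconds
    end_time: float     # end of the repeating segment, in seconds
    description: str    # free-form text describing what is repeating

# Purely illustrative example record:
example = RepetitionAnnotation(
    video_id="kinetics_abc123",
    count=8,
    start_time=2.4,
    end_time=9.7,
    description="person doing push-ups",
)
```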
Related papers
- Every Shot Counts: Using Exemplars for Repetition Counting in Videos [66.1933685445448]
We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos.
Our proposed Every Shot Counts (ESCounts) model is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same and different videos.
arXiv Detail & Related papers (2024-03-26T19:54:21Z)
- DeVAn: Dense Video Annotation for Video-Language Models [68.70692422636313]
We present a novel human-annotated dataset for evaluating the ability of visual-language models to generate descriptions for real-world video clips.
The dataset contains 8.5K YouTube video clips of 20-60 seconds in duration and covers a wide range of topics and interests.
arXiv Detail & Related papers (2023-10-08T08:02:43Z)
- Spoken Moments: Learning Joint Audio-Visual Representations from Video Descriptions [75.77044856100349]
We present the Spoken Moments dataset of 500k spoken captions each attributed to a unique short video depicting a broad range of different events.
We show that our AMM approach consistently improves our results and that models trained on our Spoken Moments dataset generalize better than those trained on other video-caption datasets.
arXiv Detail & Related papers (2021-05-10T16:30:46Z)
- Counting Out Time: Class Agnostic Video Repetition Counting in the Wild [82.26003709476848]
We present an approach for estimating the period with which an action is repeated in a video.
The crux of the approach lies in constraining the period prediction module to use temporal self-similarity (a minimal sketch of such a self-similarity computation appears after this list).
We train this model, called RepNet, with a synthetic dataset that is generated from a large unlabeled video collection.
arXiv Detail & Related papers (2020-06-27T18:00:42Z)
- TVR: A Large-Scale Dataset for Video-Subtitle Moment Retrieval [111.93601253692165]
TV show Retrieval (TVR) is a new multimodal retrieval dataset.
TVR requires systems to understand both videos and their associated subtitle (dialogue) texts.
The dataset contains 109K queries collected on 21.8K videos from 6 TV shows of diverse genres.
arXiv Detail & Related papers (2020-01-24T17:09:39Z)
- EEV: A Large-Scale Dataset for Studying Evoked Expressions from Video [23.95850953376425]
The Evoked Expressions from Videos (EEV) dataset is a large-scale dataset for studying viewer responses to videos.
Each video is annotated at 6 Hz with 15 continuous evoked expression labels, corresponding to the facial expression of viewers who reacted to the video.
There are 36.7 million annotations of viewer facial reactions to 23,574 videos (1,700 hours).
arXiv Detail & Related papers (2020-01-15T18:59:51Z)
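The temporal self-similarity idea mentioned in the Counting Out Time (RepNet) entry above is straightforward to sketch: given per-frame embeddings, every frame is compared with every other frame, so periodic motion shows up as a striped pattern in the resulting matrix. Below is a minimal, illustrative computation; the embedding source, softmax temperature, and clip length are assumptions for the sketch, not the paper's exact configuration.

```python
import numpy as np

def temporal_self_similarity(embeddings: np.ndarray, temperature: float = 13.5) -> np.ndarray:
    """Compute a T x T temporal self-similarity matrix from per-frame embeddings.

    embeddings: array of shape (T, D), one D-dimensional feature vector per frame.
    Returns a row-wise softmax over negative squared distances, so each row is a
    distribution over "which other frames look like this frame".
    """
    # Pairwise squared Euclidean distances between all frames.
    diffs = embeddings[:, None, :] - embeddings[None, :, :]  # (T, T, D)
    sq_dists = (diffs ** 2).sum(axis=-1)                      # (T, T)

    # Higher similarity -> larger logit; the temperature controls sharpness.
    logits = -sq_dists / temperature
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    sim = np.exp(logits)
    return sim / sim.sum(axis=1, keepdims=True)

# Toy usage: 320 frames of 64-dim features (320 matches OVRCounter's maximum clip length).
tsm = temporal_self_similarity(np.random.randn(320, 64))
print(tsm.shape)  # (320, 320)
```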