Every Shot Counts: Using Exemplars for Repetition Counting in Videos
- URL: http://arxiv.org/abs/2403.18074v2
- Date: Sun, 13 Oct 2024 06:54:24 GMT
- Title: Every Shot Counts: Using Exemplars for Repetition Counting in Videos
- Authors: Saptarshi Sinha, Alexandros Stergiou, Dima Damen
- Abstract summary: We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos.
Our proposed Every Shot Counts (ESCounts) model is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same and different videos.
- Score: 66.1933685445448
- Abstract: Video repetition counting infers the number of repetitions of recurring actions or motion within a video. We propose an exemplar-based approach that discovers visual correspondence of video exemplars across repetitions within target videos. Our proposed Every Shot Counts (ESCounts) model is an attention-based encoder-decoder that encodes videos of varying lengths alongside exemplars from the same and different videos. In training, ESCounts regresses locations of high correspondence to the exemplars within the video. In tandem, our method learns a latent that encodes representations of general repetitive motions, which we use for exemplar-free, zero-shot inference. Extensive experiments over commonly used datasets (RepCount, Countix, and UCFRep) showcase ESCounts obtaining state-of-the-art performance across all three datasets. Detailed ablations further demonstrate the effectiveness of our method.
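As a rough illustration only (the abstract above gives no implementation details), the following minimal PyTorch sketch shows the general shape of an exemplar cross-attention counter: video tokens attend to exemplar tokens, and a head regresses a per-frame density whose sum gives the count. All names, dimensions, and layer choices here are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ExemplarCounter(nn.Module):
    """Hypothetical sketch, NOT the ESCounts implementation: video
    tokens cross-attend to exemplar tokens; a head regresses a
    per-frame density whose sum approximates the repetition count."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.head = nn.Sequential(nn.Linear(d_model, 1), nn.Softplus())

    def forward(self, video_tokens, exemplar_tokens):
        # video_tokens: (B, T, d); exemplar_tokens: (B, E, d)
        v = self.encoder(video_tokens)
        # Each video token queries the exemplars for correspondence.
        corr, _ = self.cross_attn(v, exemplar_tokens, exemplar_tokens)
        density = self.head(corr).squeeze(-1)  # (B, T), non-negative
        return density, density.sum(dim=1)     # density map and count

density, count = ExemplarCounter()(torch.randn(2, 64, 256), torch.randn(2, 8, 256))
```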
Related papers
- OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos [58.5538620720541]
The dataset, OVR, contains annotations for over 72K videos.
OVR is almost an order of magnitude larger than previous datasets for video repetition.
We propose a baseline transformer-based counting model, OVRCounter, that can count repetitions in videos up to 320 frames long.
arXiv Detail & Related papers (2024-07-24T08:22:49Z)
- Skim then Focus: Integrating Contextual and Fine-grained Views for Repetitive Action Counting [87.11995635760108]
The key to action counting is accurately locating each repetitive action within a video.
We propose a dual-branch network, SkimFocusNet, that works in two steps.
arXiv Detail & Related papers (2024-06-13T05:15:52Z)
- Full Resolution Repetition Counting [19.676724611655914]
Given an untrimmed video, repetitive action counting aims to estimate the number of repetitions of class-agnostic actions.
Down-sampling is commonly used in recent state-of-the-art methods, causing some repetitions to be missed.
In this paper, we approach repetitive actions at full temporal resolution by combining offline feature extraction with temporal convolution networks, as sketched below.
arXiv Detail & Related papers (2023-05-23T07:45:56Z)
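The combination described above can be pictured as dilated 1D convolutions running over pre-extracted per-frame features, so the receptive field grows without any temporal down-sampling. A minimal sketch under that assumption (illustrative names and sizes, not the paper's code):

```python
import torch
import torch.nn as nn

class TemporalConvNet(nn.Module):
    """Hypothetical sketch: dilated 1D convolutions over pre-extracted
    per-frame features. Dilation grows the receptive field without
    down-sampling, so every frame keeps a full-resolution prediction."""
    def __init__(self, feat_dim=512, hidden=256, n_layers=4):
        super().__init__()
        layers, in_ch = [], feat_dim
        for i in range(n_layers):
            d = 2 ** i  # dilations 1, 2, 4, 8: padding=d keeps length T
            layers += [nn.Conv1d(in_ch, hidden, 3, padding=d, dilation=d),
                       nn.ReLU()]
            in_ch = hidden
        self.net = nn.Sequential(*layers)
        self.head = nn.Conv1d(hidden, 1, 1)  # per-frame density

    def forward(self, feats):                          # (B, feat_dim, T)
        return self.head(self.net(feats)).squeeze(1)   # (B, T)

feats = torch.randn(2, 512, 300)        # 300 frames of offline features
density = TemporalConvNet()(feats)      # full-resolution per-frame output
```

- Reading-strategy Inspired Visual Representation Learning for Text-to-Video Retrieval [41.420760047617506]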
Cross-modal representation learning projects both videos and sentences into a common space for computing semantic similarity.
Inspired by the reading strategy of humans, we propose a Reading-strategy Inspired Visual Representation Learning (RIVRL) to represent videos.
Our model RIVRL achieves a new state-of-the-art on TGIF and VATEX.
arXiv Detail & Related papers (2022-01-23T03:38:37Z)
- Composable Augmentation Encoding for Video Representation Learning [94.2358972764708]
We focus on contrastive methods for self-supervised video representation learning.
A common paradigm in contrastive learning is to construct positive pairs by sampling different data views for the same instance, with different data instances as negatives.
We propose an 'augmentation aware' contrastive learning framework in which we explicitly provide the model with a sequence of augmentation parameterisations, as sketched below.
We show that our method encodes valuable information about the specified spatial or temporal augmentations and, in doing so, also achieves state-of-the-art performance on a number of video benchmarks.
arXiv Detail & Related papers (2021-04-01T16:48:53Z)
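One way to read the 'augmentation aware' idea is that the parameters of each augmentation (e.g., crop position or temporal shift) are encoded and folded into the features before the contrastive loss. The sketch below is a hypothetical SimCLR-style rendering of that reading, not the paper's implementation; all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AugAwareProjector(nn.Module):
    """Hypothetical sketch: concatenate an encoding of the augmentation
    parameters with the backbone features, so the projection is told
    *how* each view was made before the contrastive loss is applied."""
    def __init__(self, feat_dim=512, aug_dim=4, out_dim=128):
        super().__init__()
        self.aug_enc = nn.Linear(aug_dim, 64)
        self.proj = nn.Sequential(
            nn.Linear(feat_dim + 64, 256), nn.ReLU(), nn.Linear(256, out_dim))

    def forward(self, feats, aug_params):
        a = F.relu(self.aug_enc(aug_params))
        return F.normalize(self.proj(torch.cat([feats, a], dim=-1)), dim=-1)

def info_nce(z1, z2, temperature=0.1):
    # Standard InfoNCE: matching row indices in z1/z2 are positives.
    logits = z1 @ z2.t() / temperature
    return F.cross_entropy(logits, torch.arange(z1.size(0)))

proj = AugAwareProjector()
f1, f2 = torch.randn(8, 512), torch.randn(8, 512)   # two views per clip
p1, p2 = torch.rand(8, 4), torch.rand(8, 4)         # their augmentation params
loss = info_nce(proj(f1, p1), proj(f2, p2))
```

- Self-Supervised MultiModal Versatile Networks [76.19886740072808]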
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
- Counting Out Time: Class Agnostic Video Repetition Counting in the Wild [82.26003709476848]
We present an approach for estimating the period with which an action is repeated in a video.
The crux of the approach lies in constraining the period prediction module to use temporal self-similarity, as sketched below.
We train this model, called RepNet, with a synthetic dataset generated from a large unlabeled video collection.
arXiv Detail & Related papers (2020-06-27T18:00:42Z)
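A temporal self-similarity matrix can be built from per-frame embeddings as row-softmaxed negative pairwise distances; periodic motion then shows up as diagonal stripes in the T x T matrix. A minimal sketch of that construction (illustrative, not RepNet's exact formulation):

```python
import torch

def temporal_self_similarity(emb, temperature=1.0):
    """Illustrative RepNet-style self-similarity matrix: row-softmaxed
    negative pairwise squared distances between per-frame embeddings.
    Repeating motion produces periodic diagonal stripes."""
    # emb: (T, d) per-frame embeddings
    dist = torch.cdist(emb, emb).pow(2)              # (T, T) squared distances
    return torch.softmax(-dist / temperature, dim=1)

ssm = temporal_self_similarity(torch.randn(64, 128))  # (64, 64)
```

- Deep Multimodal Feature Encoding for Video Ordering [34.27175264084648]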
We present a way to learn a compact multimodal feature representation that encodes a video's multiple modalities.
Our model parameters are learned through a proxy task of inferring the temporal ordering of a set of unordered videos in a timeline.
We analyze and evaluate the individual and joint modalities on challenging tasks: (i) inferring the temporal ordering of a set of videos (a sketch of this proxy task follows) and (ii) action recognition.
arXiv Detail & Related papers (2020-04-05T14:02:23Z)
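The ordering proxy task can be rendered as a small classification problem: shuffle n clips and predict which of the n! permutations produced the observed order. A hypothetical sketch (names and sizes assumed, not the paper's code):

```python
import itertools
import torch
import torch.nn as nn

class OrderClassifier(nn.Module):
    """Hypothetical sketch of an ordering proxy task: given features of
    n shuffled clips, classify which of the n! permutations produced
    them. Solving this demands temporally discriminative features."""
    def __init__(self, feat_dim=512, n_clips=3):
        super().__init__()
        self.perms = list(itertools.permutations(range(n_clips)))
        self.clf = nn.Sequential(
            nn.Linear(feat_dim * n_clips, 256), nn.ReLU(),
            nn.Linear(256, len(self.perms)))

    def forward(self, clip_feats):           # (B, n_clips, feat_dim)
        return self.clf(clip_feats.flatten(1))

model = OrderClassifier()
feats = torch.randn(4, 3, 512)               # 4 videos, 3 clips each
logits = model(feats)                         # (4, 6): 3! = 6 orderings
```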