Few-Shot Action Localization without Knowing Boundaries
- URL: http://arxiv.org/abs/2106.04150v1
- Date: Tue, 8 Jun 2021 07:32:43 GMT
- Title: Few-Shot Action Localization without Knowing Boundaries
- Authors: Ting-Ting Xie, Christos Tzelepis, Fan Fu, Ioannis Patras
- Abstract summary: We show that it is possible to learn to localize actions in untrimmed videos when only one/few trimmed examples of the target action are available at test time.
We propose a network that learns to estimate Temporal Similarity Matrices (TSMs) that model a fine-grained similarity pattern between pairs of videos.
Our method achieves performance comparable to, or better than, state-of-the-art fully-supervised few-shot learning methods.
- Score: 9.959844922120523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning to localize actions in long, cluttered, and untrimmed videos is a
hard task that in the literature has typically been addressed assuming the
availability of large amounts of annotated training samples for each class --
either in a fully-supervised setting, where action boundaries are known, or in
a weakly-supervised setting, where only class labels are known for each video.
In this paper, we go a step further and show that it is possible to learn to
localize actions in untrimmed videos when a) only one/few trimmed examples of
the target action are available at test time, and b) when a large collection of
videos with only class label annotation (some trimmed and some weakly annotated
untrimmed ones) are available for training; with no overlap between the classes
used during training and testing. To do so, we propose a network that learns to
estimate Temporal Similarity Matrices (TSMs) that model a fine-grained
similarity pattern between pairs of videos (trimmed or untrimmed), and uses
them to generate Temporal Class Activation Maps (TCAMs) for seen or unseen
classes. The TCAMs serve as temporal attention mechanisms to extract
video-level representations of untrimmed videos, and to temporally localize
actions at test time. To the best of our knowledge, we are the first to propose
a weakly-supervised, one/few-shot action localization network that can be
trained in an end-to-end fashion. Experimental results on the THUMOS14 and
ActivityNet1.2 datasets show that our method achieves performance comparable
to, or better than, state-of-the-art fully-supervised few-shot learning methods.
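A minimal sketch of this pipeline is given below, under simplifying assumptions that are not taken from the paper: cosine similarity between snippet features for the TSM, max-pooling over the support axis to obtain the TCAM, and a softmax over time as the attention.

```python
import torch
import torch.nn.functional as F

def temporal_similarity_matrix(query_feats, support_feats):
    """TSM sketch: cosine similarity between every snippet of an untrimmed
    query video (T_q x D) and a trimmed support video (T_s x D)."""
    q = F.normalize(query_feats, dim=-1)    # (T_q, D)
    s = F.normalize(support_feats, dim=-1)  # (T_s, D)
    return q @ s.t()                        # (T_q, T_s)

def temporal_class_activation_map(tsm):
    """TCAM sketch: pool the TSM over the support axis so each query snippet
    gets one class-activation score (assumption: max pooling)."""
    return tsm.max(dim=1).values            # (T_q,)

def attended_video_representation(query_feats, tcam):
    """Use the TCAM as temporal attention to build a video-level
    representation of the untrimmed query video."""
    weights = torch.softmax(tcam, dim=0)    # (T_q,)
    return (weights.unsqueeze(-1) * query_feats).sum(dim=0)  # (D,)

# toy usage: 128 query snippets, 16 support snippets, 1024-d snippet features
query = torch.randn(128, 1024)
support = torch.randn(16, 1024)
tsm = temporal_similarity_matrix(query, support)
tcam = temporal_class_activation_map(tsm)
video_repr = attended_video_representation(query, tcam)
# thresholding the TCAM over time gives coarse temporal proposals at test time
proposals = (tcam > tcam.mean()).nonzero().squeeze(-1)
```

The paper's actual similarity function, pooling, and post-processing may differ; the sketch only shows how a pairwise TSM can be turned into a per-snippet attention signal usable both for video-level representations and for localization.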
Related papers
- Open-Vocabulary Spatio-Temporal Action Detection [59.91046192096296]
Open-vocabulary spatio-temporal action detection (OV-STAD) is an important fine-grained video understanding task.
OV-STAD requires training a model on a limited set of base classes with box and label supervision.
To better adapt the holistic video-language model (VLM) to the fine-grained action detection task, we carefully fine-tune it on localized video region-text pairs.
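A rough illustration of such region-text fine-tuning is sketched below; the CLIP-style symmetric contrastive objective and the stand-in linear encoders are assumptions for the sketch, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

# stand-in encoders; in practice a pretrained VLM plays both roles (assumption)
region_encoder = torch.nn.Linear(2048, 512)   # encodes a localized video region
text_encoder = torch.nn.Linear(768, 512)      # encodes an action-label prompt

def region_text_contrastive_loss(region_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched region/text pairs, a common
    objective for aligning visual regions with class prompts."""
    r = F.normalize(region_encoder(region_feats), dim=-1)    # (B, 512)
    t = F.normalize(text_encoder(text_feats), dim=-1)        # (B, 512)
    logits = r @ t.t() / temperature                         # (B, B)
    targets = torch.arange(logits.shape[0])                  # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy batch: 8 localized region features paired with 8 label-prompt embeddings
regions = torch.randn(8, 2048)
prompts = torch.randn(8, 768)
loss = region_text_contrastive_loss(regions, prompts)
loss.backward()
```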
arXiv Detail & Related papers (2024-05-17T14:52:47Z) - Enabling Weakly-Supervised Temporal Action Localization from On-Device
Learning of the Video Stream [5.215681853828831]
We propose an efficient video learning approach to learn from a long, untrimmed streaming video.
To the best of our knowledge, this is the first attempt to learn directly from an on-device, long video stream.
arXiv Detail & Related papers (2022-08-25T13:41:03Z) - Less than Few: Self-Shot Video Instance Segmentation [50.637278655763616]
We propose to automatically learn to find appropriate support videos given a query.
We tackle, for the first time, video instance segmentation in a self-shot (and few-shot) setting.
We provide strong baseline performances that utilize a novel transformer-based model.
arXiv Detail & Related papers (2022-04-19T13:14:43Z) - Unsupervised Pre-training for Temporal Action Localization Tasks [76.01985780118422]
We propose a self-supervised pretext task, coined Pseudo Action Localization (PAL), to Unsupervisedly Pre-train feature encoders for Temporal Action Localization tasks (UP-TAL).
Specifically, we first randomly select temporal regions, each of which contains multiple clips, from one video as pseudo actions and then paste them onto different temporal positions of the other two videos.
The pretext task is to align the features of pasted pseudo action regions from two synthetic videos and maximize the agreement between them.
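A rough sketch of this pretext task follows; the snippet-feature representation, the paste indexing, and the negative-cosine agreement loss are illustrative assumptions rather than the paper's exact alignment objective.

```python
import torch
import torch.nn.functional as F

def paste_pseudo_action(region, target, start):
    """Paste a pseudo-action region (snippet features cut at random from one
    video) into another video at temporal position `start`."""
    out = target.clone()
    out[start:start + region.shape[0]] = region
    return out

def alignment_loss(encoder, video_a, video_b, start_a, start_b, length):
    """Encode both synthetic videos and maximize agreement between the
    features of the pasted region in each (assumption: negative cosine
    similarity of the temporally averaged region features)."""
    fa = encoder(video_a)[start_a:start_a + length].mean(dim=0)
    fb = encoder(video_b)[start_b:start_b + length].mean(dim=0)
    return 1.0 - F.cosine_similarity(fa, fb, dim=0)

# toy usage: snippet features for three videos, one pseudo action of 8 snippets
torch.manual_seed(0)
encoder = torch.nn.Linear(1024, 256)          # stand-in feature encoder
src, vid1, vid2 = (torch.randn(64, 1024) for _ in range(3))
region = src[20:28]                           # randomly selected pseudo action
a = paste_pseudo_action(region, vid1, start=5)
b = paste_pseudo_action(region, vid2, start=40)
loss = alignment_loss(encoder, a, b, start_a=5, start_b=40, length=8)
loss.backward()                               # pre-trains the encoder
```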
arXiv Detail & Related papers (2022-03-25T12:13:43Z) - Learning to Localize Actions from Moments [153.54638582696128]
We introduce a new transfer learning design for learning action localization across a large set of action categories.
We present Action Herald Networks (AherNet), which integrate this design into a one-stage action localization framework.
arXiv Detail & Related papers (2020-08-31T16:03:47Z) - Generalized Few-Shot Video Classification with Video Retrieval and
Feature Generation [132.82884193921535]
We argue that previous methods underestimate the importance of video feature learning and propose a two-stage approach.
We show that this simple baseline approach outperforms prior few-shot video classification methods by over 20 points on existing benchmarks.
We present two novel approaches that yield further improvement.
arXiv Detail & Related papers (2020-07-09T13:05:32Z) - TAEN: Temporal Aware Embedding Network for Few-Shot Action Recognition [10.07962673311661]
We present the Temporal Aware Embedding Network (TAEN) for few-shot action recognition.
We demonstrate the effectiveness of TAEN on two few-shot tasks: video classification and temporal action detection.
By training just a few fully connected layers, we reach results comparable to prior art on both few-shot video classification and temporal detection tasks.
arXiv Detail & Related papers (2020-04-21T16:32:10Z) - Revisiting Few-shot Activity Detection with Class Similarity Control [107.79338380065286]
We present a framework for few-shot temporal activity detection based on proposal regression.
Our model is end-to-end trainable, takes into account the frame rate differences between few-shot activities and untrimmed test videos, and can benefit from additional few-shot examples.
arXiv Detail & Related papers (2020-03-31T22:02:38Z) - Weakly Supervised Temporal Action Localization Using Deep Metric
Learning [12.49814373580862]
We propose a weakly supervised temporal action localization method that only requires video-level action instances as supervision during training.
We jointly optimize a balanced binary cross-entropy loss and a metric loss using a standard backpropagation algorithm.
Our approach improves the current state-of-the-art result for THUMOS14 by 6.5% mAP at IoU threshold 0.5, and achieves competitive performance for ActivityNet1.2.
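A joint objective of this general form might look like the sketch below; the inverse-frequency balancing of the BCE term and the triplet-style metric term are assumptions for illustration, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def balanced_bce_loss(logits, labels):
    """Video-level classification loss with positive/negative balancing
    (assumption: weight positives by inverse frequency within the batch)."""
    pos_weight = (labels == 0).sum() / (labels == 1).sum().clamp(min=1)
    return F.binary_cross_entropy_with_logits(logits, labels.float(),
                                              pos_weight=pos_weight)

def metric_loss(anchor, positive, negative, margin=1.0):
    """Triplet-style metric term pulling same-action embeddings together
    and pushing different-action embeddings apart."""
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

# toy joint objective, optimized with standard backpropagation
logits = torch.randn(16, requires_grad=True)          # video-level class logits
labels = torch.randint(0, 2, (16,))
emb = torch.randn(3, 8, 256, requires_grad=True)      # anchor/positive/negative
loss = balanced_bce_loss(logits, labels) + metric_loss(emb[0], emb[1], emb[2])
loss.backward()
```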
arXiv Detail & Related papers (2020-01-21T22:01:17Z)